Equipment Troubleshooting Tip: “Tolerable Failure”
One of the root causes on the front of the Root Cause Tree® under “Equipment Difficulty” is “tolerable failure.” This root cause is available to the incident investigator when the evidence on the SnapCharT® indicates that a relatively unimportant piece of equipment has failed, and the cost of maintaining the equipment exceeds the value of the gear itself. Let’s take a look at what goes into this determination.
The Dictionary® discusses the decision to select this root cause. It assumes you have analyzed the particular equipment failure, and you have decided that it is OK for this gear to just break. There are no real corrective actions required, and no human error involved. You accept this failure as low risk, low cost, and not worth the effort to track its performance or perform some periodic maintenance. There is nothing wrong with selecting this root cause, but there are some factors you should consider:
1. You must ensure you have properly weighed the cost and risk of randomly losing this piece of equipment. For example, the “cost” of losing this gear must include:
– the effect on system production when the gear is lost
– time and effort expended to troubleshoot and find the failure
– man-hours required to replace the equipment
– the cost associated with retesting the new installation
– the increased frequency of failure due to infrequent monitoring
– the physical cost of the replacement
2. You must then balance this against the cost of monitoring, maintaining, and analyzing the equipment. This cost may include:
– the cost of predictive maintenance monitoring equipment
– man-hours expended on preventive and predictive maintenance programs
– man-hours and effort expended on root cause failure analyses (when failures occur)
3. Multiple failures of equipment determined to be “tolerable” may be indicative of other more insidious system problems. Even tolerable failures may need to be thoroughly analyzed to determine if system performance is degrading over time, possibly leading to more serious failures or equipment reliability problems in the future.
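To make the balance in factors 1 and 2 concrete, here is a minimal Python sketch of the arithmetic. It is only an illustration, not part of the TapRooT® method: the class names, the function name, the idea of comparing annual costs, and every dollar figure are assumptions made up for this example. The "increased frequency of failure" from factor 1 shows up as the gap between the failure rate without monitoring and the failure rate with it.

```python
from dataclasses import dataclass


@dataclass
class FailureCost:
    """Cost of one unplanned failure (factor 1), all in the same currency."""
    lost_production: float
    troubleshooting: float
    replacement_labor: float
    retesting: float
    replacement_part: float

    def per_failure(self) -> float:
        # Sum every cost incurred each time the equipment breaks.
        return (self.lost_production + self.troubleshooting
                + self.replacement_labor + self.retesting
                + self.replacement_part)


@dataclass
class PreventionCost:
    """Annual cost of monitoring, maintenance, and analysis (factor 2)."""
    monitoring_equipment: float
    maintenance_labor: float
    analysis_labor: float

    def per_year(self) -> float:
        return (self.monitoring_equipment + self.maintenance_labor
                + self.analysis_labor)


def run_to_failure_is_cheaper(failure: FailureCost,
                              failures_per_year_unmonitored: float,
                              prevention: PreventionCost,
                              failures_per_year_monitored: float) -> bool:
    """Compare annual cost of letting the gear break vs. maintaining it."""
    cost_unmonitored = failure.per_failure() * failures_per_year_unmonitored
    # Maintained equipment can still fail, just (presumably) less often.
    cost_monitored = (prevention.per_year()
                      + failure.per_failure() * failures_per_year_monitored)
    return cost_unmonitored < cost_monitored


# Hypothetical numbers for a small, non-critical component.
failure = FailureCost(lost_production=200, troubleshooting=50,
                      replacement_labor=80, retesting=20,
                      replacement_part=150)
prevention = PreventionCost(monitoring_equipment=1500,
                            maintenance_labor=600, analysis_labor=300)

print(run_to_failure_is_cheaper(failure, 2.0, prevention, 0.5))
```

With these made-up figures, two unmonitored failures a year cost less than the maintenance program, so the comparison favors accepting the failure. A sketch like this covers cost only. It says nothing about risk, and it will not reveal the pattern of repeat failures described in factor 3, so treat the result as one input to the discussion rather than the answer.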
Bottom line: The determination that the failure of a particular piece of equipment is “tolerable” is not necessarily a quick and easy process. It should be well thought-out and thoroughly analyzed, probably by a relatively senior maintenance manager. During an investigation, if you think you may be selecting “tolerable failure” as your root cause, I recommend you have a detailed discussion with management to ensure you are all in agreement.