How Equipment Troubleshooting and Root Cause Analysis Work Together
Troubleshooting and RCA Come Together
Back in 1997, I had reliability engineers tell me that root cause analysis didn’t work. They tried it, but they didn’t find fixes that would stop repeat equipment failures. I didn’t understand their observation. It seemed to me that they should be able to get to the root causes of these equipment issues just like root cause analysis of any other problem.
I decided to work with an equipment reliability expert and see if I could understand their problems and come up with a solution. That led me to equipment reliability expert Heinz Bloch.
Heinz had been an expert equipment troubleshooter for Exxon before he retired and started his consulting practice. Once he retired, he wrote dozens of books about equipment reliability, equipment troubleshooting, and machinery design and lubrication. This included his book Machinery Failure Analysis and Troubleshooting: Practical Machinery Management for Process Plants.
What did I learn in my discussions with Heinz? The problem that reliability engineers were having with root cause analysis was that they weren’t completing a thorough troubleshooting of the equipment failure before they tried to identify the equipment problem’s root causes.
What did Heinz learn in his discussions with me? That there were advanced root cause analysis methods that were more effective than the common 5-Whys, Fishbone Diagrams, and Cause and Effect.
Heinz calculated that failing to troubleshoot equipment problems effectively, and to find and fix their root causes, cost thousands to millions of dollars in needless repairs and equipment downtime. Summed across a large corporation’s facilities around the world, the cost could reach hundreds of millions of dollars per year.
That motivated us to take action. We decided to work together to create a system for thorough troubleshooting of equipment problems and advanced root cause analysis of the reliability issues that would lead to effective corrective actions. This article explains the result of our work.
Thorough Equipment Troubleshooting
To develop an effective troubleshooting system, Heinz and I started with the troubleshooting tables that he successfully used and that he had included in his book Machinery Failure Analysis and Troubleshooting: Practical Machinery Management for Process Plants. This provided a basis for developing a computerized troubleshooting technique that we called the Equifactor® Troubleshooting Tables.
The troubleshooting tables were divided into four topics:
- Equipment (pumps, compressors, fans, blowers, engines, electric motors, refrigeration, and conveyor belts)
- Manual Valves (ball valves, butterfly valves, diaphragm valves, pinch valves, globe valves, gate valves, plug valves, and general valve troubleshooting)
- Components (bearings, gears, gear couplings, and mechanical seals)
- Electrical (resistors, cable insulation, switches, fuses, breakers, capacitors, terminals/joints, transformers, diodes and semiconductors, and integrated circuits)
Each of these was broken down into an exhaustive list of failure symptoms and, for each symptom, a list of potential causes.
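To make the structure of the tables concrete, here is a minimal sketch in Python of how a topic, equipment type, and symptom could map to a list of potential causes. It is not the Equifactor® software or its data: the dictionary name, the function name, and every cause entry other than the backward impeller from the example later in this article are hypothetical placeholders chosen for illustration.

```python
# Hypothetical, heavily abbreviated stand-in for a troubleshooting table.
# The nesting (topic -> equipment -> symptom -> potential causes) follows the
# description in the article. The cause entries marked "placeholder" are
# illustrative only and are NOT taken from the actual tables.
TROUBLESHOOTING_TABLES = {
    "Equipment": {
        "centrifugal pump": {
            "insufficient capacity": [
                "impeller installed backward",
                "suction strainer clogged",  # placeholder
                "pump speed too low",        # placeholder
            ],
        },
    },
    "Components": {
        "bearings": {
            "overheating": [
                "insufficient lubrication",  # placeholder
            ],
        },
    },
}


def potential_causes(topic, equipment, symptom):
    """Return a copy of the potential causes for a symptom, or [] if unlisted."""
    symptoms = TROUBLESHOOTING_TABLES.get(topic, {}).get(equipment, {})
    return list(symptoms.get(symptom, []))


if __name__ == "__main__":
    causes = potential_causes("Equipment", "centrifugal pump", "insufficient capacity")
    for cause in causes:
        print("-", cause)
```

Returning an empty list for an unlisted symptom mirrors the situation described next, where the tables do not provide an answer and another method is needed.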
If these troubleshooting tables didn’t provide an answer, one could use Heinz’s two other troubleshooting methods, Failure Modes and Failure Agents, to develop a better understanding of the equipment issue.
Below is a graphic of the Equifactor® Troubleshooting Tables.
The techniques are explained in more detail in the example that follows.
From Troubleshooting to Root Cause Analysis
Thorough troubleshooting provided the information needed to start a root cause analysis. Without thorough troubleshooting, the reliability expert was working blind. That’s why they thought that root cause analysis didn’t work. With the knowledge gained from identifying the potential cause that led to the failure (and eliminating the other potential causes that did not lead to the failure), the reliability professional could now identify the failure’s root causes using an advanced root cause analysis system.
The advanced root cause analysis system we chose to use is the TapRooT® Root Cause Analysis System, which includes the Root Cause Tree® Diagram. The TapRooT® System is described in more detail in the example below.
Example: Pump Fails to Pump Rated Flow
In this example, a pump that was vibrating excessively was removed, rebuilt, and reinstalled in the system. When the pump was tested, the vibration was gone, but the pump provided only 70% of the previously rated flow. The questions for the reliability engineer were: what was wrong, and what was the root cause (or root causes) of the problem?
This example uses the Equifactor® Six Step Process for troubleshooting and finding the root causes of the pump not being able to provide the rated flow. The process is shown below.
Normally, the process starts with the analyst drawing a SnapCharT® Diagram of what they know. For this example, the SnapCharT® Diagram they initially drew is provided below.
Next, they start the troubleshooting process by opening the centrifugal pump troubleshooting table and selecting the insufficient capacity symptom. That symptom, shown below in the computerized Equifactor® Troubleshooting Table, provides a list of possible causes.
For this example, those 25 potential causes are what the analyst needs to either verify or eliminate.
At this point, we recommend developing a troubleshooting checklist that starts with the easiest potential causes to select or eliminate (those that don’t require pump removal and disassembly) and finishes with the causes that require pump removal and disassembly. An example of this type of checklist is provided below.
Answering these questions should lead the troubleshooter to the cause of the problem. In this case, they found the impeller (a double suction/double volute impeller) was installed backward.
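The checklist ordering recommended above (checks that do not require pump removal and disassembly first, intrusive checks last) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Check` class, its field names, the effort estimates, and all causes other than the backward impeller are hypothetical and are not taken from the Equifactor® tables.

```python
from dataclasses import dataclass


@dataclass
class Check:
    cause: str                  # potential cause to verify or eliminate
    requires_disassembly: bool  # True if the pump must be removed and opened
    effort_hours: float         # rough, hypothetical effort estimate
    status: str = "open"        # "open", "eliminated", or "verified"


def build_checklist(checks):
    """Order checks: non-intrusive first, then by increasing effort."""
    # False sorts before True, so checks needing no disassembly come first.
    return sorted(checks, key=lambda c: (c.requires_disassembly, c.effort_hours))


if __name__ == "__main__":
    # Illustrative entries only; a real list would come from the table.
    candidates = [
        Check("impeller installed backward", True, 8.0),
        Check("pump speed too low", False, 0.5),
        Check("worn wear rings", True, 6.0),
        Check("suction valve not fully open", False, 0.25),
    ]
    for number, check in enumerate(build_checklist(candidates), start=1):
        where = "disassembly" if check.requires_disassembly else "in place"
        print(f"{number}. {check.cause} ({where}, ~{check.effort_hours} h)")
```

Sorting on the disassembly flag before the effort estimate keeps every in-place check ahead of any check that takes the pump out of service, even when an in-place check is slow.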
The information gained is added to the SnapCharT® Diagram shown below.
The information in the SnapCharT® Diagram is used to identify the problem’s Causal Factors using a Causal Factor Worksheet (a portion of a blank worksheet is shown below).
Using the worksheet (under question 2 above), the Causal Factor identified was “Mechanic installed impeller backward.”
Notice that what started out to be an equipment problem (pump not pumping rated flow) is actually a human performance problem (mechanic installed impeller backward). Thus, without thorough troubleshooting, there was no chance to identify the problem’s root causes.
Next, the Causal Factor (mechanic installed impeller backward) is analyzed using the TapRooT® Root Cause Tree® Diagram, including the Human Performance Troubleshooting Guide and the applicable Basic Cause Categories.
The first question in the Human Performance Troubleshooting Guide (about fatigue and impairment) is shown below. There are 15 of these questions to guide the investigation.
The Root Cause Tree® Diagram and the associated Root Cause Tree® Dictionary provide a comprehensive set of questions that help the analyst identify the fixable root causes of human performance issues. In this case, the analyst would be guided by the Human Performance Troubleshooting Guide to look at procedure use, quality control, human engineering, management system, and work direction.
An example of the Procedure Basic Cause Category from the back side of the Root Cause Tree® Diagram is shown below. This is one of the seven Basic Cause Categories that could be indicated for analysis using the Human Performance Troubleshooting Guide.
Depending on the answers to the questions and the root causes selected, the analyst might decide to require:
- a written procedure (that includes a caution about installing the impeller backward) to be used for this type of job,
- a quality control inspection after a double volute impeller is installed on the shaft,
- the manufacturer to develop a keyway that only allows the impeller to be installed in the correct direction, and/or
- the supervisor to verify the correct installation of a double volute impeller before the pump is reassembled.
Cost of Failures
What was the cost of this pump issue (failure to pump rated flow)? Start with the costs of the maintenance time and effort needed to troubleshoot the problem (removing, disassembling, reassembling, and reinstalling the pump). Use your fully burdened pipefitter, mechanic, and engineer costs to calculate what the cost would be at your facility. Then consider the downtime (if any) required for this troubleshooting and repair and the cost of lost production. The answer could vary significantly if the downtime for your plant is significant. But the cost is at least $4,000.
In a different example on an offshore gas production platform, the failure of a downhole pump that supplied cooling water to the platform’s processes cost over $50,000 for each repair. On average, the pump was failing six times each year. That’s $300,000 per year of needless repair costs because the maintenance personnel were not effectively troubleshooting this one pump’s bearing problem and fixing the problem’s root causes.
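The two cost estimates above are simple arithmetic, and a short Python sketch may help when applying them to your own facility. The function names, the hours, and the hourly rates below are hypothetical assumptions for illustration; only the $50,000 per repair and six failures per year come from the platform example.

```python
def labor_cost(hours_by_role, rate_by_role):
    """Sum hours times fully burdened hourly rate over each role."""
    return sum(hours * rate_by_role[role] for role, hours in hours_by_role.items())


def event_cost(labor, downtime_hours=0.0, lost_production_per_hour=0.0):
    """Cost of one failure event: labor plus lost production during downtime."""
    return labor + downtime_hours * lost_production_per_hour


def annual_repair_cost(cost_per_repair, failures_per_year):
    """Yearly cost of a repeating failure."""
    return cost_per_repair * failures_per_year


if __name__ == "__main__":
    # Hypothetical hours and rates; substitute your facility's own numbers.
    hours = {"pipefitter": 8, "mechanic": 16, "engineer": 8}
    rates = {"pipefitter": 75.0, "mechanic": 80.0, "engineer": 110.0}
    labor = labor_cost(hours, rates)
    print("Labor cost for one event:", labor)
    print("Event cost with no downtime:", event_cost(labor))

    # Figures from the offshore platform example in the text.
    print("Annual repair cost:", annual_repair_cost(50_000, 6))  # 300000
```

Keeping downtime as a separate term makes it easy to see how quickly lost production can dominate the labor cost at a plant where the pump is critical.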
Let’s continue. Consider the company’s other platforms, gas and oil processing equipment offshore and onshore, and the other refining and chemical equipment this large multinational corporation operated. How many millions of dollars were being wasted each year by not effectively troubleshooting and fixing equipment issues? Millions? Tens of millions? For example, a TapRooT® User saved $40 million in just two years by eliminating frequent unplanned shutdowns on an acrylates oxidation reactor (see the success story at https://www.taproot.com/using-taproot-to-improve-process-reliability/).
Surely the potential cost savings that could be achieved are worth the cost of training reliability engineers, maintenance managers, and maintenance technicians to effectively troubleshoot and find the root causes of equipment reliability issues.
But this isn’t the only reason to consider implementing effective troubleshooting and advanced root cause analysis. Equipment failures and unplanned maintenance can lead to even more significant issues. These issues could include injuries while performing unplanned maintenance or explosions and fires due to loss of containment of flammable, hazardous materials when machinery fails.
That’s why effective troubleshooting paired with advanced root cause analysis is essential to save your company money and prevent serious incidents.
We were proud to work with Heinz Bloch to develop and improve the Equifactor® Troubleshooting Tools and Techniques. Heinz passed away in 2022, but we will continue his legacy by teaching equipment professionals the tools he developed.
The figures in this article are copyrighted by System Improvements, Inc., and are used here by permission. Duplication or use in any other form is prohibited. For more information about the Equifactor® Troubleshooting Techniques and TapRooT® Root Cause Analysis, see https://www.taproot.com and https://www.taproot.com/equifactor/
Mark Paradies is the President of System Improvements and worked with Heinz to develop the Equifactor® Troubleshooting Techniques. He has a BS in Electrical Engineering, an MS in Nuclear Engineering, and, as a Navy Nuke, was certified as an Engineer by NAVSEA 08. He has worked to develop advanced root cause analysis techniques (including writing 15 books on root cause analysis and holding a patent on root cause analysis software) and has applied root cause analysis to human performance and equipment problems in a wide variety of industries for over 35 years.
Justin Clark is a Strategic Advisor for System Improvements and is in charge of improving and teaching the Equifactor® Troubleshooting Techniques and Training, as well as teaching the TapRooT® Root Cause Analysis System. He is also the Track Leader for the Equipment Reliability Track at the 2023 Global TapRooT® Summit (see https://www.taproot.com/summit/). He has a BS in Mechanical Engineering, an MS in Engineering Management, and, as a Navy Nuke, was certified as an Engineer by NAVSEA 08.