Why Some Root Cause Investigations Don’t Prevent Recurrence

Randall Noon, www.mt-online.com

In the nuclear power industry, the primary mission of a root cause investigation is to understand how and why a failure or a condition adverse to quality has occurred so that it can be prevented from recurring. This is a good practice for many reasons—and a lawful requirement mandated by 10CFR50, Appendix B, Criterion XVI.

To successfully carry out this mission, a root cause investigation needs to be evidence-driven in accordance with a rigorous application of the bedrock of all root cause methodologies: the Scientific Method. Consistent with the Scientific Method, underlying assumptions have to be questioned and conclusions have to be consistent with the available evidence, as well as with proven scientific facts and principles.

Sometimes root cause investigations fail to fulfill their primary mission and the failure recurs. In that regard, diagnosing the root cause of root cause investigation failures is, in itself, an interesting topic. Here are three common reasons why some root cause investigations fail their mission.

Reason #1: The Tail Wagging the Dog

As a root cause investigation proceeds and information about the failure event accumulates, some initial hypotheses can be readily falsified by the preliminary evidence and dismissed from consideration. The diminished pool of remaining hypotheses will likely have some attributes in common. More work is then usually needed to uncover additional evidence to discriminate which of the remaining hypotheses specifically apply.

At this point in the investigation, it may become apparent what the final root cause might be—especially if the remaining pool of hypotheses is small and they all share several important attributes. At the same time, it also becomes apparent what the corresponding corrective actions might be.

By anticipating which corrective actions are more palatable to the client or management, the investigator may begin to unconsciously—or perhaps even consciously—steer the remainder of the investigation to arrive at a root cause whose corresponding corrective actions are less troublesome.

Evidence that appears to support the root cause and lead to more palatable corrective actions is actively sought, while evidence that might falsify the favored root cause is not actively sought. Evidence that could falsify a favored root cause may be dismissed as being irrelevant or not needed. It may be tacitly assumed to not exist, to have disappeared or to be too hard or too expensive to find. It may even just be ignored because so much evidence already exists to support the favored root cause that the investigator presumes he already has the answer.

In logic, this is defined as an a priori methodology. This is where an outcome or conclusion is decided beforehand, and the subsequent investigation is conducted to find support for the foregone conclusion. In this case, the investigator has decided what corrective actions he wants based on convenience to his client or management. Subsequently, he uses the remainder of the investigation to seek evidence that points to a root cause that corresponds to the corrective actions he desires.

Here is an example: A close-call accident involved overturning a large, heavy, lead-lined box mounted on a relatively tall, small-wheeled cart. The root cause investigation team found that the box and wheeled cart combination was intrinsically unstable. The top-heavy cart easily tipped when the cart was moved and the front wheels had to swivel, or when the cart was rolled over a carpet edge or floor expansion joint.

The investigation team also found that the personnel who moved the cart in the course of doing cleaning work in the area had done so in violation of an obviously posted sign. The sign stated that prior to moving the cart a supervisor was to be contacted. The personnel, however, inadvertently moved the cart—without contacting a supervisor—in order to clean under and around it.

The easy corrective actions in this case would be to chastise the personnel for not following the posted rules and to strengthen work rule adherence through training and administrative permissions. There is ample evidence to back-fit a root cause to support these actions. Also, such a root cause finding—and its corresponding corrective actions—are consistent with what everyone else in the industry has done to address the problem, as noted in ample operational experience reports. In the nuclear power industry, the “bandwagon” effect of doing what other plants are doing is very strong.

In short, the aforementioned corrective actions are attractive because they appeal to notions of personal accountability, are cheap to do and can quickly dispose of the problem. Consequently, the root cause of the close-call accident was declared to be that the workers failed to follow the rules.

Unfortunately, when the cart and box combination is rolled to a new location, the same problem could recur. The procedure change and additional training might not have fixed the instability problem. While the new administrative permissions and additional training could reduce the probability of recurrence, they would not necessarily eliminate it. When the cart is rolled many times to new locations, it is probable that the problem will eventually recur and perhaps cause a significant injury. This situation is similar to the hockey analogy of “shots on goal.” Even the best goalkeeper can be scored upon if there are enough shots on goal.
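The “shots on goal” argument can be made concrete with a little arithmetic. The sketch below is only an illustration: it assumes each move of the cart is an independent event with the same small chance of tipping, and the per-move probabilities are hypothetical numbers chosen for the example, not data from the investigation.

```python
def recurrence_probability(p_per_move: float, moves: int) -> float:
    """Chance of at least one tip-over across `moves` independent moves.

    Assumes (hypothetically) that every move carries the same
    probability `p_per_move` of tipping the cart.
    """
    if not 0.0 <= p_per_move <= 1.0:
        raise ValueError("p_per_move must be between 0 and 1")
    if moves < 0:
        raise ValueError("moves must be non-negative")
    # P(at least one) = 1 - P(none) = 1 - (1 - p)^n
    return 1.0 - (1.0 - p_per_move) ** moves


if __name__ == "__main__":
    # Hypothetical per-move tipping probabilities, for illustration only.
    before_training = 0.01   # unstable cart, original work rules
    after_training = 0.002   # same unstable cart, stricter work rules

    for n in (10, 100, 1000):
        print(
            f"{n:5d} moves: "
            f"before={recurrence_probability(before_training, n):.3f}  "
            f"after={recurrence_probability(after_training, n):.3f}"
        )
```

Under these assumptions, training lowers the per-move risk, but the cumulative chance of a tip-over still climbs toward certainty as the number of moves grows. Only removing the instability itself takes the per-move probability to zero.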

Reason #2: Putting Lipstick on a Corpse

In this instance, a failure event has already been successfully investigated. A root cause supported by ample evidence has been determined. Vigorous attempts to falsify the root cause conclusion have failed. Ok…so far, so good.

On the other hand, perhaps the root cause conclusion is related to a deficiency involving a friend of the investigator, a manager known to be vindictive and sensitive to criticism or some company entity that, because of previous problems, can’t bear criticism. The latter could include an individual that might get fired if he is found to have caused the problem, an organization that might be fined or sued for violating a regulation or law or a department that might be re-organized or eliminated for repeatedly causing problems. In other words, the root cause investigator is aware that the actual consequences of identifying and documenting the root cause may be greater than just the corrective actions themselves.

When faced with this dilemma, some investigators attempt to “word-smith” the root cause report in an effort to minimize perceived negative findings and to emphasize perceived positive findings. Instead of using plain, factually descriptive language to describe what occurred, less precise and more positive-sounding language is used.

“Word-smithed” reports are relatively easy to spot. Instead of using plain modifiers like “deficient” or “inadequate” to describe a process, euphemistic phrases like “less than sufficient” or “less than adequate” are used. Instead of reporting that a component has failed a surveillance test, the component is reported to have “met 95% of its expected goals.” Likewise, instead of reporting that a fire occurred, it is reported that there was a “minor oxidation-reduction reaction that was temporarily unsupervised.”

In such cases, the root cause report becomes a quasi-public relations document that sometimes has conflicting purposes. Since it is a root cause report, its primary purpose is supposed to be a no-nonsense, fact-based document that details what went wrong and how to fix it. However, a secondary, perhaps conflicting, purpose is introduced when the same document is used to convince the reader that the failure event and its root cause are not nearly as significant or serious as the reader might otherwise think.

With respect to recurrence, there are two problems with “word-smithing” a root cause report. The first is that corrective actions work best when they are specific and targeted. A diluted or minimized root cause, however, is often matched to a diluted or minimized corrective action. There is a strong analogy to the practice of medicine in this instance. When a person has an infection, if the degree of infection is underestimated, the medicine dose may be insufficient and the infection may come back.

The second problem is that by putting a positive “spin” on the problem, management may not properly support what needs to be done to fix the problem. Thus, the report succeeds in convincing its audience that the failure event is not a serious problem.

Reason #3: Elementary My Dear Watson

In some ways, root cause investigations are a lot like “whodunit” novels. Some plant personnel simply can’t resist making a guess about what caused the failure in the same way that mystery buffs often try to second guess who will be revealed to be the murderer at the end of the story. It certainly is fun for a person—and perhaps even a point of pride—if his/her initial guess turns out to be right. Unfortunately, there are circumstances when such a guess can jeopardize the integrity of a root cause investigation.

The circumstances are as follows:

The guess is made by a senior manager involved in the root cause process.
The plant has an authoritarian, chain-of-command style organization.
The management culture puts a high premium on being “right,” and has a zero-defects attitude about being “wrong.”

The scenario goes something like this:

A failure event occurs or a condition adverse to quality is discovered.
Some preliminary data is quickly gathered about conditions in the plant when the failure occurred.
From this preliminary data, a senior manager guesses that the root cause will likely be x, because:
- (1) he/she was once at a plant where the same thing occurred; or
- (2) applying his/her own engineering acumen, he/she deduces the nature of the failure from the preliminary data, like a Sherlock Holmes or a Miss Marple.
Not being particularly eager to prove their senior manager wrong and deal with the consequences, the root cause team looks for information that supports the manager’s hypothesis.
Not surprisingly, the team finds some of this supporting information; the presumption is then made that the cause has been found and field work ceases.
A report is prepared, submitted and approved, possibly by the same senior manager that made the Sherlockian guess.
The senior manager takes a bow, once again proving why he is a senior manager.

The deficiency in this scenario that can lead to recurrence is the fact that falsification of the favored hypothesis was not pursued. Once a cause was presumed to have been found, significant evidence gathering ceased. (Why waste resources when we already have the answer?) As a result, evidence that may have falsified the hypothesis, or perhaps supported an alternate hypothesis, was left in the field. Again, this is another example of an a priori methodology: where the de facto purpose of the investigation is to gather information that supports the favored hypothesis.

In this regard, there is a famous experiment about directed observation that applies. Test subjects in the experiment were told to watch a volleyball game carefully because they would be questioned about how many times the volleyballs would be tipped into the air by the participants. This they did.

In fact, the test subjects did this so well, they ignored a person dressed in a gorilla suit who sauntered through the gaggle of volleyball players as they played. When the test subjects were asked about what they had observed, they all reported dutifully the number of times the ball was tipped but no one mentioned the gorilla. When they were told about the gorilla, they were incredulous and did not believe that they had missed seeing a gorilla…until they were shown the tape a second time. At that point, they all observed the gorilla.

Randall Noon is a root cause team leader at Cooper Nuclear Station in Brownville, Nebraska. He is a licensed professional engineer in the United States, as well as Canada, and has been investigating failures for over 30 years. Randall Noon can be contacted at [email protected].

Analyzing Semiconductor Failure

Semiconductor devices are almost always part of a larger, more complex piece of electronic equipment. These devices operate in concert with other circuit elements and are subject to system, subsystem and environmental influences. When equipment fails in the field or on the shop floor, technicians usually begin their evaluations with the unit's smallest, most easily replaceable module or subsystem. The subsystem is then sent to a lab, where technicians troubleshoot the problem to an individual component, which is then removed--often with less-than-controlled thermal, mechanical and electrical stresses--and submitted to a laboratory for analysis. Although this isn't the optimal failure analysis path, it is generally what actually happens.

Improvement: What Comes First?

I use the term RCPE because it is a waste of good initiatives and time to find the root cause of a problem but not fix it. I prefer the word problem to the word failure, although the more common terminology is Root Cause Failure Analysis (RCFA), because the word failure often leads to a focus on equipment and maintenance. The word problem includes all operational, quality, speed, high-cost and other losses. Eliminating problems is a joint responsibility between operations, maintenance and engineering.

An Integrated Process for System Maintenance, Fault Diagnosis and Support

This paper presents an overview of an integrated process for system maintenance, fault diagnosis and support. The solution is based on Qualtech Systems, Inc.’s (QSI’s) TEAMS toolset for integrated diagnostics and involves several key innovations. As a showcase of the integrated solution, QSI, along with Antech Systems and Carnegie Mellon University (CMU), has recently completed a research project for the Information Technology Branch at the Naval Air Warfare Center–Aircraft Division (NAWC-AD) in St. Inigoes, MD. The entire system, termed ADAPTS (Adaptive Diagnostic And Personalized Technical Support), provides a comprehensive solution to integrated maintenance and training.

Anatomy of a Boiler Failure—A Different Perspective

The power industry’s operating and maintenance practices were held up to intense regulator and public scrutiny when, on November 6, 2007, a Massachusetts power plant’s steam-generating boiler exploded and three men died. The Department of Public Safety’s Incident Report investigation determined that the primary cause of the explosion at Dominion Energy New England’s Salem Harbor Generating Station Unit 3 was extensive corrosion of boiler tubes.

Anatomy of a Hydraulic Pump Failure

I was asked recently to give a second opinion on the cause of failure of an axial piston pump. The hydraulic pump had failed after a short period in service and my client had pursued a warranty claim with the manufacturer. The manufacturer rejected the warranty claim on the basis that the failure had been caused by contamination of the hydraulic fluid. The foundation for this assessment was scoring damage to the valve plate.

Are We Willing to Hear What “Failure” Has to Say?

Root Cause Analysis has the potential of CHANGING people, IF the leader of the investigation knows of this potential. Far from “just another problem-solving exercise,” the root cause analysis should SLOW PEOPLE DOWN to the extent that they can see the truth of the incident under inquiry, WHATEVER THE TRUTH MIGHT BE. This paper focuses on two parts of our human nature which are large obstacles to root cause discovery, i.e., our unwillingness to slow down, and our unwillingness to let go of certain basic assumptions about life. Warning: This paper is designed to challenge the way you think about Root Cause Analysis.

Definition of Root Cause Analysis (RCA)

A fault tree is constructed starting with the final failure and progressively tracing each cause that led to the previous cause. This continues till the trail can be traced back no further. Each result of a cause must clearly flow from its predecessor (the one before it). If it is clear that a step is missing between causes it is added in and evidence looked for to support its presence. Below is a sample fault tree for the moral story of the kingdom lost because of a missing horseshoe nail.
