Root Cause Analysis Chronic Events: Panning For Gold

Robert J. Latino, Reliability Center, Inc.

Root Cause Analysis: Chronic Events vs. High Visibility Events

When we look at the widely used and widely misunderstood tool of root cause analysis (RCA), we should reflect on how it is interpreted in our own environments. Think about it: when is RCA typically requested and applied in your environment? Based on my experience, it is requested and applied when:

Someone is injured
There is catastrophic damage
There is an environmental incident
There is a “near miss”
There is public scrutiny over an issue at the site
There is a quality issue that a customer is complaining about

What do all of these issues have in common? They are high visibility events that require immediate action at the request of authority. Usually in these circumstances, resources, time, and money are not an issue because of the level of management that is requesting the analyses be done. While Root Cause Analysis needs to be, and will be, done under these circumstances, it is not the optimal use of such a disciplined methodology.

Utilizing a modified version of failure modes and effects analysis (FMEA), consider a one-time fire at a facility that results in $500,000 worth of damage. Costs such as these are unanticipated and not part of the budget, yet we almost always find the cash to recover. The accountants typically will use creative techniques to soften the blow, such as amortizing the cost of the event over a 20-year period. The resulting impact would be viewed as $25,000/yr ($500,000 / 20 yr), which is much more acceptable. Using the modified FMEA format, such a line item might look like the top item in the accompanying section “Comparative Impact of Failure Events”.

Now consider a chronic event such as conveyor belts that trip in a mining operation. Taken individually, each trip may take 15 minutes to reset. This 15-min period requires the attention of a person, which at a typical standard rate ($40/hr with benefits included) results in a labor cost per event of $10 (0.25 hr x $40/hr labor rate).

Because the event simply requires a person to find and reset the tripped conveyor system, generally no additional parts costs are involved. However, the 15-min delay causes a production loss upstream in the processing area, valued at $5000/hr. Fifteen minutes is now worth $1250/occurrence (0.25 hr x $5000/hr production loss). So each 15-min occurrence is now worth $1260 ($10 labor + $1250 lost production). Still considered a relatively low impact, right?

Now consider that on this particular conveying system we experience 40 such stoppages a week, or 2080 for the year (40 x 52 weeks). Now we are looking at an annual impact to the bottom line of $2,620,800 ($1260/occurrence x 2080 occurrences).
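The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the calculation described in the text; the function name `annual_impact` and its parameter names are assumptions made for the example, not part of any particular FMEA tool.

```python
def annual_impact(occurrences_per_year, hours_per_occurrence,
                  labor_rate_per_hour, production_loss_per_hour,
                  parts_cost_per_occurrence=0.0):
    """Return (cost per occurrence, annual cost) in dollars.

    Hypothetical helper: the cost of one occurrence is labor plus
    lost production plus any parts, and the annual figure is that
    cost multiplied by the number of occurrences in a year.
    """
    labor = hours_per_occurrence * labor_rate_per_hour
    lost_production = hours_per_occurrence * production_loss_per_hour
    per_occurrence = labor + lost_production + parts_cost_per_occurrence
    return per_occurrence, per_occurrence * occurrences_per_year


# Figures taken from the conveyor example in the text.
per_trip, per_year = annual_impact(
    occurrences_per_year=40 * 52,      # 40 stoppages a week
    hours_per_occurrence=0.25,         # 15 minutes to reset
    labor_rate_per_hour=40.0,          # standard rate with benefits
    production_loss_per_hour=5000.0,   # lost production in processing
)
print(f"${per_trip:,.0f} per occurrence, ${per_year:,.0f} per year")
# prints "$1,260 per occurrence, $2,620,800 per year"
```

The same function can be reused for any other chronic event by changing the four inputs, which is the point of building the list one line item at a time.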

On an annual basis, the chronic event is approximately 100 times more costly than the amortized fire ($2,620,800/yr versus $25,000/yr), yet which event gets the most attention – the one-time fire or the continual tripping of a conveyor system? We all know the answer; the fire gets the attention because it is highly visible and requires urgent response. The chronic event has been accepted as a cost of doing business and is considered part of the job. Herein lies the problem. Chronic events are never aggregated on an annual basis. They are typically viewed on their individual impacts.
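To make the comparison explicit, here is a minimal sketch that puts the two events side by side using only the figures given in the text. The variable names are illustrative. Note that the "100 times" figure depends on comparing against the fire as amortized over 20 years; against the full, unamortized damage the chronic event is still several times more costly in a single year.

```python
# One-time fire, figures from the text.
fire_total = 500_000.0
amortization_years = 20
fire_annualized = fire_total / amortization_years   # $25,000/yr

# Chronic conveyor trips, figures from the text.
cost_per_trip = 1260.0
trips_per_year = 2080
conveyor_annual = cost_per_trip * trips_per_year     # $2,620,800/yr

# How much larger is one year of conveyor trips than the fire?
ratio_vs_amortized = conveyor_annual / fire_annualized
ratio_vs_full = conveyor_annual / fire_total

print(f"vs. amortized fire:   {ratio_vs_amortized:.1f}x")  # 104.8x
print(f"vs. full fire damage: {ratio_vs_full:.1f}x")       # 5.2x
```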

Consider what would happen if we applied this modified FMEA format to an operation, a process, or a facility. We would seek out these hidden “nuggets” and determine their annual impact in dollars. This would tell us what the “carrot” was and whether or not each event was worth a formal RCA. Experience shows, consistent with the Pareto Principle, that when such a list is aggregated, 20 percent or less of the events identified account for 80 percent or more of the dollars lost. This is a good technique to provide focus for a disciplined RCA effort.
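The Pareto screening described above might be sketched as follows. This is one plausible way to rank an event list, not a prescribed method: the function name `pareto_focus` is an assumption, and every event in the sample list except the conveyor trips and the amortized fire is made-up data, included only to show the mechanics.

```python
def pareto_focus(events, threshold=0.80):
    """Rank events by annual loss and return the top ones that together
    reach `threshold` of the total dollars lost.

    `events` maps an event name to its annual loss in dollars.
    Returns a list of (name, annual loss, cumulative share) tuples.
    """
    total = sum(events.values())
    ranked = sorted(events.items(), key=lambda item: item[1], reverse=True)
    selected, running = [], 0.0
    for name, loss in ranked:
        running += loss
        selected.append((name, loss, running / total))
        if running / total >= threshold:
            break
    return selected


# Annual losses in dollars. Only the first two come from the text;
# the rest are HYPOTHETICAL placeholders for illustration.
events = {
    "conveyor trips": 2_620_800,
    "fire (amortized)": 25_000,
    "off-spec batches": 240_000,      # hypothetical
    "pump seal leaks": 180_000,       # hypothetical
    "crane delays": 90_000,           # hypothetical
    "wrong parts delivered": 60_000,  # hypothetical
    "sensor recalibration": 40_000,   # hypothetical
    "filter plugging": 30_000,        # hypothetical
    "forklift queueing": 20_000,      # hypothetical
    "label reprints": 10_000,         # hypothetical
}

focus = pareto_focus(events)
for name, loss, share in focus:
    print(f"{name:<20} ${loss:>10,.0f}  cumulative {share:.0%}")
```

With these made-up figures, two of the ten events cross the 80 percent line, so those two would be the first candidates for a formal RCA. Real lists will differ; the value of the exercise is in the ranking, not in the exact cutoff.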

So where does the data come from to populate this type of spreadsheet? There are numerous means by which such lists can be developed, but how confident are we in the data? Think about where such information resides today: our ERP system, RCM system, CMMS, etc. How many of us really believe that such systems accurately reflect the field activity, especially when it comes to the recording of every chronic event?

It has been my experience that when a chronic event occurs, from the perception of the person tasked to fix the undesirable event, it takes more time to input the information into the recording system than it does to fix the problem. Usually the information system carries a negative connotation and is deemed too cumbersome, so we will just fix the problem and be on our way. After all, that is what we are pressured to do – fix it and get production going again.

While we can get some information from such on-line monitoring systems, we must recognize that they are not all inclusive at this time. Only the people closest to the work will truly have the knowledge of the most chronic events. It is in their heads, not on paper!

Typically most information systems are labeled and advertised as asset management systems. So failures that affect the asset are typically what are recorded. However, what may not be recorded are events that produce off-spec product where no mechanical failure occurs, time delays as a result of a crane not showing up on time during a shutdown, time delays due to the wrong parts delivered to the site, or late deliveries to customers.

How do such asset management systems handle these events? Where is it recorded that such occurrences are undesirable and how are proactive recommendations from RCAs processed in a timely fashion?

If we conclude in our RCA that procedures are obsolete, specifications are incorrect, or that people were not trained properly to perform a task, how are these situations handled in the asset management system? These questions are food for thought when we consider how well our current environment supports the task of root cause analysis.

We can be the greatest failure analysts on the planet, but if we are working on the wrong events and our environment does not support the proactive activity, then we are likely to become frustrated ourselves and fall into the paradigm of “if management does not care, then why should I?” Once this attitude sets in, complacency with a reactive culture becomes the norm and overall profitability suffers.

What we need to do today is educate management so that it recognizes that our cultures live with these chronic events, which typically end up costing 100 times more than the occasional sporadic event. Unfortunately, the sporadic events get all the attention. When our cultures are enlightened, we will begin to enjoy the fruits of our efforts in the form of return on investment (ROI) figures as high as 7000-8000 percent. We need to apply Root Cause Analysis to both chronic and high visibility events, but the more we focus on chronic events, the fewer high visibility events there will be. Then the believers will come.

Robert Latino is the Principal of Prelical Solutions, LLC, a practical reliability consulting firm that is helping companies realize their reliability potential.
