Root Cause Problem Elimination (RCPE) is a better term than Root Cause Failure Analysis (RCFA) because the goal is not only to look at what caused the failure; it is to learn what caused the failure and prevent that failure from occurring again. An effective root cause problem elimination process can improve production reliability significantly, but few organizations have a functioning RCPE process in place. This article discusses common problems and suggests solutions to improve the processes in place at your organization.
Don’t just analyze: Implement the Solutions
The business process is most commonly named Root Cause Failure Analysis (RCFA) or root cause analysis. The desired result of the process is to eliminate the problem, NOT merely to analyze the failure. To convey the desired result to the organization, we at IDCON believe the name should be changed to Root Cause Problem Elimination (RCPE).
An example of RCPE results is plotted over time in figure 1. The result would have been a $1,800 cost if the problem had been analyzed but no solution implemented. But, as the graph shows, a solution was implemented, and it generates $8,000 in avoided cost per year.
Initially it costs money to identify and analyze the problem, in this case $1,800 for personnel, testing, and some consumables.
It also costs money to prioritize, plan, and schedule the corrective action ($200). Adding the redesign and material cost to implement the solution brings the total cost of the implemented solution to $4,000. The cost avoidance from future problems is estimated at $8,000 per year. The figure shows an RCPE profit of $20,000 after 3 years ($24,000 in avoided cost less the $4,000 spent), since the problem was avoided year after year. The avoided cost will most likely continue to accumulate in the future.
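The arithmetic behind the figure can be sketched in a few lines of Python. This is only an illustration using the dollar amounts quoted above; the function and constant names are made up for this example, and it assumes the full $4,000 is spent up front and the $8,000 is avoided evenly, once per year.

```python
# Figures taken from the example in the text; names are illustrative only.
ANALYSIS_COST = 1_800          # identify and analyze the problem
TOTAL_COST = 4_000             # analysis + planning/scheduling + implementation
ANNUAL_AVOIDED_COST = 8_000    # estimated cost avoidance per year


def rcpe_result(years: int, implemented: bool = True) -> int:
    """Cumulative result in dollars after `years` whole years."""
    if not implemented:
        # The analysis was paid for, but nothing was fixed, so nothing is avoided.
        return -ANALYSIS_COST
    return ANNUAL_AVOIDED_COST * years - TOTAL_COST


# Columns: year, result with the solution implemented, result with analysis only.
for year in range(4):
    print(year, rcpe_result(year), rcpe_result(year, implemented=False))
```

With the solution implemented, the result turns positive during the first year and reaches $20,000 after three years. With analysis only, it stays at -$1,800 no matter how long you wait.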
If an organization engages in Root Cause Problem Elimination, it is critical to implement the discovered solutions; otherwise the organization will end up paying for a wasted analysis. Something we see quite often is a lack of implementation, which raises an interesting question: Why aren’t the investigated solutions implemented?
Figure 1. A typical RCPE costs money, in this case $4,000 to analyze, plan, schedule and implement a solution. The profit from an RCPE, in the form of avoided cost, accumulates at a rate of $8,000 per year. After 3 years, a $20,000 profit can be attributed to the RCPE. If the solution had not been implemented, the result would have been -$1,800 for the analysis.
Start with the basics before engaging in RCPE
An example of a successful RCPE that cost a plant $1.1 million was once described to me. To make a long story short, a team of people performed an RCPE for several weeks and discovered that the root cause was worn-out and missing coupling bolts caused by misalignment.
Their practice of root cause analysis was poor and their planning and scheduling practices didn’t allow mechanics time to align machinery properly. This analysis was described as a huge success.
But, the obvious question has to be asked: Why focus on RCPE in this plant?
If the plant had had basic preventive maintenance (PM) in place, the missing bolts would have been found much earlier by looking at the coupling using a stroboscope (yes, guards should have OSHA-specified inspection ports).
Organizations should not prioritize Root Cause Problem Elimination if they work in a highly reactive environment. It could be a good idea if an organization is in a somewhat reactive situation, but not in a highly reactive mode.
In a highly reactive mode it may sound like a good idea to start problem solving, but it doesn’t work. The reason is simple: highly reactive organizations don’t need to analyze problems to find solutions, because the problems tend to be quite apparent.
Common problems are poor foundations, corrosion, broken components that haven’t been fixed yet, missing bills of materials, disorganized spare parts and materials, missing equipment numbers, an extensive maintenance backlog, and a lack of standard operating procedures and training for operators. The list goes on.
A highly reactive organization should prioritize basic preventive maintenance and planning and scheduling before it can free up time to do Root Cause Problem Elimination (RCPE). Even if RCPE were applied in a highly reactive organization, the solutions would point at the obvious problems mentioned above.
The right people should engage in Root Cause Problem Elimination
RCPE programs are often designed to engage a facilitator and a few individuals. Engaging larger groups can provide a great learning experience when necessary, but large groups should only be used on a case-by-case basis. They tend to be hard to get together and will usually dissipate over time if the meetings become too cumbersome. This is why it’s best to gather a small group that will meet consistently to discuss and address the problem at hand.
Day to day root cause problem eliminations should therefore be managed by the frontline (hourly and first line supervision). If they run into a tricky problem, a larger group may be called.
I truly believe that 80% of all problems in your organization can be solved by the front line using simple problem solving skills. In order to be effective they need to be given the right tools and processes.
They are in most cases closest to the problem and can therefore collect data and observations better than anyone else in the organization. They usually have the technical knowledge needed to solve the problem. What is often missing is a problem-solving process and the discipline to follow that process.
It is also important to remember that it is management’s responsibility to design or provide a root cause process. Management is responsible for implementing this process in the mill, setting up responsibilities, and following up to ensure accountability. This is not unique to root cause; it applies to all business processes in the organization.
Change the Culture
“Downtime reported by department breaks the first rule of root cause problem elimination; Ask WHY not WHO.”
A typical example may be that a motor tripped at 2:00 am. In the morning meeting we ask, what happened? Well, a motor tripped, so it’s not operations, it’s a maintenance issue. It will be classified as electrical because it was an electrical motor that tripped.
The thought process may look something like the picture below. The actual story is that operations overloaded the process and therefore the motor tripped. The E/I mechanic reset the motor but will never tell anyone what happened, because the mechanic understands the culture in the organization and does not want to put his friend in operations in a bad spot.
Fig 2. Morning meeting: a motor tripped out, causing production loss. We should always ask why and not who.
Fig 3. It is not uncommon to see downtime reporting classified by department in organizations. Classifying by department without doing root cause problem elimination first contradicts the basic concept of always asking why and not who.
Problems should be classified by symptoms, equipment number and/or component type in order to start off on the right foot. The problems may then be categorized by cost and frequency of failure.
When an RCPE has been completed, we may want to classify the problems by department or equipment type. But, in most cases all departments will be part of the problem and the solution.
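As a rough illustration of classifying by equipment number, component type and symptom, and then ranking by cost and frequency, consider the Python sketch below. Everything in it is hypothetical: the `DowntimeEvent` fields, the `rank_problems` function, the equipment numbers and the dollar amounts are invented for the example and do not come from any real downtime system. Note that the record deliberately has no department field.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class DowntimeEvent:
    # Hypothetical record layout: describes WHAT happened, not WHO owns it.
    equipment_number: str
    component_type: str
    symptom: str
    cost: float


def rank_problems(events):
    """Group events by (equipment, component, symptom); rank by cost, then count."""
    totals = defaultdict(lambda: {"count": 0, "cost": 0.0})
    for event in events:
        key = (event.equipment_number, event.component_type, event.symptom)
        totals[key]["count"] += 1
        totals[key]["cost"] += event.cost
    return sorted(
        totals.items(),
        key=lambda item: (item[1]["cost"], item[1]["count"]),
        reverse=True,
    )


# Made-up sample data for illustration only.
events = [
    DowntimeEvent("P-101", "motor", "tripped on overload", 1200.0),
    DowntimeEvent("C-204", "coupling", "missing bolts", 900.0),
    DowntimeEvent("P-101", "motor", "tripped on overload", 1200.0),
]

for key, stats in rank_problems(events):
    print(key, stats["count"], stats["cost"])
```

The most costly and most frequent problems rise to the top of the list and become the natural candidates for an RCPE. A department label can be added afterwards, once the cause is actually known.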
Torbjörn Idhammar is the president and CEO of IDCON INC., a Reliability and Maintenance Management Consulting Firm. Tor’s responsibilities include training IDCON consultants, product development, sales, and marketing. He gives advice to IDCON’s multi-site and international clients to ensure outcomes and deliverables are met.
Semiconductor devices are almost always part of a larger, more complex piece of electronic equipment. These devices operate in concert with other circuit elements and are subject to system, subsystem and environmental influences. When equipment fails in the field or on the shop floor, technicians usually begin their evaluations with the unit's smallest, most easily replaceable module or subsystem. The subsystem is then sent to a lab, where technicians troubleshoot the problem to an individual component, which is then removed--often with less-than-controlled thermal, mechanical and electrical stresses--and submitted to a laboratory for analysis. Although this isn't the optimal failure analysis path, it is generally what actually happens.
I use the term RCPE because it is a waste of good initiative and time to find the root cause of a problem but not fix it. I prefer the word problem to the word failure (the more common term is Root Cause Failure Analysis, RCFA) because the word failure often leads to a focus on equipment and maintenance. The word problem includes all operational, quality, speed, cost and other losses. Eliminating problems is a joint responsibility of operations, maintenance and engineering.
This paper presents an overview of an integrated process for system maintenance, fault diagnosis and support. The solution is based on Qualtech System, Inc.’s (QSI’s) TEAMS toolset for integrated diagnostics and involves several key innovations. As a showcase of the integrated solution, QSI, along with Antech Systems and Carnegie Mellon University (CMU), have recently completed a research project for the Information Technology Branch at the Naval Air Warfare Center–Aircraft Division (NAWC-AD) in St. Inigoes, MD. The entire system, termed ADAPTS (Adaptive Diagnostic And Personalized Technical Support), provides a comprehensive solution to integrated maintenance and training.
The power industry’s operating and maintenance practices were held up to intense regulator and public scrutiny when on November 6, 2007, a Massachusetts power plant’s steam-generating boiler exploded and three men died. The Department of Public Safety’s Incident Report investigation determined that the primary cause of the Dominion Energy New England’s Salem Harbor Generating Station Unit 3 explosion was extensive corrosion of boiler tubes
I was asked recently to give a second opinion on the cause of failure of an axial piston pump. The hydraulic pump had failed after a short period in service and my client had pursued a warranty claim with the manufacturer. The manufacturer rejected the warranty claim on the basis that the failure had been caused by contamination of the hydraulic fluid. The foundation for this assessment was scoring damage to the valve plate.
Root Cause Analysis has the potential of CHANGING people, IF the leader of the investigation knows of this potential. Far from “just another problem-solving exercise,” the root cause analysis should SLOW PEOPLE DOWN to the extent that they can see the truth of the incident under inquiry, WHATEVER THE TRUTH MIGHT BE. This paper focuses on two parts of our human nature which are large obstacles to root cause discovery, i.e., our unwillingness to slow down, and our unwillingness to let go of certain basic assumptions about life. Warning: This paper is designed to challenge the way you think about Root Cause Analysis.
A fault tree is constructed starting with the final failure and progressively tracing each cause that led to the previous cause. This continues till the trail can be traced back no further. Each result of a cause must clearly flow from its predecessor (the one before it). If it is clear that a step is missing between causes it is added in and evidence looked for to support its presence. Below is a sample fault tree for the moral story of the kingdom lost because of a missing horseshoe nail.