Unleash the Potential of Your Reliability Function

Obaidullah A. Syed

There is no doubt early reliability techniques evolved to strategically handle plant maintenance. Predicting failures in advance and allocating sufficient resources helped the industry mitigate the impact of unexpected failures and unplanned outages. The early techniques also prevented maintenance folks from patching up the failures and demanded detailed root cause studies so problems leading to failures are fixed once and for all. But after many years, the function of reliability is still so tightly married to maintenance that it is often perceived to be the only combination that can unlock all challenges related to an asset. But can maintenance alone handle all aspects of reliability throughout the lifecycle of an asset? What if the asset has inherent design flaws or inadequate commissioning procedures? What if it is being operated outside of its operating parameters? Such issues are related to engineering and operations, which are outside of the maintenance scope.

The primary function of maintenance, which takes precedence over all other roles, is “firefighting.” When equipment breaks down, the maintenance team is expected to return it to service immediately so production can be restored. The function of reliability has nothing to do with this firefighting approach. Thinking of reliability as an improved maintenance practice does not do justice to this function and limits the imagination to what it can truly deliver. It is very common in the industry to interchange the term maintenance engineer with reliability engineer without much thought. The majority of reliability engineers are still perceived as “smart” maintenance engineers or engineers who deal with maintenance-related issues of high dollar value assets, such as compressors and large rotating equipment. It almost seems inconceivable, even to many industry professionals, that a reliability engineer has a much broader goal and is better off being an independent entity overseeing engineering, operations and maintenance (EOM) to embed reliability at every level. Just as the safety group is effective in incorporating its policies and programs within EOM’s routine business, the notion of having an independent reliability group doing the same for reliability programs and initiatives should not be alien at all.

Think of it this way: If you want a reliable car, you must first ensure that the car is built to be reliable. You then make sure it is operated as it was intended to be during its design. Lastly, you focus on maintenance and take measures to do it right and on time. All three, the designer, the operator and the maintainer, must adhere to reliability to have a reliable car. The same philosophy applies to the plant’s assets. Maintenance alone should not be deemed responsible for their reliability. Just like a car, equipment only can be truly reliable if the engineering team ensures reliability during the procurement and commissioning phases, the operations crew operates it as per the standard operating procedures without exceeding the operating envelope, and the maintenance folks exercise due diligence in maintaining reliability with quality workmanship.

The single entity that can ensure the collaboration within these three disciplines and oversee reliability throughout the entire lifecycle of an asset is the function of a reliability engineer. The reliability engineer does not necessarily have to be a mechanical engineer. This role also can be performed by an electrical or instrumentation and control (I&C) engineer having sufficient industry experience and sound knowledge of reliability engineering tools and techniques. However, for a reliability program to be effective in a plant, one must realize this is not a one-man show. Establishing an independent group to launch a comprehensive reliability program is recommended. The group must consist of several reliability engineers who are preferably a mix of mechanical, electrical and I&C engineers; computerized maintenance management system (CMMS) experts; senior technicians; and at least one representative each from EOM. The group must report to the entity having authority over EOM. This is important since reliability mandates changes and enhancements in the traditional style of work and, as commonly known, the change is always resisted and often counterattacked. Trying to accomplish this change laterally by having the reliability group part of maintenance will only make this task difficult and longer, if not implausible.

Figure 1: Reliability function as an independent external entity focusing on the niche areas of Engineering, Operations, and Maintenance (EOM) to improve asset reliability throughout its lifecycle *(depicted from the book “Good to Great” by Jim Collins)*

The reliability group must focus on strategic programs and management must realize that such programs require long-term commitment and support in order to produce expected results. Some recommended strategic programs and analytical activities that are labor-intensive and can be undertaken by the reliability group include:

Reliability centered maintenance (RCM) program covering seven questions of RCM methodology.
Develop preventive maintenance (PM) procedures and optimize utilizing failure modes.
Implement predictive maintenance (PdM) technology and develop a continuous monitoring program. Some examples are online/off-line vibration monitors, ultrasound measurement devices, thermographs, oil condition monitoring and smart instrumentation online diagnostics.
Operator driven reliability (ODR) program focusing on the operations role to enhance asset reliability. This may include detailed operators’ checklists for visual, audio, smell and feel tests, review and enhancement of standard operating procedures (SOP), accurate integrity operating window (IOW) for all equipment, simple handheld devices to collect data for off-line predictive maintenance and minor maintenance tasks, like tightening up loose bolts with basic tools.
Failure mode and effects analysis (FMEA) or failure mode, effects and criticality analysis (FMECA).
Root cause analysis (RCA) or root cause failure analysis (RCFA) covering effective failure reporting and close tracking of recommendations until fully implemented.
Develop standard job plans (SJP) for better planning and scheduling of work orders with accurate resource handling.
Participation in process hazard analysis (PHA), like a hazard and operability study (HAZOP).
Layer of protection analysis (LOPA) or other qualitative analyses for safety instrumented systems.
Safety instrumented system (SIS) lifecycle management covering all phases from cradle to grave and compliance with industry and company standards. Safety instrumented function (SIF) performance, like actual demand rate, detected failure rates, proof test compliance, diagnostics, etc., also should be part of this program.
Functional testing procedures and relevant documentation for non-SIS related equipment.
Initiation and tracking of a lessons learned database for reliability.
Review of capital project packages with reliability enhancement recommendations.
Bad actors identification, tracking and replacement program.
Advanced reliability analyses, including, but not limited to, Weibull analysis, Markov modeling, lean and/or Six Sigma study and reliability, availability and maintainability (RAM).
Obsolete equipment tracking and systematic replacement program.
Ad hoc site visits to witness the operations and maintenance work with the intention of issuing recommendations for identified gaps.
Random checks for CMMS or systems, applications and products (SAP) data entry and quality.
Single point of failure (SPF) identification and enhancement.
Design for reliability (DFR) program and related studies.
Reliability performance metrics for developing, tracking and enhancing leading and lagging key performance indicators (KPIs). Examples include mean time between failures (MTBF), mean time to repair (MTTR), overall equipment efficiency (OEE), equipment availability and equipment probability of failure on demand (PFD).

This is by no means a comprehensive list, but each organization can customize it with additional programs based on its size and resources. Several programs listed are considered “living” and require a twofold approach to be fruitful. First, develop a detailed scope of work covering the feasibility and implementation requirements for management support and approval. Second, execute the program and be on the lookout for areas of improvement while constantly bridging any gaps identified.

In conclusion, the function of reliability has evolved immensely over the years. Allowing reliability to operate independently, focusing on its core strengths without getting consumed by daily firefighting work from maintenance will take this function to another dimension of ingenuity. Empowering the reliability group to have jurisdiction over engineering, operations and maintenance will ease and expedite the part of change management while shortening the implementation period through improved collaboration. This is essential since many reliability initiatives fail during the execution phase when management does not see results for an extended period and loses interest. When placed correctly in an organization, along with the necessary expertise, resources and authority, the function of reliability will demonstrate the true potential it has in improving asset reliability during its lifecycle and achieving a robust reliability culture throughout the facility.

Disclaimer: The author’s views do not necessarily reflect the views of his employers, colleagues, or any professional societies in which he is affiliated.

Obaidullah A. Syed

Obaidullah A. Syed, CFSP, CMRP, P.E., has nearly 20 years of experience in the controls and automation industry, primarily serving oil and gas companies. He worked as a consultant engineer in the United States for over 10 years before moving to Saudi Arabia in 2007, using his expertise to lead safety and reliability engineering related programs.

How to Fix the 70/30 Phenomenon

When you ask front line supervisors or team leaders if all people in their teams are performing to the same standards or if some are doing more work and achieving more results than others, you will often get the same answer. All over the world, the most common answer, after some analysis, verifies that about 30% of the people do 70% of the work.

Zen and the Art of Managing Maintenance

Unfettered expression and spiritual satisfaction? How does this relate to managing a maintenance department, especially one in the U.S. Postal Service? Open your mind. Take a page from the Zen Buddhist monks who preach: When you are quiet and listen, you become aware of sounds not normally heard. USPS maintenance leaders are listening and beginning to understand that maintenance success doesn't come through closed minds and closed doors.

Why do maintenance improvement initiatives fail to deliver? (Hedgehog or Fox?)

It is not uncommon that many reliability and maintenance improvement initiatives fail to deliver expected results. Why is it so? Some of the most common causes I have observed include:

Why Maintenance Improvement Efforts Fail

Why do improvement efforts fail or perhaps not sustain the gains? There are many reasons, but those most often stated are “lack of commitment” and not “following the process”. But why is there lack of commitment, and why aren’t processes followed? Here are a few of the reasons that I’ve seen:

TPM and RCM: Whirled Class

When a piece of production machinery broke down at the Whirlpool plant in Findlay, Ohio, several years back, it was accepted practice for the machine operator to call maintenance and then sit back and wait for the problem to be fixed. Critical information and knowledge was not shared between the operator and maintenance technician. Like many companies, these workers were stuck in traditional roles - operators run the machines, maintenance fixes the machines, and the two do not cross. As a result, productivity opportunities were missed.

Where Do Maintenance Professionals Come From?

Many managers are unaware that best-in-class companies routinely design-out maintenance at the inception of a project. That, clearly, is the first key to highest equipment reliability and plant profitability. Whenever maintenance events occur as time goes on, the real industry leaders see every one of these events as an opportunity to upgrade. Indeed, upgrading is the second key, and upgrading is the job of highly trained, well-organized, knowledgeable reliability professionals.

TPM and Tecate: The New Translation

The true translation — might it be proper to say a new and improved translation? — is being used today by Cervecería Cuauhtemoc Moctezuma, one of the largest brewers of beer in Latin America. Known throughout this company as Mantenimiento Alto Desempeño (MAD), or translated as High-Performance Maintenance, the concept of TPM is alive and well at the company's six plants in Mexico. Perhaps the best example is at CCM's brewery in Tecate, located a short drive from the U.S.-Mexico border on the Baja California peninsula.

Unleash the Potential of Your Reliability Function