There is no doubt early reliability techniques evolved to strategically handle plant maintenance. Predicting failures in advance and allocating sufficient resources helped the industry mitigate the impact of unexpected failures and unplanned outages. The early techniques also prevented maintenance folks from patching up the failures and demanded detailed root cause studies so problems leading to failures are fixed once and for all. But after many years, the function of reliability is still so tightly married to maintenance that it is often perceived to be the only combination that can unlock all challenges related to an asset. But can maintenance alone handle all aspects of reliability throughout the lifecycle of an asset? What if the asset has inherent design flaws or inadequate commissioning procedures? What if it is being operated outside of its operating parameters? Such issues are related to engineering and operations, which are outside of the maintenance scope.
The primary function of maintenance, which takes precedence over all other roles, is “firefighting.” When equipment breaks down, the maintenance team is expected to return it to service immediately so production can be restored. The function of reliability has nothing to do with this firefighting approach. Thinking of reliability as an improved maintenance practice does not do justice to this function and limits the imagination to what it can truly deliver. It is very common in the industry to interchange the term maintenance engineer with reliability engineer without much thought. The majority of reliability engineers are still perceived as “smart” maintenance engineers or engineers who deal with maintenance-related issues of high dollar value assets, such as compressors and large rotating equipment. It almost seems inconceivable, even to many industry professionals, that a reliability engineer has a much broader goal and is better off being an independent entity overseeing engineering, operations and maintenance (EOM) to embed reliability at every level. Just as the safety group is effective in incorporating its policies and programs within EOM’s routine business, the notion of having an independent reliability group doing the same for reliability programs and initiatives should not be alien at all.
Think of it this way: If you want a reliable car, you must first ensure that the car is built to be reliable. You then make sure it is operated as it was intended to be during its design. Lastly, you focus on maintenance and take measures to do it right and on time. All three, the designer, the operator and the maintainer, must adhere to reliability to have a reliable car. The same philosophy applies to the plant’s assets. Maintenance alone should not be deemed responsible for their reliability. Just like a car, equipment only can be truly reliable if the engineering team ensures reliability during the procurement and commissioning phases, the operations crew operates it as per the standard operating procedures without exceeding the operating envelope, and the maintenance folks exercise due diligence in maintaining reliability with quality workmanship.
The single entity that can ensure the collaboration within these three disciplines and oversee reliability throughout the entire lifecycle of an asset is the function of a reliability engineer. The reliability engineer does not necessarily have to be a mechanical engineer. This role also can be performed by an electrical or instrumentation and control (I&C) engineer having sufficient industry experience and sound knowledge of reliability engineering tools and techniques. However, for a reliability program to be effective in a plant, one must realize this is not a one-man show. Establishing an independent group to launch a comprehensive reliability program is recommended. The group must consist of several reliability engineers who are preferably a mix of mechanical, electrical and I&C engineers; computerized maintenance management system (CMMS) experts; senior technicians; and at least one representative each from EOM. The group must report to the entity having authority over EOM. This is important since reliability mandates changes and enhancements in the traditional style of work and, as commonly known, the change is always resisted and often counterattacked. Trying to accomplish this change laterally by having the reliability group part of maintenance will only make this task difficult and longer, if not implausible.
Figure 1: Reliability function as an independent external entity focusing on the niche areas of Engineering, Operations, and Maintenance (EOM) to improve asset reliability throughout its lifecycle (depicted from the book “Good to Great” by Jim Collins)
The reliability group must focus on strategic programs and management must realize that such programs require long-term commitment and support in order to produce expected results. Some recommended strategic programs and analytical activities that are labor-intensive and can be undertaken by the reliability group include:
- Reliability centered maintenance (RCM) program covering seven questions of RCM methodology.
- Develop preventive maintenance (PM) procedures and optimize utilizing failure modes.
- Implement predictive maintenance (PdM) technology and develop a continuous monitoring program. Some examples are online/off-line vibration monitors, ultrasound measurement devices, thermographs, oil condition monitoring and smart instrumentation online diagnostics.
- Operator driven reliability (ODR) program focusing on the operations role to enhance asset reliability. This may include detailed operators’ checklists for visual, audio, smell and feel tests, review and enhancement of standard operating procedures (SOP), accurate integrity operating window (IOW) for all equipment, simple handheld devices to collect data for off-line predictive maintenance and minor maintenance tasks, like tightening up loose bolts with basic tools.
- Failure mode and effects analysis (FMEA) or failure mode, effects and criticality analysis (FMECA).
- Root cause analysis (RCA) or root cause failure analysis (RCFA) covering effective failure reporting and close tracking of recommendations until fully implemented.
- Develop standard job plans (SJP) for better planning and scheduling of work orders with accurate resource handling.
- Participation in process hazard analysis (PHA), like a hazard and operability study (HAZOP).
- Layer of protection analysis (LOPA) or other qualitative analyses for safety instrumented systems.
- Safety instrumented system (SIS) lifecycle management covering all phases from cradle to grave and compliance with industry and company standards. Safety instrumented function (SIF) performance, like actual demand rate, detected failure rates, proof test compliance, diagnostics, etc., also should be part of this program.
- Functional testing procedures and relevant documentation for non-SIS related equipment.
- Initiation and tracking of a lessons learned database for reliability.
- Review of capital project packages with reliability enhancement recommendations.
- Bad actors identification, tracking and replacement program.
- Advanced reliability analyses, including, but not limited to, Weibull analysis, Markov modeling, lean and/or Six Sigma study and reliability, availability and maintainability (RAM).
- Obsolete equipment tracking and systematic replacement program.
- Ad hoc site visits to witness the operations and maintenance work with the intention of issuing recommendations for identified gaps.
- Random checks for CMMS or systems, applications and products (SAP) data entry and quality.
- Single point of failure (SPF) identification and enhancement.
- Design for reliability (DFR) program and related studies.
- Reliability performance metrics for developing, tracking and enhancing leading and lagging key performance indicators (KPIs). Examples include mean time between failures (MTBF), mean time to repair (MTTR), overall equipment efficiency (OEE), equipment availability and equipment probability of failure on demand (PFD).
This is by no means a comprehensive list, but each organization can customize it with additional programs based on its size and resources. Several programs listed are considered “living” and require a twofold approach to be fruitful. First, develop a detailed scope of work covering the feasibility and implementation requirements for management support and approval. Second, execute the program and be on the lookout for areas of improvement while constantly bridging any gaps identified.
In conclusion, the function of reliability has evolved immensely over the years. Allowing reliability to operate independently, focusing on its core strengths without getting consumed by daily firefighting work from maintenance will take this function to another dimension of ingenuity. Empowering the reliability group to have jurisdiction over engineering, operations and maintenance will ease and expedite the part of change management while shortening the implementation period through improved collaboration. This is essential since many reliability initiatives fail during the execution phase when management does not see results for an extended period and loses interest. When placed correctly in an organization, along with the necessary expertise, resources and authority, the function of reliability will demonstrate the true potential it has in improving asset reliability during its lifecycle and achieving a robust reliability culture throughout the facility.
Disclaimer: The author’s views do not necessarily reflect the views of his employers, colleagues, or any professional societies in which he is affiliated.