Ask the Experts: Reliability and Risk Assessments
“Commercial Airplanes Are Reliable” is US Coast Guard Electrician’s Mate Second Class Paul Frantz’s artistic interpretation of the outcome of reliability and risk assessments. American commercial aviation is, by huge margins, the safest way to travel. Continually addressing reliability and risk is the reason, and the same approach works in every industry sector.
This month’s Ask the Experts is a one-on-one lightning round with Alison Adams. Dr. Adams has worked in Florida on large-scale facilities and natural resources projects for over 30 years. As the long-time Chief Technical Officer for Tampa Bay Water (TBW), she was instrumental in evaluating reliability and risk for major capital improvement and asset management programs that redefined how the greater Tampa Bay region met growth and regulatory challenges. Alison has directed research into climate variability and is seen as a national leader on its potential effects on facilities and infrastructure.
We were pleased to have Paul Crocker moderate the conversation. Paul leads the maintenance and reliability services for the drinking water facilities at the Kansas City Board of Public Utilities (BPU), a 100-year-old utility that provides safe, dependable water and electric services in Kansas City, Kansas. Paul is seen as a national thought leader in his field of practice; his practitioner credentials include Certified Reliability Leader (CRL), Certified Maintenance and Reliability Professional (CMRP), KDHE Class IV Water Operator, and ABC Class III Plant Maintenance Technologist.
Do the terms “reliability” and “risk” mean different things to different people?
AA: Yes. We need to look at all aspects of each before we use the terms interchangeably.
JD: Yes. And people tend to invoke either term to justify something they want to do. Or sometimes something they do not want to do.
AA: A lot of times an organization will use the term reliability too broadly and out of context. They use it as a justification to do a lot of things that have nothing to do with reliability.
JD: Getting the definitions established up front is important in every reliability and risk assessment.
What aspects need to be considered when doing an assessment for an existing facility or system?
JD: Understand the original basis of design, understand the current conditions, and understand the future conditions.
AA: I would add to that: understanding what your system is.
JD: Over the years, many different engineers, contractors, and administrations have usually done things differently. It takes work to understand the true reliability of an existing facility.
AA: Understanding what you know and don’t know. Hopefully you have a good understanding of your system. It’s important, though, to understand what you don’t know and be willing to look into those unknowns.
JD: I love the part about knowing what you don’t know and being willing to address it. Often we gravitate to only what we know, and that is a form of bias. It takes work to understand what you don’t know.
Do you think the culture of the organization is important when you do reliability and risk assessments?
AA: Yes. It’s not just about the equipment. It’s about the people as well. Participation is critical to success.
JD: Organizational culture is important because it takes a willingness to do the work.
AA: The culture of the organization is critical.
JD: We can’t build a system that people can’t operate.
What aspects are most frequently overlooked or undervalued?
AA: The staff. The assessment process should include the staff as well as equipment and physical assets.
JD: We often tend to drill down on the equipment and overlook the human aspects.
AA: Training, education, experience, standard operating procedures should all be considered in the assessment. We often tend to undervalue humans in the evaluation.
JD: Over the past decade, I have changed my approach to include human factors and human error; a lot of that comes down to standard operating procedures. In the beginning, I focused mostly on the equipment, but most of our reliability and risk issues are not solely equipment related. The approach has to integrate both the equipment and the human elements to be successful.
AA: It is important to understand that it goes beyond just the reliability and risk assessment aspects. Many organizations are struggling with succession planning aspects. The documentation is poor, and the younger generation does not want to spend 10 to 15 years learning the job from the ground up. You need to have the documentation in place so they can learn the job in a few years and move on.
Who should be included in the assessment?
JD: Cross-functional team. Engineering, operations, maintenance, health and safety, finance, and probably some input from human resources.
AA: I have one word - everyone.
JD: Operators and maintenance professionals understand how the system really works and how well it can perform. I am an engineer too, but I freely admit that engineers often get optimistic about what we can do with a system on paper. Operations and maintenance best understand system capabilities.
AA: Risk and reliability involve everyone in the organization.
Who should lead the assessment – engineering, operations, a consultant, some other entity?
AA: I feel very strongly that the effort needs an internal organizational champion. But that person cannot do it with only internal resources. You really need outside help, especially to avoid internal groupthink. Most organizations also don’t have all the knowledge and experience they need inside.
JD: Engineering needs to lead the process, despite what I just said. Engineers are usually the best at the technical analysis and forecasting, and normally lead the master planning. I agree with Alison too that it needs to be a “big picture” perspective, and not just a single discipline perspective. I really think it needs to be someone with formal reliability or risk management training.
AA: Yes, it needs someone with a larger perspective. Someone who has some training, experience, knowledge or love of system-type approaches for problem solving. They are looking at risk and reliability across an organization.
JD: You also mentioned consultants, and I think everything that was said about the outside view, the additional experience, and even the ability to provide some benchmarking against others is important. That said, I often see organizations rely too much on consultants.
Are the leaders of the reliability and risk assessment effort also the facilitators?
AA: The facilitator is a unique person and not usually the leader; it is often difficult for an insider to be neutral and objective.
JD: This is an important role for a consultant. It is much easier for a consultant to bring a neutral perspective.
AA: It is often a good role for an outside consultant.
JD: This type of facilitation needs someone who knows what they are talking about. The facilitator does not necessarily need to be a subject matter expert, but it can’t be some general facilitator from the local community college either.
AA: That is exactly right. The facilitator needs to understand the basic concepts and what people in the room are saying. They need to be able to throw the ‘BS flag’ on the floor if needed. Otherwise, you are wasting everyone’s time, and the assessment will not be effective.
How important is framing the problem and establishing system boundaries?
JD: Very important. There are two kinds: geographic boundaries, like fence lines, and operational boundaries, like adding chemicals at a reservoir to assist in final treatment at a water plant.
AA: I totally agree. I will add that the organization does not need to get overwhelmed by too large a problem frame. That will create paralysis.
JD: I agree with that, too. Sometimes you find that there is actually more than one issue when you establish system boundaries and frame the problem. A number of times I have worked with a group to turn one assessment into two or three so that the group does not get overwhelmed.
AA: When you do that you also can get some quick wins that build confidence and understanding. It helps the group to build momentum for the larger reliability and risk issues.
Should impacts to future operating conditions, such as climate change, changing regulations, and changing customer expectations be considered?
AA: Yes, you should. I will caveat that by saying that you first need to understand how your system works today. Then you can move to future conditions.
JD: Changing future inputs need to be included and modeled.
AA: To do the capital planning and asset management you must be able to look at 20, 30, or 40 years and the things that will impact your system. Do you need different types of electrical systems, are your inputs changing, do you need more flexibility, do you need different types of data and data collection?
JD: What tricks people are the changing human factors, like regulations. Many of those get delayed as you get closer to them. Many of the physical aspects, while variable, are more predictable.
AA: You really need to think about things in terms of flexibility: which parts of my system can be designed and constructed so they are more easily changed out?
Anticipating different future operating conditions is tricky. What are some approaches or tools that you have used to do this?
JD: Two ‘softer’ ones are structured brainstorming and scenario planning. I usually start an assessment with structured brainstorming and then, as we move toward consensus, pull back and add some scenario planning to ground-truth that we are being broad enough. In some cases, I start with scenario planning and then use the brainstorming as a ground-truthing exercise.
AA: I feel strongly that everyone needs a process model of their system. It should include all of the inputs, outputs, and interfaces. It takes time and effort, but you need to fundamentally understand how the existing system works.
JD: I agree. Quantitatively I use Monte Carlo analysis to better understand the range of future uncertainty.
AA: Agree. And once you have your process models built, it is relatively easy to run tens of thousands or hundreds of thousands of different iterations with Monte Carlo.
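To make the Monte Carlo idea concrete, here is a minimal Python sketch. It assumes a hypothetical toy process model in which a 20-year supply margin (capacity minus demand) depends on two uncertain inputs, annual demand growth and annual capacity loss. The function names (`supply_margin`, `run_monte_carlo`, `summarize`), the baseline values, and the input distributions are all invented for illustration; they are not drawn from Tampa Bay Water, BPU, or any real system. A real assessment would swap in the organization’s own process model and calibrated inputs.

```python
import random
import statistics


def supply_margin(demand_growth, capacity_loss,
                  base_demand=100.0, base_capacity=130.0, years=20):
    """Hypothetical toy process model: capacity minus demand after `years`.

    Units are arbitrary; every default here is an illustrative assumption.
    """
    demand = base_demand * (1 + demand_growth) ** years
    capacity = base_capacity * (1 - capacity_loss) ** years
    return capacity - demand


def run_monte_carlo(n_iterations=50_000, seed=42):
    """Sample the uncertain inputs many times and collect the resulting margins."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    margins = []
    for _ in range(n_iterations):
        growth = rng.gauss(0.01, 0.005)  # assumed demand growth: mean 1%/yr, sd 0.5%
        loss = rng.uniform(0.0, 0.01)    # assumed capacity loss: 0% to 1%/yr
        margins.append(supply_margin(growth, loss))
    return margins


def summarize(margins):
    """Shortfall probability plus the 5th, 50th, and 95th percentile margins."""
    cuts = statistics.quantiles(margins, n=20)  # 19 cut points in 5% steps
    return {
        "p_shortfall": sum(m < 0 for m in margins) / len(margins),
        "p05": cuts[0],
        "p50": cuts[9],
        "p95": cuts[18],
    }


if __name__ == "__main__":
    result = summarize(run_monte_carlo())
    print(f"Probability of a shortfall in year 20: {result['p_shortfall']:.1%}")
    print(f"Margin, 5th to 95th percentile: {result['p05']:.1f} to {result['p95']:.1f}")
```

The point of the design is the one the experts make: once the process model exists, running tens of thousands of iterations is just a loop, and the output is a range (a shortfall probability and percentile band) rather than a single-point forecast.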
AA: I also like digital twins as an emerging tool in many industries. They are a good way for operators to test the system and understand how it is vulnerable to breaking.
What is the best single tip that you would give someone who intends to perform a reliability and risk assessment for their existing facility(s)?
AA: Leadership by top management.
JD: Mine is to have the big-picture leader (the ‘systems-type thinker’) involved, preferably with formal reliability and risk training.
AA: The next most important is total engagement across the organization.
JD: That would be my second one too. Cross-functional engagement across the organization.
1. You can get 15 or 20 different definitions of reliability and risk in just about any large group you are in.
2. You could make an airplane out of the best materials and equipment, including multiple layers of redundancy, but it may not get off the ground. You must assess and align your system with the right level of reliability and risk for the context and the people.
3. We had an old plant and a long-time staff when I joined. We built a new plant with high automation. What got missed, and what we have had to do, is to develop the procedures for the new generation to learn.
4. Digital twins are a powerful tool for operators to understand the reliability and risk associated with their systems. They are also a powerful training tool.
5. In medium and small organizations, it is both difficult and important to create the reliability, risk and asset management roles.