Why Considering the Likelihood of Failure is Bad for Business Risk

JD Solomon
Sep 5, 2022

Caterpillar mechanic servicing a productive engine long past its predicted Likelihood of Failure.

An incredible number of technical professionals believe that effective asset management can only be performed with a meaningful prediction of the likelihood of failure. This belief, in turn, has driven efforts over the past 30 years to collect massive amounts of historical data and information.

Nothing could be further from the truth. Evaluating the likelihood of failure is a fool's game, more wrong than right. As it turns out, small data is just as good as big data. And in the cases that matter most, small data (at best) is all we will have.

Complexity

A system is a collection of interrelated parts that produce an outcome that the individual parts cannot. By definition, complexity is a collection of many parts. Most industrial and human systems are complex.

A system's upside is that all interrelated parts do not have to perform optimally to produce the desired result. This is also the downside: it is difficult, if not impossible, to complete a predictive description of how each part or asset will fail. Context (operating conditions and environment) matters, the definition of failure matters, and many components fail with the same symptoms.

Sorting out which components cause which failures is necessarily resource intensive, and that assumes we can even pinpoint when the failure occurred. Most organizations do not expend the necessary resources with any degree of consistency. In other words, in practice, we have difficulty sorting through which components truly cause a larger system not to perform as the user desires.

Sample Size and Evolving Systems

Most systems have a limited number of major assets, and we do not let those run to failure. We simply do not have enough failure data on the things that matter most in their specific operating conditions and environment.

Major assets and their components are also constantly changing. This is partly driven by maintenance and engineering responsibilities to keep them performing. It is also the result of manufacturers improving the components that make up the system over the long life of most major systems.

So even if we let the things that matter most run to failure, there are differences in the evolving major assets and components that create sample size issues in the statistical analysis.
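To put rough numbers on the sample size problem, here is a minimal simulation sketch. It is an illustration only and rests on assumptions that are not from any real asset data: failures are random with a constant rate (exponential lifetimes), the true mean time between failures is 10 years, and the function name mtbf_estimate_spread is hypothetical. It shows how widely an estimate can swing when it is built from only a handful of failures.

```python
import random
import statistics


def mtbf_estimate_spread(true_mtbf, n_failures, trials=5000, seed=1):
    """Return the 5th and 95th percentiles of MTBF estimates.

    Each trial draws n_failures lifetimes from an exponential
    distribution (an assumed failure model, for illustration only)
    and estimates the MTBF as the sample mean of those lifetimes.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(trials):
        lifetimes = [rng.expovariate(1.0 / true_mtbf) for _ in range(n_failures)]
        estimates.append(statistics.mean(lifetimes))
    estimates.sort()
    low = estimates[int(0.05 * trials)]
    high = estimates[int(0.95 * trials)]
    return low, high


if __name__ == "__main__":
    # Assumed true MTBF of 10 years; compare small and large failure samples.
    for n in (3, 10, 100):
        low, high = mtbf_estimate_spread(true_mtbf=10.0, n_failures=n)
        print(f"{n:>3} failures: 90% of estimates fall between "
              f"{low:.1f} and {high:.1f} years")
```

Under these assumptions, the band of plausible estimates from three failures is several times wider than the band from one hundred failures, and few major assets ever give us one hundred comparable failures.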

Failure Reporting

Failure reporting changes over the life of an organization and over the life of major assets. Facilities within the same organization frequently report failures differently than their peer facilities. In some sectors, such as the Department of Defense, failure reporting is done meticulously in normal times but often ignored when under the most relevant stress of deployment.

In other words, failure reporting, interpretation, and presentation can vary greatly among different users of the same systems, as well as among third-party service companies and manufacturers.

We Are Doing a Good Job

A contradiction in collecting large sets of failure data is frequently cited by Reliability Centered Maintenance (RCM) experts and credited to Howard Resnikoff, a former director of the Division of Information Science and Technology at the National Science Foundation. Resnikoff's conundrum states that if we perform our work well, then we are suppressing the very failure data we need to build a statistically accurate failure model.

In other words, we may be collecting lots of failure data, but it is the wrong data (the data that doesn't matter much) because we are not allowing the things that matter the most to fail often enough.
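Resnikoff's conundrum can be illustrated with a short simulation sketch. Everything in it is an assumption for illustration: lifetimes follow a Weibull distribution with a 10-year scale and a wear-out shape of 3, the fleet has 20 critical assets, and the function name observed_failures is hypothetical. The point is only to show how planned replacement removes failure records from the data set.

```python
import random


def observed_failures(n_assets, replace_at, scale=10.0, shape=3.0, seed=2):
    """Count assets that fail before their planned replacement age.

    Lifetimes are drawn from a Weibull distribution (an assumption for
    illustration). An asset replaced before it fails produces no failure
    record, only the knowledge that it survived to replace_at.
    """
    rng = random.Random(seed)
    failures = 0
    for _ in range(n_assets):
        lifetime = rng.weibullvariate(scale, shape)
        if lifetime < replace_at:
            failures += 1
    return failures


if __name__ == "__main__":
    n = 20  # assumed number of critical assets in the fleet
    for replace_at in (float("inf"), 8.0, 5.0, 3.0):
        count = observed_failures(n, replace_at)
        if replace_at == float("inf"):
            label = "run to failure"
        else:
            label = f"replace at {replace_at:g} yr"
        print(f"{label:>20}: {count} of {n} assets produce a failure record")
```

The earlier the planned replacement, the fewer failure records remain. Under these assumptions, the better the maintenance program protects the assets, the less failure data it leaves behind to model.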

How Does This Help You

You are doing the right thing if you are collecting monitoring data from your most critical systems at the right locations based on the way you believe things fail. That means you have probably performed a criticality analysis, performed a failure modes and effects analysis (FMEA), and developed some type of finite element model before installing the data collection devices.

You are also doing the right thing if you are populating your enterprise asset management system with the best failure information you can and analyzing it for obvious trends in system performance. Hopefully, you are also experimenting and tweaking your system maintenance protocols and frequencies.

You are doing the WRONG thing if you make rigorous attempts to drive your asset management program by believing that you can predict the likelihood of failure correctly and consistently. There is not enough data to make a statistically accurate prediction because we do not consistently spend the resources to understand component level failures, we collect and report data inconsistently, critical systems are constantly evolving, and we do a good job of preventing failure in the things that matter most (so most of the failure data we do have are from things that do not matter as much).

This leaves us in the analytical world of small data combined with good judgment. The next time you think you can accurately predict the likelihood of failure, bet on when anything of importance, or even your hot water heater or car, will fail.

JD Solomon Inc provides reliability and risk assessments through our asset management services. Contact us for more information on how to streamline and make your risk-based asset management program more effective.
