Resilience Roundup - Resilience Engineering New directions for measuring and maintaining safety in complex systems Part 2 - Issue #45

Welcome back. Last time we talked about a few theories that had started to form the foundation of accident investigation and theory. Today we’re going to continue, but move forward in time.

Remember that this isn’t just a history lesson, and don’t worry, there’s no pop quiz at the end. But knowing that models and theories have been put forth and have evolved can help us evaluate our own organizations. Where are we in that evolution of thinking? How can we move forward and integrate new ideas?


Resilience Engineering: New directions for measuring and maintaining safety in complex systems

Latent Failure model

Next is the idea of “Latent Failures” (which we’ve talked a bit about before). Also sometimes called the Swiss cheese model, the idea is that barriers exist at multiple levels, so that for an accident to occur many barriers would have to be breached.

Each barrier potentially has holes, but with many barriers the odds that something can pass through a hole in every one of them are slim.

The latent failure model, developed by James Reason, advocates the idea that some number of failures are already present in the system, but it’s only when they build up past a certain level or interact in a particular way that an accident occurs. Reason likens it to pathogens in the body: you always have them, but it takes some other event, like them building up to a certain degree, for you to get sick.
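
As a rough illustration of that arithmetic, here is a minimal sketch in Python. It assumes each barrier lets a hazard through independently and with a fixed probability, which is a simplification that is not part of the original model description. The function name and the numbers are hypothetical.

```python
from math import prod

def breach_probability(hole_probabilities):
    """Chance that a hazard passes every barrier.

    Assumption: the holes in each barrier line up independently,
    so the individual probabilities can simply be multiplied.
    """
    return prod(hole_probabilities)

# Hypothetical numbers: each barrier lets a hazard through 10% of the time.
one_barrier = breach_probability([0.1])
four_barriers = breach_probability([0.1, 0.1, 0.1, 0.1])

print(f"one barrier:   {one_barrier:.4f}")
print(f"four barriers: {four_barriers:.4f}")
```

With these made-up numbers, going from one barrier to four takes the chance of a breach from 1 in 10 to about 1 in 10,000. The independence assumption is doing all the work here, and real barriers often share weaknesses.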

This model acknowledges that there are multiple contributing factors to an accident, some of which may have been present for a long time prior (latent). Each latent failure is something that produces some sort of negative effect, the consequences of which are invisible until it interacts with some other failure or failures.

The model also defines “active failures,” typically those things where the negative effect is obvious right away, often defined as an “unsafe act.”

Unfortunately, the latent failure model still perpetuates the idea that some component in the system must be broken for an accident to occur. It also provides more relabeling of human error, with holes in defenses sometimes being attributed to “deficient training” or “poor procedures,” neither of which explains anything.

Ideally whatever model of accident progression you use would be able to point you in the direction of how to minimize negative outcomes. In the latent failure model, it’s said that the best way to do this is to get better at detecting latent failures and taking them seriously when they’re found. This sounds like a great idea, but it’s never explained how one does this. Looking at a complex system as a series of layers doesn’t do anything to explain how the latent failures are created, so there isn’t any direction to look toward when trying to detect them.

Normal Accident Theory

Normal accident theory was formed in the mid-1980s by Charles Perrow.

Systems with a lot of defensive barriers, like medicine or aviation, are pretty well protected against single points of failure. The paradox, according to Perrow, is that because of the complexity induced by these defenses and their ability to limit visibility into the system, it’s much harder to see the beginnings of an accident and also difficult to stop it once it starts.

The things that make the system reliable become some of the things that make it complex. These are systems that are very large, have many specializations (so it can take a long time to learn a particular area), and are also tightly coupled, such that a change in one area directly affects another.

Perrow moved away from the idea of any individual thing, whether a person, a component, or something else, causing an accident. Instead, he held that “system accidents” are caused by the interaction of many things.

Though the accidents themselves may come from surprising interactions across various parts of the coupled system, normal accident theory tells us that the fact that accidents occur should be unsurprising. The more tightly coupled and the more complex a system is, the more likely it is to suffer a “normal” accident.

Types of system interactions

It’s important to differentiate here between two types of system interactions, linear and complex. Perrow gives some contrasting examples:

Complex Systems | Linear Systems
Tight spacing of equipment | Equipment spread out
Proximate production steps | Segregated production steps
Many common-mode connections of components not in production | Common-mode connections limited to power supply and environment
Limited isolation of failed components | Easy isolation of failed components
Unfamiliar and unintended feedback loops | Few unfamiliar and unintended feedback loops
Indirect or inferential information sources | Direct, on-line information sources
Personnel specialization limits awareness of dependencies | Less personnel specialization
Limited understanding of some processes | Extensive understanding of all processes

In this view, systems can either be linear or complex. But they can also be tightly coupled or loosely coupled. Perrow again provides some contrasting examples:

Tight coupling | Loose coupling
Delays in processing not possible | Processing delays possible
Invariant sequences | Order of sequences can be changed
Buffers and redundancies exist but are limited to what has been deliberately designed in | Buffers and redundancies available
Only one method to achieving goal | Alternative methods available

Perrow saw the demands of these two properties being at odds as a big problem. He believed that a system with high interactive complexity could only cope with it well by having a decentralized organization. On the other hand, a system that was tightly coupled needed a centralized organization.

What should be done, then, when a system is both interactively complex and tightly coupled? That is where the problem lies in Perrow’s view, since an organization can’t be both centralized and decentralized at the same time. Under this view, systems that occupy that space can’t be controlled well.
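
Perrow’s dilemma can be laid out as a small lookup over the two dimensions. The sketch below is only an illustration in Python. The function names and the string labels are hypothetical, and treating each dimension as a simple yes/no flag is an assumption made to keep the example short.

```python
def organizational_demands(interactively_complex, tightly_coupled):
    """Return the organizational structures a system calls for.

    Encodes only the two claims from the text:
    high interactive complexity calls for decentralization,
    tight coupling calls for centralization.
    """
    demands = set()
    if interactively_complex:
        demands.add("decentralized")
    if tightly_coupled:
        demands.add("centralized")
    return demands

def is_conflicted(demands):
    """True when both structures are demanded at the same time."""
    return {"centralized", "decentralized"} <= demands

# Walk through all four combinations of the two dimensions.
for is_complex in (False, True):
    for is_tight in (False, True):
        needs = organizational_demands(is_complex, is_tight)
        print(is_complex, is_tight, sorted(needs), is_conflicted(needs))
```

Only the combination where both flags are true produces a conflict, which is the space Perrow argued can’t be controlled well.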

Of course, an organization can be centralized in some places and decentralized in others. It can be centralized in how it sets and distributes policy and procedure, but still allow decentralized decision making in the field. EMS comes to mind for this. There are a lot of procedures, Federal standards, etc., but there are problems that aren’t specifically covered, and someone must make decisions about them when the time comes.

In normal accident theory, “human error” is a label for the problems that occur when you have systems that are interactively complex and tightly coupled. Perrow also recognized that the label could be influenced by politics, saying:

“Formal accident investigations usually start with an assumption that the operator must have failed, and if this attribution can be made, that is the end of serious inquiry. Finding that faulty designs were responsible would entail enormous shutdown and retrofitting costs; finding that management was responsible would threaten those in charge, but finding that operators were responsible preserves the system, with some soporific injunctions about better training.”

Issues arise in normal accident theory because those two dimensions, complexity and coupling, have to be assessed relative to the people involved. We can’t say that a system has unintended feedback loops, or that there is a limited understanding of its processes, without considering both the human and the system.

Because of this, those two dimensions cannot really be as sharply divided as normal accident theory suggests. Further, even if those measures were true of a system, they wouldn’t necessarily stay that way. Coupling can increase during high-demand periods and decrease in others.

Takeaways

  • Understanding how these models and theories evolved can help us more clearly see how our teams and organizations view accidents. Often, some models or parts are going to look very familiar.
  • The latent failure model is still very common. Elements are seen in many areas of accident investigation.
  • The idea that latent failures in the system eventually cause accidents when exposed to certain conditions is a step in the right direction in understanding systemic accidents, but it doesn’t explain how they’re formed or how to prevent them from forming.
  • Normal accident theory tells us that accidents tend to occur and should be expected in complex systems.
  • Normal accident theory primarily looks at systems along two dimensions: coupling (loose or tight) and interaction type (linear or complex).
    • These lines aren’t as sharp as they may seem as their definitions must relate to the people who are making the assessments, so can vary.
