Resilience Roundup - Drifting into failure: theorizing the dynamics of disaster incubation - Issue #19

This is a paper from Theoretical Issues in Ergonomics Science by Sidney Dekker and Shawn Pruchnicki, via Daniel Hummerdal’s great Safety Differently site.


Drifting into failure: theorizing the dynamics of disaster incubation

“Organizations incubate accidents not because they’re doing all kinds of things wrong, but because they are doing most things right. And what they measure, count, record, tabulate and learn, even inside of their own safety management system, regulatory approval, auditing systems or loss prevention systems, might suggest nothing to the contrary”

The authors talk about the period before disasters occur, known as the “incubation period.” This is the stretch of time before an accident or disaster during which risk increases, potentially over long periods. Of course, this risk goes unrecognized; otherwise, it would not help incubate a future failure.

The term “incubation period” comes from Barry Turner. It stemmed from an accident he was investigating in 1966, where coal mine waste material slid into a village. During his investigation, he found that a number of events had been overlooked or simply disregarded prior to the accident because they conflicted with then-current beliefs about how safety and hazards worked. As Turner would later say, “Within this ‘incubation period’ a chain of discrepant events, or several chains of discrepant events, develop and accumulate unnoticed.”

As Turner’s investigation continued, he increasingly began to focus on sociology as opposed to engineering. He went on to suggest that failure needs to be understood over periods of time, not just one point in time. The authors provide an overview of a number of lenses through which we can view failure, safety, and resilience, beginning with high reliability organizations.

Though the title mentions theorizing, the paper is a good overview of the different ways of looking at accidents. For this paper, the authors were primarily concerned with “changes in what is considered acceptable or even noticed as unacceptable” over time.

Research has linked these dynamics to a number of different things:

  • Erosion of safety constraints
  • Complexity of organizations
  • Limits on rationality and learning
  • Pre-rational, unacknowledged pressures of production
  • The inability of organizations to recover from disruptions

The authors quote Karl E. Weick saying:

“Success narrows perceptions, changes attitudes, reinforces a single way of doing business, breeds overconfidence in the adequacy of current practices, and reduces the acceptance of opposing points of view”

Another element of HRO research they highlight is narrowing the gap between the way management thinks the work is done and the way it is actually done by practitioners. It’s not enough to make sure policy or procedure is being followed; it requires “interest in what goes on there beyond whether it complies with pre-understood notions of protocol and procedure.”

What Diane Vaughan calls ‘structural secrecy’ can contribute to overconfidence from past results as well. This is a combination of bureaucracy and knowledge silos that can create pockets of an organization that lack the expertise to see what their actions can cause in another area. Interestingly, her research has shown that very formal information exchanges, like meetings and PowerPoint presentations, can actually make the problem worse.

Goal conflicts tend to be resolved by the people at the sharp end of the organization actually doing the work. These conflicts can be resolved using an explicit measure if one is provided, but if one is not then it is up to each individual practitioner to resolve this, time and time again.

“A key ingredient suspected in any incubation period is the organization’s preoccupation with production efficiency”

The authors discuss how the pressure to achieve goals of production can affect many different parts of an organization including managerial decisions, which can be measured relatively easily and directly. The problem comes when these pressures and decisions mask the erosion of previously established safety margins.

They suggest that since it is not straightforward to understand how decisions that may appear small can have such a large effect on the organization, the best process is to start by mapping the organization’s goals and noting where they conflict.

Of course, not all organizations handle goal conflicts the same way. Some organizations will explicitly state how to handle them and that it is up to the practitioners to resolve them, whereas other organizations may never state what trade-offs should be made. When practitioners resolve these conflicts well, they are often celebrated or perhaps considered to be “just doing their job.” But when an outcome is bad, those same decisions are often cited as the cause of failure. Further, how conflicting goals and constraints are resolved can become part of an organization’s culture, which can then influence what is seen as rational.

An idea called “the normalization of deviance” can also influence what risk is seen as acceptable. This is a process in which a group’s idea of acceptable risk continues to be accepted even when there are signs of increased potential danger. This can persist up until something goes wrong, which then highlights the gap between the real risk and how it was thought to be managed. At first, small changes from the established standard don’t get reported or are seen as unimportant, but those incremental changes help allow normalization.

It’s important to note that choices that can be regarded as bad decisions after an outcome is negative are choices that seemed reasonable to the people making them at the time, or else they would not have made them. This is the local rationality principle.

Control theory looks at the idea of the incubation period from another direction. As Dekker and Leveson explain, control theory is a way of tracing undesirable events by looking at interactions in parts of the system. The idea is that adverse events happen when there are disruptions or interactions between system components where that interaction is not managed well. An example would be a lack of safety constraints in the design or operation or erosion of the same. The authors do acknowledge that there is an amount of “retrospective normativism” in this idea, but the goal here is to capture these complex systems that are normally imperfect and involve people, society, and organizations.

Safe operation of dangerous processes is a matter of keeping components in equilibrium by continually making some changes, often small, in order to keep the system within safe bounds. As a result, when there are bad outcomes, it is not from a specific event or root cause, but from the normal interaction of system parts.

In this view, the potential for failure accumulates as more and more changes are made and accepted that move the system away from its safety margins. In loosely coupled systems, this can occur over time as smaller subsystems drift, and it becomes evident when tighter coupling is needed, or as the authors call it, “stochastic fits.”

As Jens Rasmussen tells us, the solution here is not to continually make more rules to control the system, but to help people understand where the boundaries of system performance are and to help them develop skills at coping near those boundaries.

“Failure incubates non-randomly, opportunistically alongside or on the back of the very structures and processes that are supposed to prevent it. Incubation happens through normal processes of reconciling differential pressures on an organization against the background of uncertain technology and imperfect knowledge”

Takeaways

  • Looking at how people and organizations function across time and a number of incidents, not just a single snapshot or single accident, can help us understand failure more broadly.
  • Accidents typically don’t happen from sudden, random events, but are incubated, sometimes for a long time beforehand.
  • Organizations that have accidents are usually doing many things right, not many things wrong.

**Who are you?** I’m Thai Wood and I help teams build better software and systems.
