Resilience Engineering: New directions for measuring and maintaining safety in complex systems

This week we have a big report by some names you'll likely know: Sidney Dekker, Erik Hollnagel, David Woods, and Richard Cook.


Resilience Engineering: New directions for measuring and maintaining safety in complex systems

This is a big, useful, and important report, so I'll be visiting it over the next few issues. It's especially useful for us in software because it talks about resilience directly and goes over past theories and models and how they advanced.

They also cover what remnants of old models might still be impacting the typical accident investigation. In my opinion, those patterns can become pervasive enough to affect even us in software (e.g., "root cause").

They start by going through the theories and ideas that helped make Resilience Engineering what it is today. This is more than just a history lesson: having an idea of how the thinking and research evolved can help us advance our own thinking, or that of our organizations.

Pervasive ideas

Specific models aside, there are a few ideas that a view through a resilience lens helps us dispel. One is unfortunately all too common in many software and ops type organizations: that we operate basically safe systems. The thought tends to be that the system is mostly safe because it was designed with safe components put together correctly.

This works fine for relatively simple machines and systems: make sure each component is safe, and the result is safe, right? The problem is that this component-level view of safety doesn't apply in complex systems.

Further, this idea tends to manifest as wanting to distance humans from the system: if the system is safe, all we need to do is keep those unsafe humans out of it. Of course, we're learning that this isn't how it works at all.

In fact, it's quite the opposite: safety is not an inherent property of systems but an emergent one. The humans in them are adapting and creating safety through their work.

Series of events model

The series of events model is probably the one I've seen used most often, usually just by default. It basically says that an accident is simply a direct series of causes and effects. Following this idea means that accident investigation is simply taking the last link in the chain and working backwards until you reach the first link, the "root cause."

We live in a non-Newtonian world. I know that would be a crazy thing to say from a physics perspective, but when it comes to our modern systems, it's definitely true. The idea that each effect has a direct cause, and an equal one at that, is what permeated most accident investigation theory early on.

Especially in software and SRE-type areas, we've definitely learned that contributors can have an outsized effect and that causes don't need to be equal in size to the reactions they create in our systems.

There are still some holdovers from this model, though, that are affecting us even today.

Relics of the series of events model

The series of events model has heavily influenced almost all major models and thinking about accidents since its inception. You can still see relics of it today.

One thing that it has left us with is an attempted division between incidents and accidents. If we subscribe to that model and think of accidents as a linear series of events, then it can seem intuitive to say that an incident is something that would have been an accident, the same series of events, except it was stopped before the end.

This can encourage further divisions, like near misses, which would be something that approaches the boundary of an incident but is stopped before it gets there.

Most of this theory was created to model the physical world, where attempts to make systems safe center on containing energy. This means you tend to see a lot of barriers that either stop that energy or slow its release.

We of course use and create systems where this isn't really a concern, so it's not a great fit either way.

Man-made disaster theory

You might think that a theory called "man-made disaster theory" would be a step back, something that blames the operators, but it's actually the first real start of the opposite. It was developed by Barry Turner in the 1970s and was the first to really look at accidents as a sociological phenomenon.

The idea here is that accidents start with something small, after which there is some "incubation period," which could be quite a long time. During this time, problems build up. At the same time, the organization is changing the way it assesses itself or the way it assesses risk.

Eventually, this gulf between the assessed safety and reality culminates in an accident. This theory was helpful in that it showed how small, normal, seemingly safe changes interacted over time to create accidents.

Unfortunately, it still preserved the idea of "human error" as a cause, just pinning it to management instead of those at the sharp end. Further, it also perpetuated the idea that accidents are a linear chain of events.

What about you?

Do these seem familiar to you? Are they models that you or your organization still use? In what ways are they helpful or harmful to your work? Hit reply and let me know; I read every email.

Takeaways

  • Understanding previous models and how they evolved can help us shift our own thinking
  • Much of accident investigation is still heavily influenced by early models that were primarily designed for relatively simple processes and machines
  • Though it had some problems of its own, Barry Turner's man-made disaster theory was the first to look at accidents as a sociological phenomenon
  • Constructing cause around management or similar is really just another way of relocating where "human error" occurs
  • Older theories tend to revolve around physics and preventing the unsafe release of energy. This means they're not really a good fit for most of our complex systems.
