Nine Steps to Move Forward from Error

Quotes:

“To understand failure, understand success in the face of complexities.”
“To understand failure, look at what makes problems difficult.”
“Safe organisations deliberately search for and learn about systemic vulnerabilities”
“As capabilities, tools, organisations and economic pressures change, vulnerabilities to failure change as well.”

This is a paper from Cognition, Technology & Work by David Woods and Richard Cook, in a different format than the usual. In it the authors discuss nine very concrete steps that organizations can take after a “celebrated failure”. They go over a number of maxims and corollaries, but ultimately leave us with a checklist we can run through to ensure we are being effective, all based on research of course.

After incidents or accidents occur, parts of an organization can begin to ask unhelpful questions about the accident, often resorting to human error as an explanation. The authors explain that this can happen because the stakeholders involved have a hard time acting on lessons from years of research on human performance, since those lessons conflict with conventional views and can require sacrifice, for example in trade-offs between production and safety.

The checklist

Pursue second stories beneath the surface to discover multiple contributors.
Escape the hindsight bias.
Understand work as performed at the sharp end of the system.
Search for systemic vulnerabilities.
Study how practice creates safety.
Search for underlying patterns.
Examine how change will produce new vulnerabilities and paths to failure.
Use new technology to support and enhance human expertise.
Tame complexity through new forms of feedback.

Step 1: Pursue second stories

First stories are those told right after the incident. They are typically biased by hindsight and are overly simplified. Often they will simply point to human error as a cause. This is problematic because it prevents learning and masks what really happened: what pressures people were under and what adaptations were normally successful.

These first stories can be attractive because they are easy to tell and understand, but they are usually wrong. You can recognize first stories because the “solutions” that they offer don’t really make sense or wouldn’t be very effective.

You can tell that something is a first story because it may end up calling for others to “just be more careful”, or there may be findings like “we can do a better job if only…”, or the response may be something like “we need to take humans out of the loop”.

Of course, if you’ve been following along here, you already know that in order to make progress on safety we need to move beyond these cursory first stories and find out what really happened, what’s being masked by that label “human error”.

This is where the “second story” comes in: finding out what really happened.

Step 2: Escape from hindsight bias

This is an extension of the second story idea. The first story is biased by hindsight, so we need to make a concerted effort not to have that same bias. We can try to put ourselves in the shoes of the people experiencing the incident as it unfolded instead of only considering our knowledge of the outcome.

Looking for difficult problems can help defeat those first stories. Look for what difficult problems are being solved, what makes them difficult, and what strategies are being developed to face them. Taking a number of these cases together, it can be possible to see how strategies could be applied elsewhere or improved.

Step 3: Understand the work performed at the sharp end of the system

The sharp end of the system, in contrast to the blunt end, is where the practitioners actually do the work, interacting with the system. This is where the realities of the situation may not match management’s idea of how work is performed. The people at the sharp end are the ones who are facing trade-offs and goal conflicts that they are having to resolve, and yet still often creating safety. Supporting these people and investing here as well as understanding how these practitioners cope is a way to improve safety.

Step 4: Search for systemic vulnerabilities

Safety is not a single person, department, device or component. It is an emergent property of systems. It is created through systems work. As a result, it’s important to find systemic vulnerabilities and not just human problems.

When we examine technical work in context, we will come to find a number of trade-offs, problems, and areas where it’s possible to fail. We will also begin to notice how people have been coping with them. Once we understand this, we can go on to look at how those adaptations can themselves be vulnerable. Understanding this can help us anticipate future vulnerabilities.

Step 5: Study how practice creates safety

In the face of failure, people tend to assume that the system is normally safe and that a given failure occurred simply because of some unreliable element. This is not the case at all. As mentioned previously, safety is something that people create. It is important to study how this happens by examining the adaptations and trade-offs being made.

Step 6: Search for underlying patterns

Often if a failure triggers a “hot button” issue, people have a tendency to only look at the surface of the problem. Looking beyond the most obvious, surface issues and into what patterns are present in the system itself is key in moving beyond the failure itself.

Step 7: Examine how economic, organizational and technological change will produce new vulnerabilities and paths to failure

“As capabilities, tools, organisations and economic pressures change, vulnerabilities to failure change as well.”

Because systems exist in a world that is always changing, the systems themselves are changing as well. Typically systems will also drift towards failure as the defenses that were previously planned are slowly eroded by production pressures or by that change.

Research supports this idea that hazardous operations are actually really successful when they anticipate and plan for unexpected events. See Issue 9 (The High Reliability Organization Perspective) for more about this.

Understanding the Law of Stretched Systems can help: “systems under pressure move back to the ‘edge of the performance envelope’”. In other words, when improvements are made to systems, the gains are typically taken in the form of productivity instead of safety. Further, these changes tend to increase coupling. Increased coupling increases complexity, and increased complexity causes increased problem difficulty.

Step 8: Use new technology to support and enhance human expertise

There is an idea that it’s relatively easy to improve certain areas by just throwing more technology at them. Research doesn’t support this. The thought is often that just adding technology will reduce the role of those pesky humans and therefore prevent failure.

The idea that adding just that much more tech will solve problems in operations has not actually come to pass in practice. This is because of the complexity underlying operations, which can be hard to see. So adding more complexity with new technology, while it can have benefits, can also create new modes of failure. It’s also not effective to treat people and computers as separate when they interact together to do cognitive work.

Step 9: Tame complexity through new forms of feedback

Failure can be seen as “breakdowns in adaptations”. That is, failure occurs when the adaptations that we’ve made to help us cope with complexity stop working. Success is when organizations, groups, or individuals recognize the need to adapt in the face of potentially negative outcomes.

Since improvements in these areas require us to see the effects of decision-making and the effects of actions, it’s critically important that we have some sort of feedback. The authors explain that “In general, increasing complexity can be balanced with improved feedback” and that “The constructive response to issues on safety is to study where and how to invest in better feedback”.

Better feedback:

Captures patterns, not just a large set of available data
Represents events, not just one value of a monitor
Looks forward, not just a record of the past

The authors close by giving us some final advice: “organisations need to develop and support mechanisms that create foresight about the changing shape of risks, before anyone is injured”.
