Incident Reviews in High-Hazard Industries: Sense Making and Learning Under Ambiguity and Accountability

Thanks to Lorin Hochstein for bringing this paper to my attention!

Though this paper by John Carroll is about nuclear power plants, it could very easily be about software teams and companies today. What is especially interesting is that this paper is from 1995. Once again it seems that we in software are on our own journey, (re)learning lessons that other industries discovered long ago.

While software is unique in some ways, we still are subject to the same constraints as other complex, socio-technical systems and run into similar problems when attempting to learn from incidents.

Incident review is an important part of the organizational learning process, but it can be practiced in a way where the focus shifts away from learning to fixing. This creates a few issues such as:

Root cause seduction
Sharp end focus
Solution driven searches
Account acceptability

Incident Example

The author provides an example of an incident and the review to demonstrate some of the issues that can occur.

The incident took place at a nuclear power plant where operators in the control room suddenly learned that their monitoring equipment had stopped working over an hour earlier.

I think this is something we can relate to in software, as without many of our monitoring tools we can’t see or determine how the system is doing. As with this plant, many organizations I’ve worked with have treated this as a very pressing incident.

During the previous downtime, a scheduled maintenance period, the control room had moved from an electromechanical system to a computerized system. Now, several months later, an operator had placed a switch in an incorrect position while also executing some commands that caused the computer to lock up. The author tells us that over the next hour or so there were opportunities for operators to notice the state. It’s unclear what those opportunities were or how obvious they were, but either way they were not seen.

An “event review team” (ERT) was brought together to investigate the incident. The team was drawn from multiple groups across the plant, including engineering, training, operations, and quality assurance. They worked full-time for 10 days, during which they interviewed personnel, reviewed procedures, industry reports of similar events, and design docs, and finally performed a root cause analysis.

Ultimately the review team’s report determined that the incident was primarily caused by “software design flaws (inadequate security to prevent inadvertent access to software control functions) and operator failure to follow procedure.” They specified secondary causes as “inadequate design specifications, incomplete procedures, lack of operator training on critical aspects of the system, lack of follow-up by the design team of system failures during installation, misleading information given to the operators about the computer.”

This is likely unsurprising if you’ve seen the very blaming “causes” found in similar reports. Interestingly, although the operators were blamed, the misleading information listed as a secondary factor was that they had been told they could not do anything wrong to the computer. The report also went on to discuss why the failure wasn’t detected sooner and gave the usual reasons, including “lack of awareness” and long-standing “nuisance alarms,” meaning alarms that were expected for a number of actions or that were simply defective.

Of course, issuing a report is not the end of the learning process, or even necessarily part of it. What matters is what actions are taken as a result, what lessons are learned from the report, and how it’s interpreted by the organization and its members. In this case the operators who were “involved” in the incident (which really just sounds like anyone who was around during it) were disciplined. But even though the report had mentioned engineering design, no engineers were disciplined, since the engineering department “considered that no individuals were responsible for the design.”

Root Cause Seduction

Here, there are some cultural assumptions at play. The author suggests that it was not necessarily the culture of the power plant, but an overall engineering culture, whose assumptions guided the diagnosis and further action. As we can tell from the existence of a root cause analysis in their process, and as many of us have seen from our own experience, much or even all of an incident analysis can be based on the assumption that a given event can be traced back to a root cause.

While the example report did list a number of causes, we can still see very linear cause-and-effect thinking at work, where ultimately the operators were determined to be the cause. This sort of thinking can be tempting, as it sidesteps any ambiguity or uncertainty that could arise when analyzing a complex system. The author also suggests that:

“the seductiveness of single root causes may also feed into, and be supported by, the general tendency to be overconfident about how much we know.”

As we’ve [discussed before](https://resilienceroundup.com/issues/human-error-and-the-problem-of-causality-in-analysis-of-accidents/), where one labels something in a timeline as a “cause” is ambiguous and can vary widely from person to person. The author cites [James Reason](https://resilienceroundup.com/issues/the-contribution-of-a-latent-human-failures-to-the-breakdown-of-complex-systems/), explaining that it’s incredibly unlikely that there would be any single cause for any surprising event, but instead a “confluence of behaviors, conditions, and circumstances that have developed over time.” Further, anything labeled as a cause would itself have some sort of causal chain attached to it that can be arbitrarily long.

Even the categories used to group or identify causes (as we saw in the report, poor design versus insufficient training) are another arbitrary and very subjective form of labeling.

Sharp End Focus

Because the sharp end is seen as the place where the last human barrier to disaster exists, these are the people who tend to be held accountable. The reasoning is that these are the people who had a choice and “could have” prevented an incident. Multiple researchers have studied causal reasoning and observed this type of fundamental attribution error, where there is increased focus on people “who could have done otherwise.” In this situation, those on the outside tend to attribute actions and outcomes to the participants, whereas those participants cite features of the situation as causes of the incident.

This focus on the sharp end naturally stems from attempting a diagnosis by starting with the incident that occurred. When you work backward, following what erroneously appear to be direct trails and links for each element of the incident, it is easy to fall into the trap of blaming the people who touched the things as opposed to the people who designed the things. This is especially true when we consider the bias to focus causal explanations on things that are nearby in space or time.

This hindsight bias shows up over and over in a number of ways in incident or event reviews. One example is that there have been chemical plants that have gotten good or great marks on thorough external reviews but a short time later have some sort of accident. Post incident reviews in those cases have found those incidents to be “accidents waiting to happen.”

I like the author’s summary of the problem:

“In short, how compelling a diagnosis seems after an incident is not always a good indication of how easily the incident could have been foreseen and prevented; diagnosis prior to an incident is many times more difficult.”

Solution Driven Search

This problem area stems from the fact that

“Engineering training emphasizes the importance of finding solutions and trains students on problems that have known solutions.”

From this perspective, diagnosis and problem solving become a search through a list of known solutions instead of the much harder task of tackling new design problems. Even here we see some issues with labeling, as the author points out that categorizing experience into “problems” and “solutions” masks the much more complex interrelationships at play.

An engineering executive at a nuclear power plant is quoted as saying: “it is against the culture to talk about problems unless you have a solution.” This, in combination with confirmation bias, creates a very strong bias toward solutions that are familiar, and may make it hard to see problems that fall outside of whatever well-defined categories the organization is using. Even those problem categories are generally organized by solutions.

Account Acceptability

“Because the teams that diagnose operational incidents are members of the society or culture with shared assumptions, and because they are accountable for the report, they are encouraged in various ways to put forth acceptable accounts of incidents, their causes, lessons learned, and improvement strategies.”

This can manifest itself in several ways. You may have experienced it yourself in a post-incident review where it seems that everyone knows something but “can’t” put it in the report, or where people are especially careful about what they say about contributing factors because they know it’ll be published widely.

This acceptability pressure occurs across the organization. Carroll points out that another nuclear power plant submitted its reports to the Institute of Nuclear Power Operations so that they could be rewritten with the “right” words.

Moving from Fixing to Learning

At a high level, the way to avoid these traps is to ensure that incident review processes are truly geared towards learning and not fixing. This is critical, as the author points out:

“In complex, high hazard industries, it is not sufficient to honor specific expertise and celebrate ‘heroic’ individuals who come up with fixes. No one person can know enough. The best answer keeps changing. Organizations are continually trying new things, and information must flow across specialties to where it is needed. People learn with others and through others; the sources of new ideas are not easily predictable”

This can start with shifting our expectations. Instead of expecting that some root cause or specific solution be found, we can expect that the process will help us learn more about the system and the people who operate it.

Takeaways

Incident review is an important part of the organizational learning process.
Incident review is often practiced in a way that is oriented towards fixing things as opposed to learning from them.
There is of course nothing wrong with fixing things, but it can prevent learning.
This fixation with fixing can cause a number of issues, such as:
- Root cause seduction
- Sharp end focus
- Solution driven searches
- Account acceptability
One way to help combat a fixing focus and learn from incidents is to focus on the learning itself, without expectation of fixing.
Learning from experience is difficult in complex, high-consequence, tightly coupled fields, including software, but that makes it all the more important that it occurs.
“How compelling a diagnosis seems after an incident is not always a good indication of how easily the incident could have been foreseen and prevented; diagnosis prior to an incident is many times more difficult.”
