Resilience Roundup - Illusions of explanation: A critical essay on error classification - Issue #42

This is a paper by Sidney Dekker in which he discusses the problems with how error is defined (or rather, how poorly it's defined) and, as a result, with how error is considered and counted.

Dekker addresses three issues:

  1. Classification of errors is easily mistaken for analysis and deeper understanding.
  2. Finding deeper reasons for an observed error is often just a matter of finding other errors, either inside the heads of the people involved or committed by other people.
  3. Safety is modeled as the absence of “negatives” (errors), misguiding managerial interventions.

The crux of the paper is that because “error” gets defined in so many different ways, the term becomes less useful. Further, “errors” are often counted and then compared to say whether a given intervention or policy change worked: if fewer errors are counted afterward, the change is seen as successful.

He looks at the aviation industry specifically, though much of what he discusses isn’t very specific to aviation at all. It doesn’t take much extrapolation to see how it’s relevant to us in software. I think that as a field we tend to focus on the quantitative and ignore the qualitative information available.

What Does Error Mean?

Part of the trouble around counting errors as a means of measuring safety occurs because error can mean many different things. It could mean:

  • Error as the cause of failure, such as when saying an incident occurred because of human error. For example, a classification may use this definition when tracing the cause of an operator error to “a supervisor’s ‘failure to provide guidance’”.
  • Error as the failure itself. This definition is typically used by classifications to try to categorize observable errors that operators can make, usually perceptual or skill-based.
  • Error as a deviation from some process or standard.

Like most classification methods, the system Dekker discusses, Line Oriented Safety Audits, doesn’t distinguish between these different meanings. It therefore can’t differentiate between cause and consequence, which hurts the ability to understand error.

Not Scientific

As Dekker points out, many of the assumptions made when these classifications are performed, or chosen as a solution, haven’t been tested in any scientific way. And because they disassociate the count from its context, there is no way to go back and verify the counts or see whether you would draw the same conclusions.

Further, since there is no clear definition of what error is, whatever gets counted is potentially just a representation of whatever the individual doing the counting thinks of as error. Dekker likens this arrangement to pseudoscience or numerology.
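As a rough illustration of this verification problem (the incident records and field names below are invented, not from the paper), two counters applying different definitions of “error” to the same records can even produce identical totals while counting entirely different events:

```python
# Hypothetical incident records for illustration only.
incidents = [
    {"desc": "checklist step skipped", "deviated_from_procedure": True,  "caused_failure": False},
    {"desc": "wrong switch selected",  "deviated_from_procedure": False, "caused_failure": True},
    {"desc": "late configuration",     "deviated_from_procedure": True,  "caused_failure": True},
]

# Counter A defines "error" as any deviation from procedure.
errors_a = [i["desc"] for i in incidents if i["deviated_from_procedure"]]

# Counter B defines "error" as an action that caused a failure.
errors_b = [i["desc"] for i in incidents if i["caused_failure"]]

# Both tallies come out to 2, yet they counted different events --
# the final number alone can't tell you which definition was used.
print(len(errors_a), errors_a)
print(len(errors_b), errors_b)
```

Once only the tallies are kept and the context is discarded, there is no way to recover which definition produced them, which is exactly why the counts can’t be independently verified.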

Removing Data From Context

Another issue is that these systems remove the context of the situation in which the behavior or event occurred. Forcing a complex situation into simple category labels makes it much harder, perhaps even impossible, to understand what occurred.

Safety as the absence of negatives

Error counting and classification methodologies all assume that safety is simply a given, fixed thing. As a result, they assume a close relationship between the number of errors and the amount of safety.

Of course, safety is more complex than this and arguably doesn’t even include managing those “negatives.” Safety is not something that is out there to find, that lives outside of the people doing the work.

Safety is created locally by the people doing the work, and it is shaped by what they believe about the situation or system, what they’ve experienced before, and the assessments they make.

Moving Forward

Dekker’s main suggestion for moving forward is to reduce or eliminate the emphasis on constructing causes as explanations. Error categorization and classification systems make the search for a cause faster, but because cause is not a set thing that we can go find, it’s constructed based on what we put under investigation and where we look.

Labeling certain decisions or actions as “error” is completely arbitrary and doesn’t yield further understanding. Instead, we can look for patterns, or what Dekker calls “genotypical mechanisms of failure,” that may recur.

Takeaways

  • Not having a definition of error makes error counting impossible to verify.
  • Safety is not simply the absence of negatives, so even the best error counting methodology would not necessarily help create safety.
  • Safety is created by people’s actions, which are shaped by the assessments they make and their beliefs.
  • Taking context away from the situation and reducing it to a tally in a category makes it impossible to learn from and understand the error.
