Resilience Roundup - The Contribution of Latent Human Failures to the Breakdown of Complex Systems - Issue #11

Welcome back!

This week we are discussing another article from the same issue of the Philosophical Transactions of the Royal Society. This one is by Dr. Reason, titled “The contribution of latent human failures to the breakdown of complex systems”, which Dr. Cook points out became the book “Human Error” that Reason released the same year.

Unfortunately, I wasn’t able to find a copy of this paper that wasn’t behind somewhat of a paywall, though I was able to locate it in JSTOR, which allows a few articles a month with a free membership. So if you click through and want to read it for yourself, you’ll need to register for a free account.

Please note that JSTOR gives articles like this out in an image format that isn’t screen reader friendly and doesn’t seem especially high-res for OCR. Please just hit reply and let me know if it would help for me to read this article and record it.

Estimated reading time: ~12 min


The Contribution of Latent Human Failures to the Breakdown of Complex Systems

Reason begins by pointing out that a large number of disasters that were recent at the time, including the ferry accident we’ve discussed previously, have a number of things in common when you look below the surface.

  1. “They occurred within complex socio-technical systems, most of which possessed elaborate safety devices”
  2. “These accidents arose from the adverse conjunction of several diverse causal sequences”
  3. “It was subsequently judged that appropriate human action could have avoided or mitigated the tragic outcome”

Reason goes on to point out something that we’ve discussed in a few different contexts here: modern systems and technologies are now rarely vulnerable to single component failure. Though this is quite a feat of engineering, there is a cost to the achievement. Namely, the deep barriers and defenses that were constructed in the name of keeping the system safe now prevent the operators from easily or completely understanding it.

Similar to Norman’s “problem with automation” paper that we went over previously, Reason laments: “Human operators are becoming increasingly remote from the processes they nominally govern”.

Reason tells us “such problems can no longer be solved by the application of still more engineering fixes, nor are they amenable to conventional remedies of human factors specialists”.

Reason begins to describe a model that he uses to attempt to explain how it is that humans and their behavior can contribute to the failure or breakdown of complex systems. He describes two different modes: Latent Failures and Active Failures.

Active failures are “errors and violations having an immediate adverse effect”, whereas latent failures are the opposite. They are factors that were present in the system, typically for a very long time, before any sort of accident-related events began to take place. Because these factors can be very far removed from the accident, both in time and in the physical world, they are typically flaws designed into the system or regulated into it.

This paper is interesting because it gives a view of the move away from the old view and towards the new view. Here Reason is developing a model we’d eventually know as The Swiss Cheese Model, which Sidney Dekker would later critique in his book “The Field Guide to Understanding ‘Human Error’”.

We can see this development because Reason says “there’s a growing awareness within the human reliability community that attempts to discover and remedy these latent failures will see far greater safety benefits than will localized efforts to minimize active failures”.

Reason goes on to use the metaphor of a “resident pathogen”. The idea here is that, similar to how humans become sick, there is not one single cause. We always have some amount of pathogens in us; it’s only when some other condition is met or introduced that we become sick.

From this view he goes on to describe five assumptions:

  1. “The likelihood of an accident is a function of the total number of pathogens”. The more there are, the more likely it is that enough of them will combine to trigger some accident sequence.
  2. “The more complex, interactive, tightly coupled and opaque the system, the greater will be the number of resident pathogens”
  3. “The higher an individual’s position within an organization, the greater is his or her opportunity for generating pathogens.”
  4. Whereas it can be impossible to predict all of the possible local triggers that occur, “resident pathogens, on the other hand, can be assessed, given adequate access and system knowledge”
  5. From the previous assumptions, Reason draws the conclusion that those interested in safety could more “profitably” spend their efforts on identification and removal of latent failures as opposed to active ones.

Reason attempts to describe a “general framework for accident causation”, acknowledging that the resident pathogen metaphor is not itself a workable theory. He goes on to state that the metaphor is also very close to the discredited “accident proneness theory”, an attempt to find a sort of “accident prone personality” that yielded no results.

He differentiates this metaphor from that fruitless search by saying that the metaphor is different since it can be applied in advance, not just retrospectively.

As Reason begins to develop a broad framework, he starts by attempting to identify the “healthy” parts of complex systems, giving five:

Decision-makers

Architects and senior executives who set goals for production and safety for the system, as well as direct strategy and allocate finite resources including money, equipment, people, and time.

Line management

Specialists who implement those strategies within their specific departments such as safety, operations, or engineering.

Preconditions

Things like ensuring that the people who do the work are skilled and that the machines will work and are of the right type.

Productive activities

The actual performance of those people and machines.

Defenses

Barriers and activities to help prevent injuries or outages that can be foreseen.

Reason then goes on to diagram the model, where the final barrier has only a limited window of opportunity in which an accident can occur. (See Reason’s Figure 1 below [captions mine])

Reason explains that latent failures are primarily systemic but can also be introduced at all levels of the system simply because of their nature. Because humans can be stressed, because they can make mistakes, because they can fail to perceive hazards; they have the ability to introduce these “resident pathogens” at all levels.

Reason is careful to explain “it must be accepted that whatever measures are taken, some unsafe acts will still occur”. He also continues to advocate for various defenses, both physical and psychological, stating that in a system that is well protected it’ll take several causal factors to create “a trajectory of opportunity” that could traverse these multiple defenses.
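To make the layered defenses idea a bit more concrete, here is a toy sketch of my own, not anything from Reason’s paper. It assumes each defense has some chance of having a gap lined up at a given moment, and (a big simplification) that the layers are independent of each other. The function name and all of the numbers are hypothetical and only there for illustration.

```python
from math import prod


def breach_probability(gap_probabilities):
    """Chance that a single trajectory passes through every defense.

    Assumes the layers are independent, which real systems often are not.
    """
    return prod(gap_probabilities)


# Hypothetical numbers: four defenses, each with a 10% chance of a gap.
well_defended = [0.1, 0.1, 0.1, 0.1]

# The same system after latent failures have weakened two of the layers.
weakened = [0.1, 0.5, 0.5, 0.1]

print(breach_probability(well_defended))
print(breach_probability(weakened))
```

Even in this crude version, weakening the middle layers makes a full “trajectory of opportunity” much more likely without any single layer failing outright, which is roughly the intuition behind looking for latent failures.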

Reason states “a significant number of accidents in complex systems arise from a deliberate or unwitting disabling of defenses by operators in pursuit of what, at the time, seem to be sensible or necessary goals”. (This is essentially a restatement of the local rationality principle.)

Reason then goes on to suggest how to manage safer operations. He looks at the system as a whole and describes a number of feedback loops that he calls 1, 2, 3, and 4. (See Reason’s Figure 2 below [captions mine])

Loop 1 is between fallible decisions and accidents. This is essentially the table stakes of reporting accidents, though it can’t really be used to predict future accidents as it comes much too late.

Loop 2 is between fallible decisions and unsafe acts that occur before an accident, or at least those that are observed. He suggests this loop is potentially available if there is some sort of audit procedure, though he also dismisses it as typically being given only to lower levels of the organization.

He then goes on to highlight loops 3 and 4, which run between fallible decisions and psychological precursors, and between fallible decisions and line management deficiencies. He suggests that it would be most effective to influence safety by influencing the system and individual states early in this chain. The chain runs from fallible decisions, to line management deficiencies, to psychological precursors of unsafe acts, to unsafe acts, and then finally to the accident. This is the same layering model that he described before.

He uses Westrum’s classification of the ways in which organizations can respond to safety-related information. He recounts those three reactions for us here as: Denial, Repair and Reform.

Denial actions are things like punishing whistleblowers or even altering records.

Repair actions are those that often could be public relations tactics but typically only address problems at a low level. For example: disciplining specific operators or modifying equipment.

Finally, reform actions involve actually talking about the problem throughout the organization and acting throughout the organization. This sort of action is what leads to changing the system as a whole.

As expected, Reason tells us that the more effective the organization, the closer it is to responding with Reform actions.

As a result of thinking of these three types of reactions we can then say there are three types of organizations:

  1. “Pathological organizations”: these don’t have good safety measures even normally. They’re always sacrificing safety for productivity, are often very close to the margins, and may even actively try to get around safety regulation.
  2. “Calculative organizations”: these are trying to do their best “by the book”. They’re usually ok, but they are often going to fail when using more complex systems, and thus have more complex accidents.
  3. “Generative organizations”, however, are the most effective. They set safety targets for themselves that are higher than what you would normally expect, perhaps higher than what industry regulators require of them, and they are able to meet them, primarily because they are willing to try new and different things to achieve them.

Reason gives an example of this citing the La Porte group’s research that highlighted the air traffic control system, Pacific Gas & Electric, and U.S. Navy nuclear aircraft carriers.

Reason points out that all these organizations have at least two goals in common: to avoid outright any major failures that could harm or even eliminate the system, and to safely deal with the very high levels of demand or production that arise.

Reason says that each of these high reliability organizations has three distinct authority structures: routine, high tempo, and emergency. Each structure has its own traits and ways of communicating, as well as different patterns of leadership practices.

This, Reason says, is potentially the most significant feature of these high reliability organizations: an adaptive structure that responds to different levels of hazard as opposed to being in one mode all the time.

Routine mode is the very familiar pattern of a typical rank and file arrangement. This is typically what an organization looks like at a glance to most people. This is where organizations function based on their Standard Operating Procedures.

High tempo mode is a level down from that purely bureaucratic, normal structure. This is where authority goes beyond rank and status and instead becomes based on skill: “Formal status defers to expertise”. Additionally, communications here aren’t just going through top to bottom or bottom to top channels. Instead, they span the organization across different groups trying to accomplish specific tasks. Individuals in these high tempo groups are very aware of when any one of their members gets overloaded.

For example, he describes an air traffic controller who might be tracking a large number of flights on a screen. This will oftentimes draw a crowd around him, without any formal process being initiated, that helps him by pointing out signs of danger. This communication mode is characterized by very few spoken words and a lot of pointing at the screen. Once the load has reduced, the group drifts away on its own.

Emergency mode is one that is started by obvious signs that a hazard is imminent. Authority in this mode is based upon predetermined and rehearsed duty allocations. “Individuals regroup themselves into different functional units on the basis of a predetermined plan, tailored to the particular nature of the emergency.”

This is really the big takeaway for me: organizations have 3 modes within them to fit a situation, not necessarily just one mode that they’re stuck in, which La Porte’s research indicated wasn’t well documented at the time. It helps me to be more aware of the mode the organizations and teams I work with operate in.

Also, understanding the Swiss cheese model as it developed, and remembering that some organizations have stopped there in their learning and in how they look at systems, helps me be cognizant of the organizational perspective. That allows me to narrow “the gap between work as imagined and work as practiced”, as Dekker and Woods would say.

Reason closes by asking a number of questions about high risk and high reliability organizations. Is it possible to build adaptive structural ingredients into these organizations, or do they have to evolve that way? Is it necessary for there to be some amount of growing pain?

He concludes that it’s too early yet to tell but suspects that further study of high reliability organizations might reveal this.

“Just as in medicine, it is probably easier to characterize sick systems rather than healthy ones. Yet we need to pursue both of these goals concurrently if we are to understand and then create the organizational bases of system reliability”

