Applying systems thinking to analyze and learn from events

This week we’re going to talk about Nancy Leveson’s paper, Applying systems thinking to analyze and learn from events. It touches on several themes we’ve talked about before, helps tie them together, and even cites some familiar authors, including Woods and Rasmussen. Starting with a familiar format, using an accident as a jumping-off point, she tackles a number of assumptions commonly made today that are incorrect, or perhaps even dangerous.

Estimated reading time: ~10 minutes


Applying systems thinking to analyze and learn from events

Leveson opens by explaining that the reason she wrote this paper, and the reason for the others in this issue of Safety Science, is that learning from accident analysis and experience has not been as effective as one would think.

She goes through and systematically questions a number of assumptions, starting with:

“Safety is increased by increasing the reliability of the individual system components. If components do not fail, then accidents will not occur.”

Leveson explains that “Safety and reliability are different system properties”. She attributes much of this confusion to researchers on high reliability organizations, and their suggestion that highly reliable organizations will also be safe organizations, as well as to a focus on failure events in accident investigation.

She provides the example of the Mars Polar Lander to support this point. Leveson explains that the loss was most likely caused by the “spurious signals” generated when the legs were deployed, which the software interpreted as a sign that landing was complete, so it stopped the engines. She points out that the noise was expected, so it isn’t a failure of the landing leg system. Further, the software was designed to notice such signals, which it did. So both pieces did what they were designed to do.

So if we can’t say a failure caused the accident, then what can we say did? Leveson tells us “the accident occurred because the system designers did not account for all interactions between the leg deployment and the descent-engine control software.”
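To make that interaction concrete, here is a minimal sketch of my own (not the actual lander software; every name, structure, and check in it is an assumption for illustration). The point it tries to show is Leveson’s: each piece does exactly what it was designed to do, and the loss comes from how they interact.

```python
# Illustrative sketch only -- NOT the Mars Polar Lander code.
# All names and details are hypothetical. Each component behaves as
# designed, yet the combination shuts the engines off too early.

class TouchdownSensor:
    """Leg-mounted sensor. Deploying the legs is expected to produce a
    transient 'touchdown' signal (noise) -- normal, not a failure."""
    def __init__(self):
        self.signal = False

    def deploy_legs(self):
        self.signal = True  # expected mechanical transient


class DescentController:
    """Control software that notices and latches any touchdown
    indication, then acts on it -- also exactly as designed."""
    def __init__(self, sensor):
        self.sensor = sensor
        self.touchdown_latched = False
        self.engines_on = True

    def poll(self):
        if self.sensor.signal:
            self.touchdown_latched = True  # remembers the transient

    def on_final_descent(self):
        # What's missing is a system-level constraint on the
        # *interaction*, e.g. "ignore touchdown indications until we
        # are confirmed to be near the surface" -- a property of the
        # design as a whole, not of either component alone.
        if self.touchdown_latched:
            self.engines_on = False  # premature shutdown


sensor = TouchdownSensor()
controller = DescentController(sensor)
sensor.deploy_legs()           # expected noise during deployment
controller.poll()              # software notices it, as designed
controller.on_final_descent()
print(controller.engines_on)   # False: engines off while still descending
```

Neither class contains a bug in isolation; only the design that combines them is unsafe, which is exactly why component reliability alone can’t guarantee safety.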

She echoes Rasmussen (whom she cites throughout), saying that the time when we could take pieces of a system, test them independently, and still get a thorough understanding of how the resulting system would function has passed. We must now consider system design errors as a cause of accidents.

To continue helping us understand and question the assumptions made in other accident analyses, she turns to Rasmussen’s work on the Zeebrugge ferry accident.

We won’t go into extensive detail on the incident, but for background: the ferry Herald of Free Enterprise capsized, killing 193 people. The ferry was running a route different from her usual one, and the linkspan (a kind of drawbridge) at the port wasn’t designed for a ship of that size. It had a few limitations, including that it couldn’t be used to load multiple decks at the same time and couldn’t reach the upper deck when the tide was high.

This was a known issue and one for which there was a workaround: fill the forward ballast tanks, making the bow sit lower in the water. Typically, before dropping moorings, someone from the crew, usually the Assistant Bosun, would close the bow doors, and the First Officer would stay on deck to make sure they were closed before heading to the wheelhouse.

On this day, though, the First Officer headed back to the wheelhouse before the moorings were dropped because he was in a rush (apparently a common occurrence), leaving the doors to the Assistant Bosun.

As you may have guessed, on this day the Assistant Bosun went back to his cabin for a short break after doing some work. The captain assumed the doors were closed, as he had no way of seeing them from the wheelhouse. It is not known why no one else closed the doors. This alone, though, would have been unlikely to capsize the ship, since a sister ship had made the trip in previous years with the doors open and didn’t capsize.

Looking further, Leveson tells us that the depth of the water and the ship’s speed may have contributed. If it hadn’t been for the shallow water, and had the ship been slower, perhaps those on the open car deck might have had a chance to notice the doors were open and close them. Even then, the water coming in might not have been enough to capsize the boat, but the car deck, being completely open, had no watertight compartments and no dividers, just open space for the cars to drive on and off, which could then flood. This water was able to shift when the ferry turned, at which point the ship capsized.

Leveson explains that:

“Those making decisions about vessel design, harbor design, cargo management, passenger management, traffic scheduling, and vessel operation were unaware of the impact of their decisions on the others and the overall impact on the process leading to the ferry accident.”

Each decision, made from the bottom up, could be correct or “reliable” within its limited context and still lead to an accident because of the interactions. Each component of a system can be reliable in certain conditions and operating environments, and accidents can still occur. She points out that eventually any component can be broken given enough time or sufficiently extreme conditions, but that doesn’t mean an accident implies a component failed.

She gives the example of a driver slamming on his brakes. If he does so too late, and hits the car in front of him, we wouldn’t say the brakes failed just because they didn’t prevent an accident in a condition they were not designed for.

So then, what of safety? We can say simply that safety is the absence of accidents. Then safety is a property of the system, not of an individual component. Leveson drives this point home, explaining:

“Determining whether a nuclear power plant is acceptably safe, for example, is not possible by examining a single valve in the plant, and evaluating the safety of a hospital clinical care unit is not possible by examining a single step in a surgical procedure.”

Because of this, Leveson suggests that we turn to systems thinking and systems theory. She recommends we think of unsafe behavior in terms of safety constraints. Doing so allows us to treat safety as a control problem rather than as a failure or reliability problem.
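One way to read “safety as a control problem” in software terms is sketched below. This is my own illustration, not anything from the paper, and the names and constraint are assumptions; it uses the ferry as a hypothetical example. Instead of asking whether each component (or crew member) is reliable, we state a constraint at the system level and have something whose job is to enforce it.

```python
# Minimal illustrative sketch (mine, not Leveson's): a safety constraint
# stated at the system level and enforced by a controller, rather than
# relying on the reliability of any one component or person.

from dataclasses import dataclass


@dataclass
class ShipState:
    bow_doors_closed: bool
    moored: bool


class ConstraintViolation(Exception):
    pass


def enforce_departure_constraints(state: ShipState) -> None:
    """Safety constraint: the ship must not leave the berth while the
    bow doors are open. The controller checks the constraint itself
    instead of assuming someone reliably closed the doors."""
    if not state.bow_doors_closed:
        raise ConstraintViolation("bow doors open: departure not permitted")


def depart(state: ShipState) -> ShipState:
    enforce_departure_constraints(state)
    return ShipState(bow_doors_closed=state.bow_doors_closed, moored=False)


# The path the accident took: no one closed the doors.
state = ShipState(bow_doors_closed=False, moored=True)
try:
    depart(state)
except ConstraintViolation as e:
    print(f"Blocked: {e}")
```

The design choice being illustrated is the shift in question: not “was each piece reliable?” but “was the hazardous system state controlled?”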

She also echoes a point we’ve seen in our previous examinations of human error:

“When the results of deviating from procedures are positive, operators are lauded but when the results are negative, they are punished for being unreliable”

She closes her destruction of this assumption by telling us, “Analyzing and learning from accidents requires going beyond focussing on component failure and reliability…. Top-down analysis and control is necessary to handle safety”

Leveson then goes on to tackle the assumption:

“Retrospective analysis of adverse events is required and perhaps the best way to improve safety.”

She begins by again citing Rasmussen’s work, explaining that systems are rarely static, since both systems and organizations adapt and change. And this adaptation is not just random change, but “is an optimization process depending on search strategies – and thus should be predictable”.

Because of this, retrospective analysis becomes limited, and relying too heavily on it can itself cause loss. Leveson gives us the example of the loss of a Mercury satellite in 1998, when “quality assurance only checked for those things that had led to a satellite loss in the past”. This time, a typo was made in the launch data, something that hadn’t occurred before.

Leveson also tells us that software changes just by running in different environments, so we can’t assume that because it was safe before, it will be safe again.

So again, we must consider the performance of the system as a whole. This has implications for how we analyze accidents as well. We can’t simply create a causal tree or chain and expect to find causes we can change to prevent accidents. She goes back to the example of the ferry: we could construct a chain of events where “it would appear the root cause was the Assistant Bosun not closing the doors and the First officer not remaining on deck to check the doors”, but that doesn’t account for things that aren’t “events”, like the way the deck was built, the pressure to stay on schedule, and so on.

This should sound familiar, as Rasmussen told us before that root cause is often just where we stop looking. Instead, Leveson goes back to her previous work and suggests that causes can be seen at three different levels:

  1. The “basic proximate event chain”. This includes things like not closing the doors, or the First Officer leaving early.

  2. The “conditions that allowed the events to occur”. This level includes things like the higher tide, the loading ramp problems, and the pressure to be on time.

  3. The “systemic factors that contribute to the conditions and events”. This level includes things like the ferry company seeking boats with fast acceleration to remain competitive.

Most accident investigation stops at level 1, she warns us. Because of this, we never have the opportunity to address levels 2 and 3, where changes might prevent future accidents.

Leveson also uses these arguments to attack the assumption that “Accidents are caused by chains of directly related failure events”.

Finally, she attacks the two-part assumption: “(1) Most accidents are caused by operator error and (2) rewarding “correct” behavior and punishing “incorrect” behavior will eliminate or reduce accidents significantly.”

Leveson touched on this a bit earlier, and also reminds us that “human behavior is always influenced by the environment in which it takes place”. She warns that “we design systems in which human error is inevitable and then blame the human not the system design”.

She cites Dekker’s work, telling us that if we truly wish to understand what really causes accidents and to prevent future ones, we must avoid making the goal to assign blame, since “blame is the enemy of safety”, and instead figure out why these people did what they did.

Leveson closes by leaving us with a template for how we might do better: first, document (or improve the documentation of) the existing safety control structure; then examine that structure for inadequacies, not just in the physical system, but in its design and operation as well. She reminds us that if we investigate accidents thoroughly, we should be able to learn a lot from only a few incidents:

“Given the number of incidents and accidents that have identical systematic causes, simply investigating one or two in depth could potentially eliminate dozens of incidents”

