Resilience Roundup - Issue #9 - The high reliability organization perspective

This week we are discussing a book chapter, “The high reliability organization perspective” by Sidney Dekker and David Woods.

Estimated Reading Time: ~10 minutes


The high reliability organization perspective

This is a chapter from Human Factors in Aviation (unsurprisingly to you, I’m sure!), an industry and some authors that we’ve looked at a few times here.

Specifically, in this chapter Dekker and Woods talk about high reliability organizations, the research around them, and what aviation can learn from that research.

Even though this chapter is geared towards what aviation can learn, just like every week here, we’re also going to take a look and see what we can apply to the domain of software.

First Dekker and Woods take us through the origins of high reliability organizations and explain a bit about what they are.

They explain that high reliability organizations are specifically trying to “pull learning forward in time”, and that they are organizations that have avoided accidents in situations where accidents would normally be expected.

A core tenet worth keeping in mind, and one that Woods and Dekker return to several times, is that “safety is not something an organization is, it’s something an organization does”. Just as in the material we’ve looked at over the last few weeks, they point out how little safety information can be derived from looking at individual components. This should be familiar by now: socio-technical systems are complex enough that we can’t simply look at individual components to determine overall system safety.

“Simple things can generate very complex outcomes that could not be anticipated by just looking at the parts themselves”

They also cite work that Woods did with Leveson (a name that should sound familiar) and Hollnagel in 2006:

“instability emerges not from components, but from concurrence of functions and events in time. The essence of resilience is the intrinsic ability of a system to maintain or regain a dynamically stable state”

In the latter part of the chapter, the authors give us some direct lessons that we can learn to help ensure resilience in high reliability organizations.

The first lesson is not taking past success as a guarantee of future safety.

This is pretty self-explanatory, but a few points jumped out at me:

“operators treating their operational environment not only as inherently risky, but also actively hostile to those who misestimate that risk”

“confidence in equipment and training does not take away the need operators see for constant vigilance for signs that a situation is developing in which that confidence is erroneous or misplaced”

The next lesson they give us is “distancing through differencing”.

This is, as the authors explain, a problem that can occur when organizations look at other organizations, departments, industries, or sections of the company and say, “Oh, well, they have different technical issues. They have a different environment. They have different management. Their history is different from ours.” They focus on all the ways those other groups are different and so fail to learn from them, instead of looking at which things and patterns are similar and what they can learn from that information.

Next, “fragmented problem solving”.

This, the authors warn us, is an issue when you divide up a problem solving effort and fracture it across different silos, whether those are sections, contractors vs non-contractors, departments, teams, whatever. Whenever information doesn’t flow freely, whenever you take away the ability for people and teams to be conversant and exchange information, you can end up in a situation where no person or group has a good model of the system in their head.

So it makes it really easy for them to miss information. They don’t have a lot of background, they don’t have an overall picture of the system. Specifically, the authors also looked at handovers like during a shift change and they do cite that as one way to coordinate effort to stop this problem of fragmentation. They cite Patterson’s research in 2004 to explain some of the potential downsides if you fail to transfer this information or if you fail to do a handoff.

These costs for the incoming crew include:

  • Having an incomplete model of the system state
  • Being unaware of significant data or events
  • Being unprepared to deal with impacts from previous events
  • Failing to anticipate future events
  • Lacking knowledge that is necessary to perform tests safely
  • Dropping or reworking activities that are in progress or that the team has agreed to do
  • Creating an unwarranted shift in goals, decisions, priorities, and plans
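As one way to act on this list in a software setting, here is a minimal sketch in Python of a handoff note with one slot per cost above, plus a check for slots that were left empty. The class name, field names, and example incident details are all made up for illustration; none of this comes from the chapter or from Patterson.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class HandoffNote:
    """Hypothetical handoff note; each field mirrors one cost from the list."""
    system_state: str = ""                                        # model of the system state
    significant_events: List[str] = field(default_factory=list)   # notable data or events
    ongoing_impacts: List[str] = field(default_factory=list)      # fallout from earlier events
    expected_events: List[str] = field(default_factory=list)      # things likely to happen next
    safety_notes: List[str] = field(default_factory=list)         # what to know before testing
    work_in_progress: List[str] = field(default_factory=list)     # activities under way or agreed
    current_priorities: List[str] = field(default_factory=list)   # goals, decisions, plans


def missing_sections(note: HandoffNote) -> List[str]:
    """Return the names of fields that are still empty."""
    return [f.name for f in fields(note) if not getattr(note, f.name)]


# Example: a partially filled note written by the outgoing on-call engineer.
note = HandoffNote(
    system_state="Primary database failed over to the replica",
    significant_events=["Error rate spiked during the failover"],
    work_in_progress=["Rebuilding the old primary as a new replica"],
)
print(missing_sections(note))
```

The point is not the tooling. An empty section is just a prompt for the outgoing person to ask whether there is really nothing to say there, or whether something is about to get lost in the handoff.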

This situation might sound familiar to you. It’s likely one that you have experienced yourself. Perhaps you joined a conference bridge where information wasn’t being shared or wasn’t available. Perhaps you joined later than some others and no one filled you in. This list from Patterson is likely very familiar, especially having an incomplete model of the system state and being unaware of significant data or events. From this alone, we can work backwards and keep these costs in mind as we transfer knowledge to others, asking ourselves: “Are we creating that fragmented problem solving situation? Are we failing to do handoffs?”

Next, “The courage to say no”.

This approach is a blend of both of the authors’ previous work, where they discussed that it’s important to have someone within the loop of the system in the organization that can go against what might be the common wisdom or a common reading of the data.

Specifically, this should be someone with “authority, credibility and resources”. This is because, over time, trade-offs can be made in organizational goals that eventually lead to a sort of tunnel vision: a narrowing focus that obscures or forgets about other goals. Often this can look like hurrying to get to production or to hit a certain efficiency while failing to take into account long term goals like safety.

“Sometimes people need the courage to put chronic goals ahead of acute short term goals. Thus, it is necessary for organizations to support people when they have the courage to say no in procedures, training, feedback on performance as these moments serve as reminders of chronic concerns even when the organization is under acute pressures that can easily trump the warnings.”

Next, the “ability to bring in fresh perspectives”.

In this section, the authors explain, citing several sources, that groups which include people who can bring new perspectives into problem solving and brainstorming, people with different backgrounds and diverse viewpoints, “seemed to be more effective. They generate more hypotheses, cover more contingencies, openly debate rationales for decision making and reveal hidden assumptions”.

Some HRO studies have shown that this occurred naturally in teams that constantly rotated their personnel through different areas, so that fresh viewpoints were always being introduced. Obviously, how we apply this in our individual organizations and teams is going to vary a lot with their size, their structure, and their ability to reach others. But this is something to keep in mind when problem solving, or perhaps even in incident response.

“the alternative readings that minority viewpoints represent, however, can offer a fresh angle that reveals aspects of practice that were obscured from mainstream perspective.”

The authors also explain that all of this can help keep a discussion going about what is risky, especially when everything seems to be safe. A lot of organizations might stop talking about risk at that point. It turns out that simply continuing to think about safety, and recalibrating our models of what is safe and what is risky, can be helpful. It can keep us from having too much confidence, or misplaced confidence, that everything is perfectly safe. It’s important to note that this isn’t the same as thinking that everything is always going to be catastrophic. This is not “the sky is falling” sort of thinking:

“extreme confidence and extreme caution can both paralyze people and organizations because they sponsor a closed mindedness that either shuns curiosity or deepens uncertainties”

Next, “knowing the gap between work as imagined and work as practiced”.

This approach is going to be very familiar to those of us who have looked into DevOps practices. The idea is that if management, stakeholders, or others have a different picture of how the system is actually operated, then the larger that gap in their understanding, the more likely it is that they are misunderstanding, or missing entirely, how safety is being created in these systems.

If they think that certain operations are very easy and very safe, and are making design decisions or setting direction on that basis while misunderstanding what it’s like to actually operate the system, then narrowing that gap, whether through a DevOps cultural shift or some other means, would help address this.

The authors don’t leave us with a specific way to do this, as it clearly depends very much on the organization.

Finally, “monitoring of safety monitoring (or meta-monitoring)”.

This is essentially the idea that if we develop certain models of what is risky, and we develop plans, policies, or other ways to deal with that risk, we could also be looking at whether those continue to match reality over time. Are they still effective? Do they still represent the risks that are potentially present? Do they still represent things we care about? This sort of meta-monitoring is what the authors mean here.

The authors close the chapter as a whole by discussing a potential shift in both language and thought, from high reliability organizations to high resilience organizations.

Again, they emphasize that safety is not something these organizations have. It is something that organizations do. The authors remind us that the organizations and their operators are operating as adaptive systems. They are continually looking at their work and changing their process, their approach to the work itself, so that they are sensitive to this notion of failure.

They explain that traditional ideas of reliability in engineering, the component level safety that we’ve discussed before, have been optimized almost as far as they can be. Further investment there, in the field of aviation at least, is unlikely to yield much. They even suggest that it might become problematic over time. Instead, the authors close by suggesting that aviation move from a reliability model to one of resilience, which is built on some of these same ideas. This is of course something we could learn from in software as well.

Specifically, they suggest that looking at technical hazards is one input into resilience engineering, but that the overall goal should be looking at organizational decision making. That means looking for evidence that cross checks are well integrated when risky decisions are made, or that the organization allows for practice at handling various simulated problems, and then of course monitoring which problems are practiced.

We’re also left with a list of dimensions of risk that organizations can map themselves against:

  • Pressures of production versus safety
  • Learning culture or that of denial
  • Proactivity in noticing evidence of problems or reactivity
  • Monitoring safety through multiple people and levels or isolating that to only a small group or even an individual
  • Flexibility versus stiffness

Finally, the authors leave us with three things that they believe highly reliable aviation organizations of the future will have become good at. These three basics of resilience engineering are:

  1. Detecting signs of increasing organizational risk, especially when production pressures are intense or increasing.
  2. Having the resources and authority to make extra investments in safety at precisely the times when it appears least affordable.
  3. Having a means to recognize when and where to make targeted investments to control rising signs of organizational risk and rebalance the safety and production tradeoff.

These mechanisms will produce an organization that creates foresight about changing risks before failures and harm occur.

Exactly the sort of foresight we could use more of.

