Resilience Roundup - Learning from Automation Surprises And ‘Going Sour’ Accidents - Issue #41

Learning from Automation Surprises And “Going Sour” Accidents

This is a chapter by David Woods and Nadine Sarter in which they discuss their findings across many years and many studies. The paper draws on a number of investigations that the authors have done together (some of which we’ve covered before) and also adds in surveys, the experience of other operators, and other studies of incidents. The authors even worked with an FAA team that researched how flight crews and modern flight systems work together.

The chapter is from Cognitive Engineering in the Aviation Domain. As the title suggests, the authors look at the influence of automation in commercial airliners and how it affects human performance.

They start off with a number of questions that I think are pretty similar to the questions we are beginning to ask ourselves as developers and operators of software systems with increasing automation.

What are these problems with cockpit automation, and what should we learn from them? Do they represent over-automation or human error? Or is there perhaps a third possibility: that they represent coordination breakdowns between operators and the automation?

This chapter focuses on a specific pattern of incident that can occur that they call “going sour”.

Looking broadly at all the research that they have done in this area, the authors noticed a pattern of “automation surprises”. This is where humans are surprised by what the automation does or does not do in response to a certain situation or input. Automation surprises start with some sort of miscommunication and mis-assessment between the automation and the human user. This then creates a gap between that user’s understanding of what the system is doing and, perhaps more importantly, what it will do in the future. This can often begin with mode error.

Interestingly, detection of this situation doesn’t come from the displays and status gauges that are available, but instead occurs when some sort of unexpected behavior happens. Unfortunately, this might be too late for the crew to have time to intervene and avert disaster.

Based on the evidence, they found that the chance of having an automation surprise is the highest when three factors happen at the same time:

  1. Automated systems acting on their own without immediately preceding human input
  2. Gaps in the users’ mental models of how the automation works in various situations
  3. Weak feedback about what the automation is doing and what it will do in the future based on the state of the system or environment

As a field that often creates automation and also has to interact with it, I think we’re in a unique place where it’s especially important to keep these things in mind. This wasn’t the only form of coordination breakdown between automation and crews that the authors saw, but I think it’s probably one of the more familiar ones for our field.

Of course, the presence of these factors, or even an automation surprise itself, doesn’t mean that there will be disaster. Most of the times that these things occur there aren’t really any significant consequences. That sounds great, but in a way it actually fuels the problem. The authors point out a pattern that they’ve noticed, a series of events where:

  • automated systems affect humans in predictable, sometimes negative ways
  • there are some sort of precursor events that happen where these problems occur, but in ways where the outcome is okay or other actions kept it from going badly
  • occasionally though, these occur with some other contributing circumstances and events spiral into disaster

Going Sour Accidents

So we can see from the factors that create automation surprise and the pattern above that this is a general category of accident that isn’t specific to aviation. In fact, this pattern has also been seen in operating room incidents.

This class of accidents happens when something occurs that by itself appears to be minor, but through that miscommunication and mis-assessment the joint system or team of the human and automation responds in a way that creates disaster. The authors note that historically in aviation this is called “controlled flight into terrain” but they suggest that it might be better to describe them as “managed flight into terrain” because the automated systems are controlling the aircraft and the crew is managing the automation.

Based on the research and observations, the authors noticed that this going sour pattern seems to be a side effect of complexity. The research and incident data indicate that when new technology and systems are built in a technology-centered rather than human-centered way, operational complexity increases, which in turn increases the chance of these going sour incidents.

As with most incidents, hindsight can make it seem easy to find places where the chain of events could’ve been broken. It turns out that this can actually be used as a sort of litmus test for whether or not a given incident is a going sour scenario. If, given the advantage of hindsight, a reviewer might say something like “all of the necessary data was available, why was no one able to put it all together to see what it meant?”, you can be pretty sure that you’re looking at a going sour incident.

Experience of operators

One thing I like about this research is that it includes the experience and perspectives of the operators, in this case flight crew. It’s really interesting to take a look at some of the things they said about automation:

“I know there’s some way to get it to do what I want.” “Stop interrupting me while I’m busy.” “Unless you stare at it, changes can creep in.”

I think most of us have had similar experiences when working with automation that is not a “team player.” I know I have heard it from users of systems I’ve worked on as well.

How to treat

Fortunately these going sour type accidents are pretty rare even in very complex systems. This is usually due to two factors:

  • Operators are using their expertise to avoid problems or at least stop the incident from getting worse
  • The problem is only disastrous when a number of other circumstances come together

That first point was a very common feature of the research: operators use their expertise to make up for deficiencies, or to compensate for features present in the automation that would normally contribute to a breakdown in coordination. And this didn’t come just from operators themselves. The authors also heard from training departments, who are creating their own workarounds to get the job done. These ranged from notes like “be careful around a particularly tricky part of the automation” to strategies for teamwork to ways to restrict or reduce parts of the automation, especially during difficult situations.

“Overall, operational people and organizations tailor the behavior to manage the technology as a resource to get the job done, but there are limits on their ability to do this”

I think this is something that is likely also very familiar to most of us, whether creating, using, or training. Since training and experience are among the main ways that these crews develop their expertise and strategies for managing this automation, a practice ground is all the more important. That’s something that I’ve seen in our field as well: we don’t always have places or ways to practice. I often recommend to the teams I work with that they set up tabletop scenarios or game days with similar situations to allow practice. This is another area where aviation seems to be fairly far ahead, in that high fidelity simulators are an expected part of training.

There is a problem though when economics and competition create pressure to reduce how much is invested in training. Further, when there are improvements to the training, those benefits are taken in the form of productivity or efficiency instead of quality. This is where training programs do the same training in less time instead of preserving that time window and doing better training in that time.

“the goal of enhanced safety requires that we expand, not shrink, our investment in human expertise.”

What about complexity?

As we touched on earlier, going sour accidents occur in circumstances where many factors come together. The authors give an example:

  • Human performance is eroded due to local factors (fatigue) or systemic factors (training and practice investments)
  • crew coordination is weak
  • the flight circumstances are unusual and not well matched with training experiences
  • transfer of control between crew and automation is late or bumpy
  • small, seemingly recoverable erroneous actions occur, interact and add up.

Because these issues stem from system complexity, most local optimizations will not help much here. Instead, each piece needs to be better coordinated.

Designer responses

The authors provide some examples of the sorts of things that they’ve heard from system designers when those designers face evidence of coordination breakdown between humans and automation:

  • The hardware/software system “performed as designed” (crashes of “trouble-free” aircraft).
  • “erratic” human behavior (variations on this theme are “diabolic” human behavior; “brain burps,” that is, some quasi-random degradations in otherwise skillful human performance; irrational human behavior).
  • The hardware/software system is “effective in general and logical to us, and some other people just don’t understand it” (e.g. those who are too old, too computer phobic, or too set in their old ways).
  • those people or organizations or countries “have trouble with modern technology”
  • other parts of the industry “haven’t kept up” with advanced capabilities of our systems

Escaping this view

You can see from these comments from developers and designers that the mindsets of blaming human error or blaming over-automation are still firmly in effect. Neither of these views considers how to create successful coordination.

“The bottom line of recent research is that the technology cannot be considered in isolation from the people who use and adapt it”

So how do we escape this view? By improving the coordination between human and automation, as in human centered design. (See issue 39 for more about various design approaches and acronyms)

There are some accident analyses that give statistics that say a breakdown in human performance contributes to as much as 75% of mishaps in aviation. Some seem to view this as an indicator that human error is a larger problem when in fact it really should just be more reason to pay attention to human factors at work.

So what is human centered design? HCD is:

  1. Problem driven
    • This means taking the time to understand what the challenges are in a given field and how people cope with them.
  2. Activity centered
    • This means focusing on the activity that the joint system of the human and the automation is trying to achieve instead of treating them as two separate systems.
  3. Context bound
    • This means keeping in mind that human performance and collaboration depend on the context where it is occurring.

Making Progress

The authors provide a general, non-exhaustive list of some strategies that an organization can implement to help improve safety:

  • increase the system’s tolerance to errors
  • avoid excess operational complexity
  • evaluate changes in technology and training in terms of their potential to create specific kinds of human error
  • increase skill at error detection by improving the observability of state, activities, and intentions
  • invest in human expertise

Obviously these are all very high level strategies, but I think that last point is one that I’ve seen fading in our industry over time.

Let’s briefly take a look at a few of these.

Avoid Excess Operational Complexity

The authors are quick to note that this is not easy, primarily because there is no single point or person or department that decides to make a system complex. It becomes complex as features are added, performance is increased, or other options are provided. Ultimately though, the cost of this complexity ends up being paid by the operator, who then has to manage more of these features and modes, and when they are unable to do so it gets classified as “human error.” But you can’t replace the person and say that those problems or risks have been mitigated. This tells us that the solution is to fix the system as a whole, increasing coordination across an organization or even an industry.

I think it’s important to say here that this is about excess operational complexity. Obviously there are points where our systems and products are able to be successful because they are complex. They are solving complex problems. That doesn’t mean that there isn’t simplification that can or perhaps needs to be done.

The authors focus on the example of different modes and how many are taught in training versus how many are available. They also acknowledge that deciding which of those modes are excessive is difficult, especially when multiple modes can be used to achieve the same or similar results.

They also mention observability, something that also gets talked about a lot in software. Though I think in software it has sort of drifted to mean different things and often many things, some of which are potentially conflicting. Here they define observability as:

“observability is the technical term that refers to the cognitive work needed to extract meaning from available data.”

I really like this definition, and I think it’s something that I’ve been trying to increasingly keep in mind when I’m developing systems and processes for others.

Better Feedback

Where we do have to have complexity in our systems and automation, we can offset it and balance it with improved feedback, especially feedback about what the system is doing and what it will do in the future given the state of the environment it’s operating in. This is especially relevant to us in software. Additionally, the research showed that we also need better alerting. Alerts that just ask pilots to look at something more closely, or procedures that force them to read a data point aloud, are not very effective in redirecting their attention in a changing environment.

The authors specifically call out that we can’t solve this problem by adding a new alert each time something comes up. I thought this was especially interesting, since I often see that approach used in creating dashboards and alerting.

“one cannot improve feedback or increase observability by adding a new indicator or alarm to address each case one at a time as they arise.”
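
As a rough illustration of feedback about what automation is doing and what it will do next, here is a minimal sketch in Python. It is not from the chapter: the `Autoscaler` and `Status` names, the thresholds, and the scaling rule are all hypothetical. The idea is only that the automation can report its mode, its reason, and its next intended action before it acts, so an operator doesn’t have to infer them from behavior after the fact.

```python
from dataclasses import dataclass


@dataclass
class Status:
    mode: str         # what the automation is doing now
    reason: str       # why it is in that mode
    next_action: str  # what it will do next if nothing changes


class Autoscaler:
    """Hypothetical automation that scales a service based on CPU load."""

    def __init__(self, replicas: int = 2, high: float = 0.8, low: float = 0.3):
        self.replicas = replicas
        self.high = high  # assumed scale-up threshold
        self.low = low    # assumed scale-down threshold

    def plan(self, cpu: float) -> Status:
        """Describe current and upcoming behavior without acting on it."""
        if cpu > self.high:
            return Status("scaling up",
                          f"cpu {cpu:.0%} > {self.high:.0%}",
                          f"add 1 replica (to {self.replicas + 1})")
        if cpu < self.low and self.replicas > 1:
            return Status("scaling down",
                          f"cpu {cpu:.0%} < {self.low:.0%}",
                          f"remove 1 replica (to {self.replicas - 1})")
        return Status("holding", f"cpu {cpu:.0%} within thresholds", "no change")

    def step(self, cpu: float) -> Status:
        """Act on the plan and return the status that was announced."""
        status = self.plan(cpu)
        if status.mode == "scaling up":
            self.replicas += 1
        elif status.mode == "scaling down":
            self.replicas -= 1
        return status


scaler = Autoscaler()
preview = scaler.plan(0.9)   # ask what it would do, before it does it
print(preview.mode, "|", preview.reason, "|", preview.next_action)
```

Separating `plan` from `step` is the design choice that matters here: the same logic that drives the action also produces the explanation, so the two can’t drift apart.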

Enhancing Human Expertise

Increasing automation is often cited as a reason to worry less about or invest less in human performance, but the opposite is needed. The more automation that is added, especially with different states and modes to understand, the more knowledge and skills are needed to handle the new situations. One way to build these is to increase opportunities to practice.

In many pilot training centers the authors discovered that it was pilots themselves who created the guides to the automation. This shows clearly that they, like most professionals, want to improve their skills and knowledge. The shrinking of training time and opportunity is obviously not the result of it being unwanted by practitioners.

Takeaways

  • A going sour incident is a rare event that starts from a coordination breakdown between humans and automation
  • Human centered design can help solve some of the problems associated with developing systems from a technological perspective or one in which the human and machine are seen separately.
  • When training programs improve, the gains are often taken as productivity and efficiency as opposed to increased quality or thoroughness.
  • Avoiding excess operational complexity in systems can help reduce the likelihood of a going sour incident, but requires systemic fixes, not just local optimizations.
  • Better feedback can help balance complexity, specifically feedback about what the automation will do.
  • More expertise is needed as automation is added, not less
  • Adding alerts one at a time as they arise isn’t likely to help observability and may even harm it.
