Here we're looking at an article by Sidney Dekker and David Woods. They looked at what happens when humans move into a more supervisory role for a process, one in which they're primarily intended to handle unexpected cases. As you might expect, things don't exactly go as planned.
To Intervene or not to Intervene: The Dilemma of Management by Exception
Instead of actively controlling aircraft, controllers would become "traffic managers": as aircraft gained more individual autonomy, air traffic control would step in only when something could not be handled normally. While the authors focused on air traffic controllers and pilots, what they learned has direct applications to us in software systems, especially SRE-type teams.
As the title indicates, the process is called management by exception. It's been around as a management approach for a very long time. The core idea is that managers intervene when someone in their reporting structure brings them a case that cannot be handled by existing processes, and primarily supervise passively the rest of the time. What prompted the authors to revisit the issue was a suggestion by a standards body task force (RTCA Task Force 3) that air traffic controllers begin to follow a similar pattern.
The authors use a Peter Whicher quote to summarize the goal for ATC:
"To permit unrestricted ATC growth, we should first determine how to eliminate one-to-one coupling between a proactive sector controller and every aircraft in flight. The basic requirement is to minimise human control involvement in routine events, freeing controllers to concentrate on key areas where human skills have most to offer..."
This idea that we'll automate what machines are good at and leave humans to do what humans are good at is a fairly old one, but still quite popular in many fields, including software. The problem is that it ignores the joint system of the human and computer working together; treating them as separate disregards the cognitive work of the human.
The authors have done a lot of research in this area across multiple studies, but this article focuses on just one of them, and it is a very good study for us to learn from.
If you're still not convinced that this applies to software, I want to leave you with this quote from Earl Wiener, who could so easily have been talking about us:
"Developers and researchers have tried to avoid considering the complications of judging/anticipating exceptions, as defined above, by reducing supervisory interventions to fixed rules about how to respond to specific situations. Pre-defined situations (often via threshold crossings on process parameters that can be measured) become the triggers for supervisory intervention and are compiled in procedures and policies for the supervisory controller"
The simulation
In order to discover the cognitive effects of this shift to management by exception in ATC, they created a simulation environment in which they placed a couple of controllers, a pilot, and a flight dispatcher.
They don't give a ton of detail, but the way they set up this experiment is a good example to learn from. It shows us that technically and cognitively demanding processes can be practiced and learned about without needing much technology at all.
I hear all the time, "How do we get started with game days?" or "What do we do in a game day?". Often, teams want to have a very technical solution right off the bat, typically defaulting to the idea that since the problem space is technical, the game day must be equally technical to be useful. As demonstrated here, that clearly isn't the case.
They set up a view of what the radar would look like, with indicators for where various aircraft were, as well as maps of airspace that represented a possible future state. They had participants review the task force's suggested rules and processes beforehand. They also provided mocked-up advisories and handbook pages that matched the world they were simulating (as it would be if the suggested procedures were adopted).
The situation they presented was that an aircraft set its transponder to inform others that it had a radio failure and problems with the systems that managed altitude and collision avoidance. There were already inbound aircraft in another direction, and to make the situation even more uncertain, the aircraft had just climbed to a higher altitude before indicating the issue. The participants were also given a flight plan indicating that the aircraft had been headed for its home base.
It was unclear whether the aircraft would continue with that plan or land at the nearest available airport.
Overall, the participants dug through the information they had access to, trying to determine how the situation would unfold. They also looked at how others might be responding. The authors sum it up as the participants working to answer the question "Is this a situation I need to get involved in?". This was a pattern seen across all of the studies. They also debated what strategies they could use to keep the situation from getting worse.
The participants debated what the aircraft would do and where it would go. Interestingly, there are some policies and procedures around what an aircraft may do in this situation, but "there is only loose coupling between procedure and practice". That sounds a lot like much of the documentation I've encountered. Further, the documentation tends to describe what an aircraft might do, not what it will or should do, nor what others should do.
During the experiment no one could really agree on when to intervene or to what degree. Intervening early was problematic because it could have a number of downstream effects on later traffic. Waiting to intervene would allow time to gather more evidence, but could mean that there were no longer any options left to change the situation. This was further complicated by information being fragmented. In a world where other aircraft have more autonomy and are not immediately under "positive control", the information needed could be spread out to such a degree that no one person could put it all together.
This experiment and others have shown that when we use a framework like management by exception, there are three interrelated judgments that make it cognitively difficult:
- When to intervene
- How broadly to intervene (in the experiment this would be the number of aircraft to direct)
- How much authority should be taken away from the agents being supervised
Ultimately the participants "finally arrived at an option where they effectively excluded themselves from any control over the situation." They decided that incoming aircraft would have a better view of the situation than they would, and thus they would do nothing.
The results
The authors say "turning the human into a higher level supervisor often does not reduce human task demands, but changes them in nature". That certainly was the case here.
One thing that hits very close to home for me is when the authors say:
"the idea that an exception manager would intervene and supervise process only when the report reaching him or her demanded so, has remained in modern treatments as well."
This quote alone seems like a description of every major monitoring or alerting process I've seen.
One issue is that when we talk about a supervisor "stepping in", that's a very vague notion. Previous work by Thomas Sheridan identifies as many as 10 levels of control (there's a code sketch of this scale after the list):
The subordinate:
- offers no assistance: human supervisor must do it all;
- offers a complete set of action alternatives, and
- narrows the selection down to a few, or
- suggests one, or
- executes that suggestion if the supervisor approves, or
- allows the supervisor a restricted time to veto before automatic execution, or
- executes automatically, then necessarily informs the supervisor, or
- informs him after execution only if he asks, or
- informs him after execution if the subordinate decides to, or
- decides everything and acts autonomously, ignoring the supervisor.
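To make these levels concrete for software, here's a rough sketch (my own encoding, not from the paper) of how an auto-remediation or deployment tool might expose them as a scale. The names are hypothetical; only the ordering comes from Sheridan's framework.

```python
from enum import IntEnum

class AutonomyLevel(IntEnum):
    """Sheridan's levels of control, paraphrased as a knob a tool could expose."""
    NO_ASSISTANCE = 1          # supervisor does everything
    OFFER_ALL_OPTIONS = 2      # subordinate lists every alternative
    NARROW_OPTIONS = 3         # narrows the list to a few
    SUGGEST_ONE = 4            # recommends a single action
    EXECUTE_IF_APPROVED = 5    # acts only after explicit approval
    EXECUTE_UNLESS_VETOED = 6  # acts once a veto window expires
    EXECUTE_THEN_INFORM = 7    # acts, then always informs the supervisor
    INFORM_IF_ASKED = 8        # acts, informs only if asked
    INFORM_IF_IT_CHOOSES = 9   # acts, informs at its own discretion
    FULLY_AUTONOMOUS = 10      # acts and ignores the supervisor entirely

# Every step up this scale changes, rather than removes, the cognitive
# work the supervisor has to do to keep following along.
```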
So when we talk about a supervisor stepping in, the cognitive work of following along, knowing when to intervene and at what level, and when to switch levels is very often just overlooked. That's because most of the definitions of supervisory work make a number of assumptions:
- That the deviations being looked for are known in advance
- That the evidence of anomalies is obvious
- That the level of intervention required is a clear-cut decision
Of course none of these are very likely to be true for any sort of challenging situation. This view also assumes that the supervisor is sitting around passively waiting for information, but from research in management we know that effective supervisors actively seek out information. Some of these issues stem from the fact that research on supervisory control has blurred the definition of what an exception is, merging two different things: first, some sort of anomaly in the monitored process itself; second, the judgment about whether the situation is being handled well, or will be handled well in the future, either by automation or by other humans.
These clearly aren't the same, but both get the label "exception". To help clarify, the authors define an anomaly as something going on in the monitored process that is not behaving as expected, whereas an exception is a judgment about how well another human or the automation is handling the situation, or might handle it in the future.
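In monitoring terms, that distinction might look something like the following sketch (all names and signals are hypothetical): the anomaly check looks only at the monitored process, while the exception judgment also asks whether whoever is currently handling it, human or automation, is keeping up.

```python
from dataclasses import dataclass

@dataclass
class ProcessState:
    error_rate: float           # observed behaviour of the monitored process
    expected_error_rate: float  # what "behaving as expected" means here

@dataclass
class HandlerState:
    remediation_in_progress: bool   # is someone/something already acting?
    recovery_trend_improving: bool  # does that handling appear to be working?

def is_anomaly(process: ProcessState) -> bool:
    """Anomaly: the monitored process is not behaving as expected."""
    return process.error_rate > process.expected_error_rate

def is_exception(process: ProcessState, handler: HandlerState) -> bool:
    """Exception: a judgment that the current handling won't be good enough,
    so a supervisor should consider stepping in."""
    if not is_anomaly(process):
        return False
    return not (handler.remediation_in_progress and handler.recovery_trend_improving)
```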
Even with that distinction, some open questions remain, namely: what counts as enough evidence that an exception will occur, and how should that evidence help decide what level of intervention to use?
These are all things the participants struggled with, and things I think we as software practitioners struggle with as well. Much of our incident response structure follows similar patterns.
"computerised support that is not cooperative from the human controller's perspective... can make the relatively easy problems in a controller's life go away, but make the hard ones even harder."
Takeaways
- Having people be exception managers doesn't reduce cognitive load; it just displaces it or even increases it
- Despite this, the model is common and often used, especially in software system monitoring
- There are three intertwined judgments that need to be made
- When to intervene
- How broadly
- How much authority is taken from the other party (human or otherwise)
- There are many different levels at which someone can intervene
- Intervening early and intervening late each have their own downsides.
- Fragmenting information throughout the system made it hard for any single person to put it all together.
- Simulations need not be very high fidelity in order to be useful