Resilience Weekly - Cognitive demands and activities in dynamic fault management - Issue #14

Welcome back!

This week we’re talking about a chapter by David Woods in Human Factors in Alarm Design about how to deal with “dynamic faults”.

Dynamic faults are failures that occur in complex situations. They are different from a broken device on a workbench because their state changes over time. This is a pretty apt description of many tech incidents.

P.S. I’ll be going on holiday break for a couple of weeks, so this will be the last issue of 2018. See you folks again in 2019!


Cognitive demands and activities in dynamic fault management: abduction and disturbance management

Quotes:

  • “Revision of hypotheses is a critical component of abductive inference and dynamic situations”
  • “The critical difference between a major challenge and a minor disruption is not the symptoms by themselves, but rather the force with which they must be resisted.”

This chapter focuses on how people deal with these sorts of incidents, specifically the “cognitive demands”. Woods describes the “cascade of disturbances”, the idea that a fault in a system produces a number of effects throughout the system that cause it to further change its state. This can be difficult to track and respond to correctly.

These cascades are part of what can make it difficult to create a good alarm. For example if the cascade is very fast you may just have a huge list of alarms where it’s not clear what’s really happening. You simply see a bunch of alarms going off.

Diagnosing these usually means understanding the order of events over time, so that you can determine how the fault propagated through the system. A dashboard that is essentially flashing red with a long list of simultaneous alerts doesn’t really help here.
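One way to make a burst of alarms easier to read is to collapse the repeats and order the alarms by when each one first fired. Here is a minimal Python sketch of that idea. The alarm names, the timestamps, and the `first_fired_order` function are all made up for illustration; none of this comes from Woods's chapter.

```python
from datetime import datetime


def first_fired_order(events):
    """Collapse repeated alarms and order them by when each first fired.

    `events` is an iterable of (timestamp, alarm_name) pairs in any order.
    Returns a list of (first_timestamp, alarm_name, count) tuples.
    """
    first_seen = {}
    counts = {}
    for ts, name in events:
        # Remember the earliest time we saw each alarm.
        if name not in first_seen or ts < first_seen[name]:
            first_seen[name] = ts
        counts[name] = counts.get(name, 0) + 1
    return sorted(
        ((ts, name, counts[name]) for name, ts in first_seen.items()),
        key=lambda row: row[0],
    )


# Hypothetical alarm burst, in the jumbled order a dashboard might show it.
events = [
    (datetime(2018, 12, 17, 10, 0, 5), "api_latency_high"),
    (datetime(2018, 12, 17, 10, 0, 1), "db_primary_unreachable"),
    (datetime(2018, 12, 17, 10, 0, 3), "network_link_saturated"),
    (datetime(2018, 12, 17, 10, 0, 7), "api_latency_high"),
    (datetime(2018, 12, 17, 10, 0, 4), "network_link_saturated"),
]

for ts, name, count in first_fired_order(events):
    print(ts.time(), name, f"(fired {count}x)")
```

The ordering alone doesn't prove that one alarm caused the next, but it gives responders a starting point for reasoning about how a disturbance may have spread.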

People responding to the incident are going to be acting in response to the disturbances that propagate through the system. Typically they’re going to be working to keep meeting whatever the important goals of the system are. That might be system safety, availability, or data integrity.

This occurs at the same time as they work to diagnose and fix the problem itself, though it is a different activity.

Because they are working to keep the system up or available it’s also possible to mask what the fault actually is. Some of the symptoms of that fault can be lost. This is something good to keep in mind as we design our dashboards and alerting systems. For example, if a fault is typically detectable by a slow network response but someone has enabled some sort of cache to help with load, then we may not notice this symptom that could have helped us diagnose the problem.

Automatic systems can contribute to this difficulty in diagnosing as well. You can end up in a situation where everything looks okay until all of a sudden it’s not. Just like people in shock, systems can decompensate suddenly. At first it can look like the symptoms are small or nonexistent, because the automated systems are doing their job of compensating, and then suddenly they run out of resources. This makes the issue go from appearing small to becoming very large, very quickly.

This sort of opaqueness can occur when automatic systems don’t provide enough feedback. See the previous issue on Don Norman’s work for more about that.

We need to be careful that we don’t set up systems that lack feedback, because once they enter a decompensated mode it can be difficult for responders to understand what had been happening before. What was going on to compensate before, and how did it fail?
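One way to provide that kind of feedback is to report how hard the automation is working to compensate, not just whether the output currently looks healthy. Below is a minimal Python sketch, assuming a hypothetical compensating resource with a known upper limit (for example, an autoscaler with a maximum instance count). The `CompensationStatus` class, the `describe` function, and the 20% warning threshold are illustrative assumptions, not something from the chapter.

```python
from dataclasses import dataclass


@dataclass
class CompensationStatus:
    used: int   # how much of the compensating resource is in use
    limit: int  # the most the automation can ever use

    @property
    def headroom(self) -> float:
        """Fraction of compensating capacity still available (0.0 to 1.0)."""
        if self.limit <= 0:
            return 0.0
        return max(0.0, 1.0 - self.used / self.limit)


def describe(output_healthy: bool, status: CompensationStatus,
             warn_below: float = 0.2) -> str:
    """Summarize system state, surfacing compensation that is close to running out."""
    if not output_healthy:
        return "degraded"
    if status.headroom < warn_below:
        return "healthy, but compensation nearly exhausted"
    return "healthy"


# Same healthy-looking output, very different situations underneath.
print(describe(True, CompensationStatus(used=4, limit=20)))
print(describe(True, CompensationStatus(used=19, limit=20)))
```

The point of the sketch is only that "healthy" and "healthy, but about to run out of ways to stay healthy" should look different to a responder.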

As incidents evolve, responders will devote most of their cognitive processing to tracking and understanding how the situation is evolving over time by examining the visible influences.

It can be difficult to keep track in these situations, especially with multiple faults that each disturb the system in their own way. This can make it hard to sort out what break or failure is responsible for what, or even what happened when.

If the failure affects something that’s highly connected, diagnosis becomes further obscured. These are all situations where we want to make sure that our monitoring and alarms work with us and support us in diagnosing, as opposed to just lighting up and continuing to alarm without context.

This includes things like the ability to see these changes and influences grow or shrink throughout the system. Sometimes this can be as simple as a line graph in a dashboard; that might be enough to say that something was problematic and has since responded to some sort of intervention.

These cues are critical to being able to create new theories about what’s going on in the system, or to notice that our initial theory was wrong and adjust.

Most often diagnosis is thought of as just connecting faults and symptoms. But in dynamic situations, they may not be directly connected. This might sound strange at first, but you’ve probably experienced this at least once during a complex incident, where one thing affects another, which affects another.

For example, a service may become slow and then unavailable. First you might notice a saturated network link. After looking further you see a database was offline, causing requests to retry. Then you may look further into the database.

How our systems are set up to sense these sorts of things can affect how we understand them. They help us interpret the data we see. For example, is that saturated network link simply a symptom of a different fault?

This relationship between the symptoms and disturbances depends on how the fault is propagating through the system, what it disturbs, and how we have our monitoring set up. It is of course possible for an individual to try to understand the data just by looking at symptoms, but it can be hard to get the best picture with this strategy.

In most cases, we can’t just take the system or a piece of the system offline. This is what contributes to the situation being dynamic, to it continuing to evolve. If we could take the system offline and troubleshoot in isolation, it wouldn’t be so complex. But the reality of our situations tends to be that we must keep the system operational: we must keep it serving traffic and allowing our customers to use it.

Keeping the system up and diagnosing happen at the same time

As a result, incident responders are going to be working to meet some system goals alongside their efforts to fix the problem. But which goals they try to meet, and to what degree, is likely to change as the incident changes. Some goals may be abandoned entirely. As situations get worse, one might say that customers being able to interact with the system is no longer important if it means data integrity is going to suffer.

Different response strategies

Responders don’t need to know exactly what the potential causes of the problem are if they are able to successfully mitigate it. When facing a time crunch, though, it can be difficult not to become fixated on searching for the cause, even if finding it won’t help treatment.

Woods cites a study where 1/3 of anesthesiologists treating a patient didn’t adequately manage the patient’s hypotension. They were overly focused on searching for the source of the problem instead of treating it. This reminds me of a saying that we learned in emergency medicine, “fix you find”, for rapid assessments.

Research also indicates that action tasks are likely to start first, prior to diagnosis, and then continue in parallel with the search for diagnostic information. The interventions that are made are typically going to help the system goals mentioned earlier, but also to reveal more information: to see whether the system behaved as expected as a result of the change, or whether a new theory is needed.

Another strategy is to look at “control influences” that are part of the system. This looks like responders asking themselves questions like “Did I change something recently that would affect this?” or “Did I do what I thought I did?” This occurs across multiple domains; for example, anesthesiologists will double check that they gave the right dose or drug.

Modes of response

There are four basic modes of response that responders can use to help combat incidents:

  • Mitigate consequences
  • Break propagation paths
  • Terminate the source
  • Clean up after-effects

Mitigate consequences

This is really just coping with the situation as a whole, trying to treat any threat to the integrity of the system. This mode of response tends to occur where the system is deteriorating quickly and yet there is little information about why this is. In this mode the responder doesn’t have to know anything about the fault or failure itself; they can simply focus on the threats to the system.

Break propagation paths

In this mode responders are also directing the action at the symptoms. They still don’t have to know anything about the fault itself in order to respond in this mode. They simply work to keep the faults from propagating through the system by setting up some sort of block.

Terminate the source

This mode is to stop the failure itself from influencing the system. In this mode, you must know at least something about the source of the underlying problem. Take a pipe break, for example: if you’re able to understand that the break is what is causing a disturbance in the system, you’re able to patch it so there isn’t any more loss of contents.

Clean up after-effects

Once our systems have been disturbed, there are going to be effects that persist even after we stop the problem. In this mode, we’re taking action directed at resolving these. How big a job this is depends on how skillfully we were able to respond during the incident and how big the incident itself was.

On average, these modes will happen in this order, but not always. In some cases we may jump between modes as we learn more information or disprove our theories.

The same data can mean different things

Depending on the mental model the responder has in their head, the same data might mean drastically different things.

Take for example, a disk space monitor. If the person looking at this is thinking that the system is in a normal state, seeing a disk space alarm is going to signal to them that maybe there are evolving conditions that potentially can create an incident.

If they’re in a mindset of diagnosing some unexpected finding, then the alarm may help them confirm or deny the theories they are working on. Or if they don’t have a theory yet, the alarm might help them generate other ideas.

When to take action

When there are severe consequences to making mistakes in incident response, there is the question of when an intervention should be performed. Should you act on your best theory or should you wait for more data? Should you go look for more theories? There’s a point where we often must act even though we only have a partial understanding of the situation.

Cognitively, this question is a large burden that often isn’t considered. When incident responders are responsible for actions, the outcome is uncertain, and the potential consequences of a bad outcome are very negative, the cognitive burden is increased. This makes committing to a course of action more difficult.

We can keep this in mind when designing systems around monitoring and observability, creating systems that support responders in making decisions about when to act. This could help prevent “cognitive lockup”, where responders are unable to change their initial hypothesis even though new evidence is available.

Because events unfold over time, it is really important diagnostically to develop representations of the system that are able to show how it has changed over time.

Systems to help responders should track the resulting impact of the different factors at work, including automation and previous manual interventions, along with the faults themselves.
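As a rough illustration of what such a representation could look like, here is a small Python sketch that merges events from faults, automation, and manual interventions into a single timeline. It assumes each source already records its events in time order. The event data and the `merged_timeline` function are hypothetical, just to show the shape of the idea.

```python
import heapq

# Each source is a list of (seconds_since_start, source, description),
# already sorted by time. All entries are hypothetical.
faults = [(0, "fault", "db primary stops responding")]
automation = [
    (2, "automation", "clients begin retrying requests"),
    (9, "automation", "autoscaler adds 4 instances"),
]
manual = [(15, "manual", "responder enables read cache")]


def merged_timeline(*sources):
    """Merge time-sorted event lists into one time-ordered list."""
    return list(heapq.merge(*sources, key=lambda event: event[0]))


for t, source, description in merged_timeline(faults, automation, manual):
    print(f"t+{t:>3}s  {source:<10}  {description}")
```

Keeping the source label next to each event is the important part: it lets a responder see what the automation and other people have already done, alongside what the fault itself did.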

Knowing is not the same as recalling in the moment

Knowing about a possibility or failure mode of the system is not the same as being able to recall it in the moment. It’s possible for a responder to know of hypotheses for various system states in principle, but be unable to actually call them to mind. Research shows us that this recall is context cued, so we can potentially develop systems that help cue these hypotheses.
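A very simple version of such a cueing system might keep a catalogue of known failure modes and surface the ones whose typical symptoms overlap with what is being observed right now. The Python sketch below is only an illustration under that assumption. The catalogue contents, symptom names, and the `cue_hypotheses` function are all invented, and a real catalogue would come from a team's own incident history.

```python
# Hypothetical catalogue: failure mode -> symptoms that tend to accompany it.
KNOWN_FAILURE_MODES = {
    "database offline, clients retrying": {
        "network_link_saturated", "api_latency_high", "db_errors",
    },
    "disk filling on log volume": {"disk_space_low", "write_errors"},
    "cache stampede": {"api_latency_high", "cache_miss_spike"},
}


def cue_hypotheses(observed, catalogue=KNOWN_FAILURE_MODES):
    """Return failure modes sharing at least one symptom with `observed`,
    most overlapping first. This suggests candidates; it does not diagnose."""
    observed = set(observed)
    scored = [
        (len(symptoms & observed), name)
        for name, symptoms in catalogue.items()
        if symptoms & observed
    ]
    # Highest overlap first; ties broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored]


print(cue_hypotheses({"network_link_saturated", "api_latency_high"}))
```

The goal here is to jog a responder's memory with candidates worth considering, not to pick an answer for them.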

Takeaways:

  • Responders are taking action to gather more information, not just change system states
  • There are different modes of response people can operate in
  • Knowing something isn’t the same as remembering it in the moment, though cues can help
  • Seeing a timeline of events can help diagnosis
  • Data can mean different things depending on what mode we’re in
  • Knowing how people are actually working and thinking can help us design our systems to support this