Resilience Roundup - How Adaptive Systems Fail - Issue #34

Basic Patterns in How Adaptive Systems Fail

In this chapter, David Woods describes the 3 basic ways that adaptive systems fail.

Knowing these ways can help us recognize them in our own systems and organizations.


To help understand the failure modes, Woods gives examples from urban firefighting.

I really enjoyed this chapter, in part because it uses examples I can relate to: Incident Command in emergency services.

We’ve discussed these failure modes a bit before, but it’s worth digging into them and really understanding them.

The 3 modes are:

  • Decompensation
  • Working at cross purposes
  • Getting stuck in outdated behaviors

Each of these has some variants as well.

Decompensation

This failure occurs in two parts. First, something (a person or some automation, for example) compensates for a disturbance. Then, as the disturbance grows, it can no longer keep up. Its ability to continue adapting is exhausted, and the system, or at least the affected portion or metric, collapses.

This failure is especially troublesome with automation that doesn’t communicate that it’s spending increasing resources (essentially, effort) to compensate. When that capacity is finally exhausted, the decompensation comes as a surprise.
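To make the “silent compensation” problem concrete, here is a toy Python sketch. Everything in it is a hypothetical illustration rather than anything from Woods: a compensator with an assumed fixed `capacity` cancels as much of a growing disturbance as it can, so the error an outside observer sees stays at zero until the capacity runs out, and then it jumps. Reporting the effort being spent, here flagged at an arbitrary 80% of capacity, gives a warning that the error signal alone never does.

```python
def simulate(disturbances, capacity):
    """Hypothetical compensator: for each (non-negative) disturbance, cancel as
    much as the fixed capacity allows. Returns (visible_error, effort) per tick."""
    history = []
    for d in disturbances:
        effort = min(d, capacity)    # compensate, but only up to the limit
        visible_error = d - effort   # the part an outside observer actually sees
        history.append((visible_error, effort))
    return history


# Assumed values, chosen only to show the shape of the failure.
disturbances = [0, 2, 4, 6, 8, 10, 12]
capacity = 9
WARN_FRACTION = 0.8  # arbitrary "near exhaustion" threshold

for tick, (err, effort) in enumerate(simulate(disturbances, capacity)):
    warning = "  <- near exhaustion" if effort / capacity >= WARN_FRACTION else ""
    print(f"tick {tick}: error={err} effort={effort}/{capacity}{warning}")
```

Watching only `error`, everything looks fine through tick 4 and then degrades suddenly. Watching `effort`, the approach to exhaustion is visible the whole way up.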

A very recognizable form of this pattern is “falling behind the tempo of operations.” One example is what we talked about last week in Being Bumpable, where there was a sudden crunch period for ICU beds.

Working at cross purposes

In this mode, adaptations and optimizations in one area end up hindering another. The areas could be different teams, different groups, or even different levels of an organization.

When we talk about this in terms of organizational levels, we often think of it as top down: something at the blunt end making work difficult at the sharp end. But it can happen in any direction.

An example of this happening in a bottom up way could be a workaround at the sharp end that doesn’t take into account broader goals or constraints.

Woods uses an example from firefighting, where one team opened a window for ventilation near a fire escape, which ended up blocking that egress path for a crew on the floor above.

In firefighting, and in any incident management framework that follows the FEMA ICS model (including ones used in software; PagerDuty, for example, took most of its guidance from it), preventing and resolving this failure mode is the responsibility of the Incident Commander role.

It is critical that the Incident Commander be able to assess the situation continually throughout the response.

Getting stuck in outdated behaviors

This one is pretty much what it sounds like. This mode occurs when we are no longer updating our approach as the challenges change.

This can happen when strategies that were successful in the past no longer fit because something about the environment has changed. Instead of changing their behavior to match, organizations and teams can fall into the trap of continuing to do what worked before.

This can sometimes be complicated by uncertainty in the operating environment or because of a high cost or difficulty in re-planning.

Some variations or “sub-patterns” here are:

  • Oversimplifications
  • Failing to revise current assessment as new evidence comes in
  • Failing to revise plan in progress when disruptions/opportunities arise
  • Discounting discrepant evidence
  • Literal mindedness
  • Distancing through differencing
  • Cook’s cycle of error

What to do with this information

Once we know and understand the ways that adaptive systems fail, we can look at our own systems and try to see where they might be vulnerable to these failures. Knowing these patterns can also help us understand other system failures we might learn from, such as those described in accident reports or case studies.

It’s important to note that whether something is well adapted, under adapted, or even maladaptive depends on your perspective. As we touched on before, something that seems maladaptive to you may be an adaptation that is effective elsewhere.

Because these labels depend on your perspective in the system, and because all systems face trade-offs, we can conclude that every adaptive system is, all at once:

  • Well adapted in some areas
  • Under adapted, with room to improve, in other areas
  • Maladapted to new challenges to its normal function

Woods calls this a basic property of adaptive systems and tells us that, because of it, any form of linear causal analysis is inadequate for modeling or predicting the behavior of those systems.

Because the Incident Commander is responsible for continually assessing the situation, as described above, that role can also help prevent or resolve this.

Takeaways

  • There are three ways that adaptive systems fail, with many sub-patterns for each one.
    • Decompensation
    • Working at cross purposes
    • Getting stuck in outdated behaviors
  • Understanding these patterns lets us examine our own systems and try to recognize them before failure occurs.
  • Being able to recognize these patterns also helps to evaluate other incidents or incident reports.
  • What is well adapted or under adapted will vary depending on your perspective.
    • This means that a system can be said to be well adapted, under adapted, and maladapted at the same time.
