Resilience Roundup - Managing the Hidden Costs of Coordination - Issue #67

If you missed the discussion last week, it’s not too late to sign up. Hope to see you there Friday, February 14th at 0900 PST (1700 UTC).


Managing the Hidden Costs of Coordination

This is a paper by Laura Maguire. It comes from an interesting issue of ACM Queue in which all of the articles are related to resilience engineering and are from authors that you may recognize. Also, unlike a lot of the material I analyze here, the author is specifically looking at what we do in software, in this case lowering or managing the costs of coordination. We’ve talked a bit about this previously, but Maguire goes into more depth, specifically addressing some problems that we as responders and maintainers of software systems run into.

Maguire talks about the hidden costs, and that’s because the cost of coordination in periods where things are slower can seem invisible. They’re easily accounted for, and as the pace of operations ramps up, like during an incident, there’s likely a period where experts (such as you, dear reader) are managing well. So it can seem as if the costs aren’t there. Of course, a point is reached where we can no longer pay those costs and coordination begins to break down. When it starts to break down, these costs become more visible.

Some signs that coordination may be breaking down include:

  • “Difficulties in synchronizing activities
  • disruptions to the smooth flow of task sequences
  • conversation explicitly aimed at trying to organize multiple parties”

To start with, Maguire points out that this is something that can be actively designed for or not, and that these high costs can often result from using tooling that wasn’t designed to account for joint activity. I think this perspective, that we can design systems to help coordination and to help control costs, is important. She mentions ChatOps in particular. It’s used often and can be helpful in a lot of ways, but as she puts it:

“Poor design renders ChatOps nearly useless as a tool for sensemaking as people come into an evolving and increasingly pressured situation”

This is because they end up faced with a wall of text.

Maguire critiques other ways in which these high costs are usually addressed in software teams, including incident response frameworks such as the Incident Command System (ICS). I believe she has some valid concerns about it. I also think that those concerns can often stem from misuse of the framework. As she points out, one of the ways of handling coordination is to have an incident commander. The downside is that the incident commander can seem to have two jobs. Maguire calls this working both in and on the incident: the incident commander keeps track of details of the incident so that they can anticipate needed action in the future, but if they spend too much time or effort trying to get a good assessment of the situation, that can pull them away from facilitating coordination between the other roles. In general, she points out that centrally managing who does what typically leads to a response that is slower than the pace of the incident, so the responders fall behind. This makes it harder to catch up with what’s really going on and harder for everyone to get on the same page.

Another approach she addresses is what she calls “enforcing operational discipline” to follow the ICS. If you’ll be at SREcon America West, you’ll hear me talk a bit about this. I do not believe that ICS should be used to enforce operational discipline in the sense that Maguire seems to be addressing. I certainly don’t contest that it can be used in that way. For example, she points out that one adaptation as the costs of coordination get high is that groups will split off and form their own Slack channels, sometimes away from the main incident channel, as a way of lowering the costs and synchronizing their work with the relevant people. Some people will tell you that this does not follow ICS, but I strongly disagree with this notion. In fact, most communication systems used by agencies that use ICS at various scales (fire departments, EMS, etc.) typically have their own TAC channels. Sure, there is one main channel. We might think of that as our incident channel, where most or all responders are tuned in. They typically talk to dispatch and dispatch talks to them, but very, very rarely will you see a setup like this where there aren’t also channels to talk to other responders in the area. As Maguire puts it:

“rather than forcing responders to bear significant attentional and workload costs, it is advisable to facilitate shifting various lines of work to subgroups while supporting connecting the progress or difficulties into the larger flow of the response.”

That’s not to say that ICS is the best or only choice. Unfortunately it became the mandated choice after September 11th for many agencies. Because it has so much training material behind it and is in use by agencies that deal with high-consequence domains, it has been somewhat adopted in software as well. This is all fine and good, but the issue is that development sort of stopped at this point in many places. And I think that’s a bit of what Maguire might be seeing or getting at. There are many environments where, just like with other policies, the gap between work as imagined and work as done is ignored and responders are encouraged to just follow the policy better or learn the procedure better. We of course know that that doesn’t work. She also points out that even emergency services are realizing that ICS has a lot of limitations, again because it has sort of been frozen in time, and as a result there are attempts at something more flexible.

So what to do instead? Well, to start, understand that people will make adaptations to manage the costs of coordination. Instead of trying to beat people about the head with policy around that, look for ways to support it. Perhaps also investigate why it might be an issue. An objection I hear sometimes is that it can make things difficult to investigate in some sort of post-incident review. Sure, I can see that having some extra channels would do that, but with most chat software you can export various channels and the messages are timestamped. Fortunately, as people who build software systems, it’s quite likely that if we need to do something with the data, we have the resources available to us.

I find it a bit ironic that there is pushback on this for the sake of post-incident review. It somewhat mirrors the idea, again, of humans adapting to the larger system as opposed to it adapting to them. In this case the system is policy as opposed to a machine. But instead of using tooling or something else to correlate data when things are at lower speed after the incident is over, I’ve often seen pressure to change the way of working.
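As a rough illustration of how little tooling that correlation can take, here is a minimal Python sketch that merges several exported channels into one timeline. It assumes each channel has already been exported as a list of messages with an ISO 8601 timestamp, a user, and some text. The field names ("ts", "user", "text"), the channel names, and the sample messages are all hypothetical and do not reflect any particular chat product’s export format.

```python
from datetime import datetime


def merge_channel_exports(exports):
    """Merge per-channel message lists into one timeline ordered by timestamp.

    `exports` maps a channel name to a list of message dicts using the
    hypothetical keys "ts" (ISO 8601 string), "user", and "text".
    """
    timeline = []
    for channel, messages in exports.items():
        for msg in messages:
            when = datetime.fromisoformat(msg["ts"])
            timeline.append((when, channel, msg["user"], msg["text"]))
    # list.sort is stable, so messages with identical timestamps keep
    # the order in which they were read from the exports.
    timeline.sort(key=lambda entry: entry[0])
    return timeline


# Made-up sample data: a main incident channel and one subgroup channel.
exports = {
    "incident-main": [
        {"ts": "2020-02-10T17:00:05+00:00", "user": "ic", "text": "Declaring an incident"},
        {"ts": "2020-02-10T17:04:30+00:00", "user": "ic", "text": "Status update?"},
    ],
    "db-subgroup": [
        {"ts": "2020-02-10T17:02:10+00:00", "user": "dba", "text": "Replica lag is climbing"},
    ],
}

for when, channel, user, text in merge_channel_exports(exports):
    print(f"{when.isoformat()} #{channel} <{user}> {text}")
```

The point of the sketch is only that interleaving side channels with the main channel can be done after the fact, at low speed, rather than by asking responders to change how they work during the incident.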

With this in mind, we can also begin to look at incident response plans and practices from this view. We can ask of a plan: would it, or does it, hold up when things are hectic and fast paced? If not, then something else will likely need to be found, and chances are something else is already being found by individual responders or by groups as they adapt out of necessity. I can also pretty much guarantee that if you review your incident response plan and it is very specific and very detailed to a very deep level about what responders should and shouldn’t do, it does not hold up during fast-paced incidents.

I’ve talked a bit about different views on how teams reorganize as time pressure increases, and that they typically don’t follow a rigid command and control structure. I’ve also talked about how someone who cares about and wants to improve the safety of the system might influence adaptations. Maguire points out something interesting that is a little different from what I think the common narrative around incident response is, namely that it is helpful to learn from physical-world emergency services. I of course am quite biased toward that view, but we’re also reaching a point where our critical digital infrastructure, or CDI as she calls it, is moving faster and can have large-scale incidents with higher costs of coordination than the kind we typically see in various emergency services.

I think some of the traditional narrative comes from a place of devaluing or downplaying the role of responders in software systems. There is a trend in software to say, “well, no one’s going to die,” which for one we don’t actually know, but it also doesn’t address the magnitude: the number of people, the number of systems, the number of countries, communities, and so forth that are affected.

I get that it can feel very different to see a very large fire on the news, and it can be hard to see much relation to that while sitting in an office managing an incident. But in software our incidents are typically going to move much faster, and a lot of the cognitive work is going to be the same.

She doesn’t have specific advice on exactly what to change about your tooling, which I think is good since teams are using different ones. But I think it’s important to consider whether or not the tooling you’re choosing or designing keeps us in mind at all. She suggests a few places to start:

  1. assessing coordination strategies relative to the cognitive demands of the incident;
  2. recognizing when adaptations represent a tension between multiple competing demands (coordination and cognitive work) and seeking to understand them better rather than unilaterally eliminating them;
  3. widening the lens to study the joint cognition system (integration of human-machine capabilities) as the unit of analysis; and
  4. viewing joint activity as an opportunity for enabling reciprocity across inter- and intra-organizational boundaries.

Takeaways

  • Coordination doesn’t come for free
  • Costs of coordination are not always obvious because experts manage to fill the gaps
  • It’s typically not until coordination breaks down that costs are readily apparent
  • Some signs that costs are rising or not being managed include:
    • “Difficulties in synchronizing activities
    • disruptions to the smooth flow of task sequences
    • conversation explicitly aimed at trying to organize multiple parties”
  • Incident response frameworks need to be flexible to allow adaptation
  • Response plans and frameworks that are inflexible can cause responders to fall behind the pace of the incident.
  • Costs of coordination can be considered while designing tools, so that during fast-paced periods the tools help rather than hurt.

Discussion

Last week we had some great discussion around questions like:

  • What should we do when research suggests futures or actions that seem unrealistic?
  • How can we take action today on improving our automation?
    • What are some tools that are especially good or bad at directing attention?
  • What are some other ways of directing attention available to automation?

If you want to be part of the discussion for this paper, make sure you’re signed up. I’ll send you a meeting invite for this Friday, February 14th at 0900 PST (1700 UTC).

