Building and revising adaptive capacity sharing for technical incident response: A case of resilience engineering
This week we are taking a look at a paper by Richard Cook and Beth Adele Long. In it they study how an on-call team was able to share capacity.
This case study highlights the engineering part of resilience engineering. As the authors point out, there are some unique aspects of their setup that make it work. There’s some generalizable information we can take from the case, though.
As the authors explain:
“the frequent but irregular drumbeat of operational incidents stresses the organization’s capacity to adapt.”
This meant that many incidents were taken care of by the normal on-call teams, but every so often some disturbance would be especially difficult to troubleshoot. For these cases, a new team of eight engineers and managers was formed as a special support group. One member of the team would join to help the response when some sort of threshold was met.
This worked for a while, but there were some issues. Sometimes the initial responders wouldn’t call for help. This occurred for a few reasons: sometimes they were distracted, and sometimes they thought they had it under control. To help with this, a program was written to automatically message the “on-call expert”.
This also had the benefit of not requiring the initial response team to figure out when it was “ok” to ask for help. They didn’t mention it in the paper, but I imagine the fact that this help was welcome and effective is a credit to the team’s approach to working together.
This approach also allowed others to be less distracted by incidents because they knew that a member of the group would respond if needed.
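The paper doesn’t describe how that program decided when to send its message, so here is only a rough sketch of what an automatic escalation check could look like. Everything in it is an assumption for illustration: the `Incident` fields, the 30-minute and severity thresholds, and the `notify` callback are all hypothetical and not taken from the authors’ setup.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Incident:
    # Hypothetical fields; the paper does not specify an incident schema.
    id: str
    minutes_open: int
    severity: int  # assumed scale: 1 is most severe
    expert_paged: bool = False


def should_page_expert(incident: Incident,
                       max_minutes: int = 30,
                       severity_cutoff: int = 2) -> bool:
    """Return True when an assumed threshold has been crossed."""
    if incident.expert_paged:
        # Never message the expert twice for the same incident.
        return False
    return (incident.minutes_open >= max_minutes
            or incident.severity <= severity_cutoff)


def escalate_if_needed(incident: Incident,
                       notify: Callable[[str], None]) -> bool:
    """Message the on-call expert without waiting for responders to ask."""
    if not should_page_expert(incident):
        return False
    notify(f"Incident {incident.id} crossed the escalation threshold; "
           "please join the response.")
    incident.expert_paged = True
    return True


# Usage: collect messages in a list instead of calling a real pager.
sent = []
escalate_if_needed(Incident("INC-1", minutes_open=45, severity=3), sent.append)
```

The point of a sketch like this is that the decision to ask for help lives in the tooling rather than with the initial responders, which matches the benefit described above.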
If you’ve ever been on-call I’m certain you know what happened next. The team was successful, but this meant more work in addition to their regular workload, as happens with many successful systems. As a result some people left the team.
The company then worked to make joining the team more attractive and more sustainable. They did this by making team membership required for career advancement and providing financial incentive. Eventually rotations were created so that others could join the team and have other members rotate off. This helped ensure that the team continued to have capacity to lend to others. Other measures were taken to help preserve capacity as well, such as lowering the normal work that the responders were responsible for.
The authors point out that there is a strange paradox in being an incident responder in the world of software. Because most disturbances are routine and because they may be handled by frontline personnel who limit the immediate consequences:
This has the paradoxical effect of making impactful events more difficult to diagnose and repair. A result is that the next incident is likely to be easily dealt with but may be challenging or even pose an existential threat.
Also, because incidents differ from each other, the resources marshaled for one may not fit another:
Being able to modulate the response to match the need is important. ‘Ordinary’ events can be handled by the regular responders but the response to larger events benefits from additional expertise. To respond effectively and efficiently to the variable, unpredictable challenge pattern requires the capacity to adapt.
There are some special things about the environment that this team operated in that helped it be successful:
- Incident rate: There weren’t so many incidents (approximately a dozen a week) that the team was overwhelmed. On the other hand, if there were very few incidents, the sharing of adaptive capacity would happen rarely and might be more cumbersome, since getting up to speed and learning the process would be slower.
- Incident length: Here again there is a bit of a Goldilocks element. If incidents were really short, there wouldn’t really be a need for a response from this team. If incidents lasted a very long time, the team sharing the adaptive capacity might be harmed by it and be hesitant to do so in the future. Incidents in this case tend to last “minutes to hours”, with few lasting more than one day.
- Incident severity: As with length, if all incidents were small, there wouldn’t be much need for the team or for the sharing of capacity. If the incidents were always very severe, specialized and dedicated response teams (as opposed to capacity sharing) would likely be formed instead.
Also, the group has some amount of knowledge in common, or else they wouldn’t be able to have their own rotation. If they were all very specialized experts or were at very different levels of knowledge, skill, or experience, this sharing wouldn’t be as effective. If that were the case, the time it would take to ramp up when an individual responded would likely be more of a hindrance than a help to the receiving team.
While this case doesn’t mean that you have to copy all or really any of their patterns, knowing about what makes it work and what challenges they have can help you find opportunities or avoid pitfalls in your own organization.
- The borrowing and sharing of adaptive capacity is a form of resilience engineering.
- This is consistent with David Woods' theory of graceful extensibility.
- Not all incidents are alike of course, so matching the response to the needs is required. One way of doing this is by having a source of adaptive capacity to borrow from.
- This helps us see the engineering part of resilience engineering, not just the resilience part.
- There are conditions that are fruitful for such engineering to take place. Other conditions may help too, but the absence of some may make certain forms of engineering untenable.
- This includes things like rate of incident, incident severity and knowledge amongst responding team members.