This is a very new paper from Sidney Dekker and Michael Tooma that just came out this year (2021 for those of you reading from the future). It gives an overview of why tracking injury metrics is a flawed approach that doesn’t lead to safer outcomes and can in fact distract attention from things that would help safety arise.
I think for us in software, this advice and the paper as a whole are especially useful if you replace the idea of injury rates with similar incident metrics, like MTTR.
If an organization focuses on metrics like injuries (or MTTR), it may be ignoring risks developing in other areas that are not apparent from that metric alone.
Building one’s understanding of safety mainly on injuries and incidents is akin to trying to understand how to have a happy, life-long marriage by only studying a few cases of divorce. It misses much of the interesting data.
Instead, the authors suggest a capacity index to capture a range of information, all of which contributes to safety. I’ll go over the index and then a bit about the capacities.
A capacity index
The capacity index tracks:
- the building of capacities in people so that things go well even under variable conditions (Know);
- the capacity to anticipate through risk competence and risk appreciation at all levels of the organization (Understand);
- the capacity to make resources available and goal conflicts visible (Resource);
- the capacity to monitor and identify issues through effective communication channels (Monitor);
- the capacity to assure the effectiveness of this monitoring (Comply);
- the capacity to learn from both failure and success (Verify).
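For those of us who think in code, here is one rough way the six capacities could be written down so that observations can be tallied against them. This is only a sketch: the `Capacity` labels come from the list above, but the `CapacityLog` class, its methods, and the example entries are my own hypothetical names, not something the authors specify.

```python
from dataclasses import dataclass, field
from enum import Enum


class Capacity(Enum):
    # Labels and short descriptions paraphrase the list above.
    KNOW = "build capacities in people so things go well under variable conditions"
    UNDERSTAND = "anticipate through risk competence and risk appreciation"
    RESOURCE = "make resources available and goal conflicts visible"
    MONITOR = "monitor and identify issues through effective communication channels"
    COMPLY = "assure the effectiveness of the monitoring"
    VERIFY = "learn from both failure and success"


@dataclass
class CapacityLog:
    """Hypothetical tally of observations per capacity for one period."""
    period: str
    counts: dict = field(default_factory=lambda: {c: 0 for c in Capacity})

    def record(self, capacity: Capacity, n: int = 1) -> None:
        # Add n observations (e.g. reviews held, issues raised) to a capacity.
        self.counts[capacity] += n

    def untracked(self) -> list:
        # Capacities with nothing recorded this period.
        return [c for c in Capacity if self.counts[c] == 0]


log = CapacityLog(period="2021-Q3")
log.record(Capacity.VERIFY, 2)  # e.g. two learning reviews held
log.record(Capacity.MONITOR)    # e.g. one issue raised through a reporting channel
print([c.name for c in log.untracked()])
```

Even something this small makes it obvious when a capacity has nothing recorded against it at all, which is a different kind of signal than a single incident number.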
Safety in Complex Systems
Safety in complex systems doesn’t arise from control of behavior, control of incident metrics, or even standardization. Instead, safety comes from variability in responses, from adaptation.
"Guided adaptations to local conditions and challenges is likely to generate greater safety improvements than greater centralized control will"
For more about guided adaptations, see my analysis here.
This relies on an organization understanding and growing its adaptive capacity so that it can respond to (previously) unknown disturbances.
Additionally, while it is impossible to prevent all future failure modes, it is possible to identify at least some system weaknesses. Chaos engineering and designing such experiments can be a path to this.
Capacity for Anticipation
For an organization to have the capacity to anticipate future failures, there must be some form of monitoring of threats.
If we say that anticipation involves pattern recognition, then that requires similarity between past and current (or future) events. Similarity strong enough to support good deductions may be impossible to find in complex systems.
There is also the risk that anticipating in this style can lead us to fixation errors. (More on fixation errors) Another form of anticipation is building scenarios and simulations, though this can be difficult. There is a danger that scenarios may be too optimistic about what an organization can control or its capabilities.
Goal conflicts and resourcing
Safety is a single goal amongst many in an organization. When there are many goals, there are inevitably goal conflicts. These conflicts work their way down through the entire organization and must ultimately be resolved by those at the sharp end doing the work.
If that sounds familiar, it may be because David Woods and Richard Cook give the same advice in order to move forward from error.
Understanding these conflicts and tradeoffs is crucial if safety is to be given proper resources.
Incident response authority and ability
The ability to respond to incidents is rarely created by a centralized response structure. Permission may be given by such a centralized place, but effective response comes from moving authority to those closest to the event, the ones responding to it and interacting with it.
Pre-written protocols cannot capture the range of adaptation that may be required, so it’s unlikely that narrow permission given in advance would help.
One way to expand adaptive capacity is to increase diversity among those who are making decisions. Another is to give people the authority to say "stop" to a process and to reward the use of that authority.
Learning from incidents
The capacity to respond to an incident is also affected by how an organization responds after an incident, especially how it treats those involved.
You can have learning or you can have sanctions, but not both at once.
The authors recommend a "restorative" process take place where everyone impacted by the incident can together figure out what to do to repair the damage. Restorative justice is about consequences and causes, not about judgement and sanctions.
The authors give a variety of ways that each of these capacities could be measured, like the number of insights from workers over some period of time. They suggest using a million hours worked as the period, but I think in most organizations something quarterly would probably be more familiar.
The same approach can be used to track the number of learning reviews, or the number of those worker ideas that get implemented.
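To make the two framings concrete, here is a minimal sketch of the arithmetic for turning quarterly counts into a rate per million hours worked. The function name, the metric names, and every figure below are made up for illustration. None of them come from the paper.

```python
def per_million_hours(count: int, hours_worked: float) -> float:
    """Normalize a raw count to a rate per one million hours worked."""
    if hours_worked <= 0:
        raise ValueError("hours_worked must be positive")
    return count * 1_000_000 / hours_worked


# Hypothetical counts for a single quarter.
quarter = {"worker_insights": 18, "learning_reviews": 4, "ideas_implemented": 6}

# Hypothetical exposure: roughly 250 people working 480 hours each in the quarter.
hours_worked = 250 * 480

rates = {name: per_million_hours(n, hours_worked) for name, n in quarter.items()}

for name, n in quarter.items():
    print(f"{name}: {n} this quarter, {rates[name]:.1f} per million hours worked")
```

The raw quarterly count is easier to talk about in a planning meeting, while the normalized rate is what lets you compare teams or periods with very different amounts of work.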
Ultimately, while this is something still in development, I think it works as a great alternative to the common incident metrics in place at many organizations, like MTTR and its ilk. It’s certainly an easier argument to make for most folks looking for change than "don’t do incident metrics at all."
- Metrics like injury rate don’t give insight into the potential for future disaster or other incidents.
- Focusing on narrow metrics like injury rates can draw focus away from other risks in an organization.
- Instead, metrics that encompass an organization’s capacity to learn from incidents and respond to future disturbances can be more useful.
- The capacity index tries to capture an organization’s capacity to build capacity itself, to anticipate risk, to identify goal conflicts, to monitor threats, and to learn from success as well as failure.
- Goal conflicts are inevitable whenever multiple competing goals are in play, which is why it’s so important for those conflicts to be visible. This can help safety to be given the resources needed.
- For incident response to be more effective, the people closest to the incident must have the authority to do the work.