The title is a bit of a mouthful, which can disguise that this article by David Woods and Lawrence Shattuck is a great reminder of some things to consider when setting policy for SRE or ops teams. They give air traffic control and military command and control as their examples of distributed supervisory control, but the lessons could as easily have been written for Directors or Engineering Managers trying to set up SRE or guide ops teams.
“Distributed supervisory control systems are hierarchical and cooperative. They include remote supervisors who work through intelligent local actors to control some process.”
This means that if you are someone who provides supervision, designs larger systems, or even just writes procedures, then you can be thought of as a “remote supervisor”. That person likely has a different viewpoint than those on the front lines, with a broader set of concerns, or at least visibility into more of the organization’s concerns.
At the same time, those on the ground have more access to whatever the “monitored process” is, whether that’s keeping a website up, responding to an incident, etc. While they have increased access, they likely have a narrower view, or at least narrower concerns.
Though those writing the plans and procedures may have a larger organizational view, the plans are still not enough (on their own at least) to deal with the potential for surprise in all situations that could arise. As a result, those on the front line, the local actors, must adapt those plans to the situation they’re facing, based on their understanding of the intent of the plan or procedure.
Previous studies of nuclear power plant operations by Woods found that:
“good operations require more than rote rule following”
That holds true for trying to create or lead SRE teams as well. Two different types of failures can arise when trying to follow a procedure that has a lot of moving parts and depends heavily on what is actually happening at the time.
- “Type A” problems, where rules are followed by rote, even though the circumstances have changed and require adaptation.
- “Type B” problems, where adaptation is tried, but the knowledge or broader view needed to meet recovery goals isn’t available. This class of problem often includes being unaware of side effects of changes.
Though not easy, there needs to be a “cooperative balance” between the remote supervisors and the local actors. Those on the front line need to have knowledge and authority to respond to surprises in ways that help achieve higher level goals.
This was seen as well in space shuttle mission control, where there is a tension between contingency planning and the unique situations that arise in anomalies that couldn’t have been specifically planned for. When an anomaly or incident occurs, whether in space operations or in websites, the disruption tends to leave those on the front line with similar questions when examining procedures.
They ask things like:
- Is the given plan relevant to the situation they now face?
- What was the intent behind a rule or policy?
- What assumptions were made when it was set?
Looking at mission control showed that the teams considered the implications of the anomaly for future missions and cooperated across teams. They would revise plans and develop new contingencies, with the original plans and procedures used as a resource.
Ultimately there is a tradeoff that must be made between remote supervision and local action. The authors call skill at making this tradeoff the “resilience function” of a distributed system.
They conclude that:
“Supervisors and the larger organizational context must determine the latitude or flexibility they will give actors to adapt plans and procedures to local situations given the potential for surprise in that field of activity”
If centralized control is chosen that limits the ability of those on the front line to adapt, then Type A problems are more likely to occur. On the other hand, if the front line has no guidance at all and complete autonomy, then other goals and constraints will be invisible, coordination may break down across teams, and Type B problems will be more likely.
“The standard reaction to accidents like this is for stakeholders biased by hindsight to miss completely the underlying tradeoff and resilience function.”
I’m sure we’ve all seen this. As a leader it can be tempting to follow organizational pressure and resort to policies and postures where more rule following is seen as the answer. But what really happens is that the tradeoff is simply weighted away from adaptation towards rule following, increasing the risk of Type A failures instead of Type B.
“It is critical to see that this organizational response does not enhance skill at handling the inherent tradeoff—resilience”
Not only does it not enhance skill, but it can create a double bind for responders on the front line: if they aren’t able to adapt to handle surprise, they get blamed, but if they do try to adapt and it doesn’t work, they also get blamed.
Instead, organizations can seek to understand and explore that tradeoff. They can find mechanisms to reveal larger goals and constraints so that those can be considered when adapting.
High reliability organization research tells us that organizations can also anticipate and plan for possible failure, holding “the continuing expectation of future surprise” instead of taking past success as an indicator of future success.
- The amount of potential surprise in a given type of work causes the plans and procedures for it to be underspecified.
- Organizational reaction to this reality can create issues for those at the frontline.
- Anyone who writes procedures or designs systems for others can be seen as a “remote supervisor” in this framework.
- Remote supervisors tend to have a broader view of the organization and its goals.
- In contrast, those on the ground have direct access to what is actually happening, but a narrower view.
- Two types of problems can arise with this setup:
- Type A problems, where rules are followed even though adaptation is needed.
- Type B problems, where adaptations are tried, but knowledge of side effects or other information is unavailable, so larger goals cannot be addressed.
- There is an inherent tradeoff between remote supervision and local action as it pertains to adaptation. Skill at making that tradeoff is the “resilience function” of a distributed system.