There's no discussion group this week. It’ll be on pause for a bit as I’ll be traveling for some conferences. Don’t worry, I’ll make sure to send an email before the next session. Thanks to Tanner Lund for bringing this paper to my attention.
This is a study done by Makoto Takahashi, Daisuke Karikawa, Genta Sawasato and Yoshitaka Hoshii and presented at the REA symposium last year.
On the surface, this may not seem like it relates to us in software, but I think this experiment and its findings are very relevant and can help us revisit how we write documentation and things like runbooks.
They created a complex environment, specifically a simulated power grid. Operators have to keep the grid at an ideal voltage, balancing 4 different power inputs (batteries, solar, wind, hydrogen). The primary goal is to avoid a blackout. The secondary goal is to keep the grid stable, as close to the ideal voltage as possible. Additionally, operators have to perform inspections of the power sources (by clicking a button). An inspection shuts down the given source, but if a source isn’t inspected for a period, it fails.
40 people who had been screened to try and make sure they had approximately the same level of ability with the interface and procedures were randomly divided into two groups of 20 each, A and B.
Both groups were given a basic manual that gave an overview of the “minimal necessary information” for completing a task. This included things like what each source was rated for and under what conditions an emergency stop should be performed. Both groups received brief training on performing an operation.
Group B was given another manual that essentially consisted of flow charts. This is the “procedure manual” that tells exactly what to do in many situations that were abnormal but expected. This included things like one source not being available (e.g. heavy rain so no solar) or a battery failure. Group B had a training session where the importance of following the procedure was emphasized (sound familiar?).
Each group then went through some practice sessions for normal events, then the abnormal but expected events. Group B, though, had someone watching, and if the procedure wasn’t followed, they were warned to follow it. Neither group was told about some other events that would occur during their evaluation: the unexpected, abnormal events.
Here’s where it gets interesting.
Group A had more blackouts in the normal and expected abnormal situations than Group B, but had fewer in the unexpected abnormal situations. It seems that the group with the procedures, Group B, wasted time and experienced more frustration in unexpected situations by looking for a matching procedure first. When no procedure covered the situation, they then had to use their own judgement, but by then it may have been too late to effectively intervene.
Group A, who got to use their own judgement, also kept the grid more stable than Group B. Group B had a procedure that said if the grid voltage was above a certain amount or below a certain amount, they would intervene in a particular way. By being able to exercise judgement, Group A was able to make interventions that reduced the variance of the grid power over time.
They also measured some things like frustration and time pressure in the participants. These weren’t markedly different between the groups, but Group B did experience increased frustration during the normal and expected operations.
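To make the stability point concrete, here is a toy sketch comparing a threshold-triggered rule with small continuous adjustments. It is not the study's simulator: the disturbance model, the band, the step size, the gain, and all function names are assumptions chosen purely for illustration.

```python
import random
import statistics


def simulate(policy, steps=2000, seed=42):
    """Return the variance of the voltage deviation under a given policy."""
    rng = random.Random(seed)  # same disturbance sequence for every policy
    deviation = 0.0            # distance from the ideal voltage (hypothetical units)
    history = []
    for _ in range(steps):
        deviation += rng.uniform(-1.0, 1.0)  # random supply/load disturbance
        deviation += policy(deviation)       # operator's correction this step
        history.append(deviation)
    return statistics.pvariance(history)


def threshold_procedure(deviation, band=3.0, step=2.0):
    """Procedure-style rule: do nothing inside the band, fixed correction outside it."""
    if deviation > band:
        return -step
    if deviation < -band:
        return step
    return 0.0


def continuous_judgement(deviation, gain=0.5):
    """Judgement-style rule: always nudge part of the way back toward the ideal."""
    return -gain * deviation


procedure_variance = simulate(threshold_procedure)
judgement_variance = simulate(continuous_judgement)
print(f"threshold procedure variance:  {procedure_variance:.2f}")
print(f"continuous judgement variance: {judgement_variance:.2f}")
```

In this toy model the threshold rule lets the deviation wander anywhere inside the band before acting, so its variance comes out larger than that of the rule that corrects a little on every step. That mirrors the shape of the result described above, under these made-up parameters only.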
What we do
I think this study and this line of thinking are a great way for us to approach how we write procedures ourselves. In ops and SRE teams, this often looks like runbooks or similar documentation. I don’t believe that the two groups in the study represent all the options. We can continue to document how to do some small things while also encouraging people to exercise judgement.
Also, the documentation or training material we write can address more of the overall picture and the background, similar to what Group A experienced in the study, a broad overview of what interventions were available to them.
- This study can help inform how, or perhaps even if, we write runbooks or similar procedures.
- Being encouraged to follow procedures instead of using judgement for troubleshooting can make unexpected situations that the procedure doesn’t account for more difficult.
- Procedures can reduce individual variation in performance.
- Being allowed to use one’s own judgement can increase performance in expected situations, but can also increase frustration in some situations.