Safe operation as a social construct

This is a paper by Gene Rochlin that picks up after the Berkeley High Reliability Organizational research and as he puts it:

The purpose of this paper is to set out the broader definition of safety that emerges from this work

Throughout the paper Rochlin covers several qualities that helped the groups be successful and focuses on those qualities that are socially constructed.

In interviews, operators expressed the idea of safety as a positive thing, not a negative one. That is, safety wasn’t the absence of errors, and it wasn’t just the avoidance of mistakes. This was true across a number of domains, including air traffic control and nuclear power plant operation. In social science they might call this a frame or a social construction.

But neither of those ideas from social science fit especially well with what the HRO research was showing. Neither included the “dynamic role of operators”. Further, much previous research, even work that explicitly said things like “we emphasize agency”, quickly fell back to Charles Perrow’s Normal Accident Theory, the idea that some accidents cannot be avoided due to the complexity of the underlying system.

Because of the differences between fields and specialties, defining safety in one industry’s terms and views didn’t make sense:

For some of the organizations, there were types or classes of error or accident that could not be tolerated at all. Others performed tasks for which the complete absence or suppression of errors and accidents was not possible; they were judged to be safe because the measurable and observable rates of errors or near-misses was so much lower than one would expect by comparing their performance with that typical of more conventional organizations

This means that the safety demands could change, not just industry to industry, but from task to task. This leads to a realization that:

The position of any operational group as an actor is therefore situational and dynamically constructed, and evaluations of safety are not easily connected to ‘objective’ measures of real world performance.

Rochlin suggests that one way to gauge the construction of safety and differentiate it from overconfidence (overconfidence, and allowing past success to dictate future behavior, being a key theme of HRO research) is to look at how error is attributed:

the lack of self-promotion or aggressive self-confidence is in itself a useful gauge, as is the desire to attribute errors or mistakes to the entire organization (including its structure) as well as to human action

Learning

Rochlin also explains that how (or even if) an organization strives to learn is critical to its ability to create safety:

But it can be difficult to assess whether or not the organization itself is learning (as opposed to just occasional individuals). Complicating the issue further, groups will learn different things at different times from the same events, depending on what signals the group is attuned to.

Since many of the fields studied were ones in which error could be catastrophic, teams also learned from near misses and even things that might have turned into a near miss.

Duality

The organizations studied by the HRO research each had “the ability simultaneously to maintain multiple representations of state and structure of operations.” These dualities played out in ways such as safety amid risk, or independent action in the face of very close regulation. Interestingly, when these dualities were pointed out to the operators during interviews, they explained they saw those frames as “reinforcing rather than standing in opposition”. Unless interviewers used specific framing and wording in their questions, the operators themselves did not even note the tension between the two things; they saw no organizational contradiction between them.

Further, the operators also indicated (as commercial pilots had) that safe operation depended on treating the operational environment as not just inherently risky; they went so far as to say that it needed to be treated as “actively hostile, one in which error will seek out the complacent.” Here again, the theme of confidence, and specifically managing its placement, arises.

Confidence in the equipment and the training of the crew does not diminish the need to remain alert for signs that a circumstance exists or is developing in which that confidence is erroneous or misplaced

Communication

Many of the domains the HRO research explored had almost constant communication across multiple channels, even during periods of less work or slower activity. This finding is consistent with other high consequence domains like space flight. Rochlin explains that this constant communication helps to maintain the integration of the collectivity and also reassure individuals that such collectivity is intact.

Though the communication structures were different in each field examined, all operators expressed the idea that safety emerged through their communications and interactions with each other. Beyond just keeping the groups integrated, every operator said in some way that free flowing information was important because experienced operators could often discern something critical from something that might at first seem trivial.

This is also consistent with the way that mission control works and matches some of my experience in EMS operations as well.

Myths and Heroes

Where the “locus of responsibility” lies in an organization, and how the organization thinks about individual action, shape the stories and myths that are created and that influence incentives and future behavior.

Broadly, there are two types of organizations: those that encourage “hero stories” and those that discourage them. Rochlin gives the examples of military and fire fighting organizations as those that encourage hero stories. Though in my experience, and that of those I’ve interviewed about this, this is changing and such hero stories are being discouraged. When “hero stories” are encouraged, the organization is encouraging “extraordinary individual performance.”

In contrast, those that discourage such stories, such as air traffic control, often use “hero” as a critique, describing an operator who acts without thinking of the group. Both views shape the culture in their own way and dictate how non-routine work is performed. In a hero culture, responsibility for non-routine tasks falls to the individual. The organization at large seeks heroes and nurtures them.

In an antihero culture, threats or non-routine work are to be analyzed and then made routine. In this view, heroic action itself is a threat. Here responsibility for the non-routine work is with the collective group. Rochlin explains that in both cases, the constructed narrative becomes one about the performance of the org, not the individual. This means that even when individuals are blamed, the groups interviewed still held the organization as a whole responsible as well as the individual.

This is an incredibly important idea for SRE and similar teams. If you lead or are starting an SRE team, this is something I urge you to consider. If you’re part of one now, you were likely able to quickly identify which one your group is. Rochlin doesn’t say one is necessarily better, but after working in different domains (both tech and medicine) that have exhibited both cultures, and interviewing many others in similar positions, I rarely find a team where the hero culture is ultimately beneficial for the team or the individuals.

As with these hero stories, Rochlin says that “safety is in some sense a story a group or organization tells about itself and relation to its task environment.” Rochlin also relates this to James Reason’s analogies around health: the absence of sickness is different from wellness, just as safety is different from a lack of error. In some ways, Rochlin says, wellness too is a story we tell ourselves about our relation to the world and others, and so it is with safety.

Takeaways

Safety is more than just avoiding error or managing risk.
The search for safety is not just a hunt for error.
Safe operation of a system is an “interactive, dynamic and communicative act”.
Previous views of safety, including ideas of “safety culture”, don’t tend to consider enough about individual and organizational agency.
Confidence, and how it’s managed or where it’s placed, played a key role in how operators create safety.
Whether an organization seeks heroes or discourages them influences the culture and the locus of responsibility.
“Safety is in some sense a story a group or organization tells about itself and relation to its task environment.”
