Learning From Organizational Incidents: Resilience Engineering for High-Risk Process Environments

This week I have an article from Process Safety Progress, a publication of the American Institute of Chemical Engineers, by Stefanie Huber, Ivette van Wijgerden, Arjan de Witt, and Sidney W.A. Dekker.

In this paper, the authors performed what they call a “resilience engineering safety audit” at an anonymous chemical company. Though the company is anonymous, we do know a little about it: it has more than 300 employees and was founded more than 100 years ago.

I found this really interesting since they started from the idea that research supports resilience engineering approaches as a way to improve safety, but they also wanted to apply those approaches in practice themselves, and chose a high-consequence domain in which to do it.


This audit took place over four days and initially involved interviewing operators and middle managers across different plants on the same larger site. The company ordered this audit because they wanted to see if there were areas in which they could improve safety.

They’d previously experienced very high levels of safety with very few incidents. So the question they were hoping to answer as a company was whether there was still room to improve; they also wanted some outside perspective on their current processes.

The researchers developed a questionnaire for their interviews based on different indicators that research has shown to be markers of organizational resilience. There were six primary dimensions:

  1. Top-level commitment (e.g. Do you think your boss appreciates your work?)
  2. Just culture (e.g. Do you feel comfortable reporting safety issues/problems to your boss?)
  3. Learning culture (e.g. Do you feel the discussion about risk is kept alive in your company?)
  4. Awareness and opacity (e.g. Do you know the major safety concerns the company has to deal with?)
  5. Preparedness (e.g. Do you feel ahead of upcoming problems?)
  6. Flexibility (e.g. Do you have any slack resources available to cope with sudden trouble?)

Then they analyzed the data, correlated it across the interviews, and looked for patterns. Some of the questions they hoped to answer included:

  • How could these phenomena have been generated?
  • What underlying principles can be derived from these findings?
  • What common principles do these results identify?

This was in order to get at the “second stories” (see Issue 17 for more on second stories). The article doesn’t include the survey results directly, but rather the patterns that emerged and some of the answers to those questions, which I think is more useful.

As one might expect, the operators and managers were making safety-versus-production trade-offs. Both operators and managers identified safety and production as the two most important goals and, as is typical of many industries and organizations, felt the need to meet both simultaneously.

Also, as is common, this conflict was not resolved at a high level; instead it got pushed down to the sharp end of the organization, forcing operators to resolve the trade-off over and over. As a result, even though there’d been no large-scale incident for some time, there were a number of incidents identified as small that got neglected and normalized as part of the work.

One way this materialized was in operators not using their safety gear in order to save time in production. This is what Erik Hollnagel would label an efficiency-thoroughness trade-off (ETTO). Despite this, people at multiple levels of the company said that “safety is always first.”

As Hollnagel said, “If anything is unreasonable, it is the requirement to be efficient and thorough at the same time”. This needs to be resolved by making those sacrificing decisions in advance: if safety is to be first, put it first and relax the pressure for efficiency. This is the domain of upper management; they can mandate it. That seems simple enough on paper, but the authors dug further to try to discover why it was so difficult.

One thing that makes it difficult is when external pressure is internalized by the operators. As you move farther down the organizational hierarchy, there are more and more goals that must be accounted for. The authors discovered that front-line operators internalized these goals and conflicts, trying to do their best to resolve them.

The authors determined that for them to be successful and still put safety first, “every level in the plant and company must recognize the hazards of these external pressures and seriously internalize safety first.”

Another thing that contributes to this difficulty is normalizing risk as part of the job. In this study, many employees said that accidents happened anywhere from 0 to 5 times a year, but at the same time, almost everyone said that small accidents or incidents were happening all the time.

The operators in this company had normalized risk to such a degree that things like getting burned or getting acid in their eyes counted to them as only minor incidents. I should note that, of course, I know nothing about the work of a chemical plant operator, but those sound like pretty big things to me. The authors included an anonymized quote from an operator:

I have just experienced some small incidents. Scalding from hot steam or hot water are typical incidents that often happen. I was involved in a burning incident.

It was also difficult for the managers who could help prioritize safety to be aware of these incidents, since they were rarely reported. The operators didn’t seem to consider them incidents and actually associated them with normal work. Because these events happened so often, what counted as an incident or accident was continually redefined, and the events became normal.

This even extended to things like the fire alarm. Employees said that if the fire alarm went off, nothing would happen. No one would leave; they were so used to it going off in error (at least once a month) because of steam leaks. So not only was the risk itself being normalized, but in this environment it was also normal not to follow the procedure.

This is especially understandable considering that this company had 400 procedures, plus anywhere from 10 to 30 more in each plant. On top of that, there were situations that made it impossible to follow the manuals, for example when an operator had to run three different production lines by himself.

An operator summed up well the problem of relying solely on procedures to create safety boundaries: “instructions do not cover all issues—sometimes they are very general, sometimes they are too detailed and you can’t follow them”. Since there will always be a gap between procedures as written and actual practice, the authors suggest not treating the adaptations people make to create safety as violations, but instead looking at closing that gap as something that makes safety. One could focus on making the gap apparent and creating a structure for learning and adapting.

The authors also suggest a “management of change” (MOC) process. This would prevent procedures from being rewritten in a way that could be unsafe. It does not address procedure following; instead, it would be a technical review to make sure that suggested changes were themselves safe. Much like infrastructure as code, the MOC itself would be a procedure kept up to date in the same way, with a change log preserving old procedures and old MOCs.
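Since the paper describes this only at the process level, here’s a minimal sketch in Python of what a version-controlled procedure store with a required MOC review might look like. All of the names here (`ProcedureRepository`, `MOCReview`, the example procedure) are my own illustration, not anything from the paper:

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class MOCReview:
    """The technical review a proposed procedure change must pass."""
    change_summary: str
    reviewer: str
    approved: bool
    reviewed_on: date


@dataclass(frozen=True)
class ProcedureVersion:
    """One immutable revision of a procedure, paired with the MOC that allowed it."""
    text: str
    moc: MOCReview | None  # None only for the initial version


class ProcedureRepository:
    """Version-controlled procedure store, in the spirit of infrastructure as
    code: every revision needs an approved MOC, and old versions are kept."""

    def __init__(self) -> None:
        self._history: dict[str, list[ProcedureVersion]] = {}

    def create(self, name: str, text: str) -> None:
        self._history[name] = [ProcedureVersion(text, moc=None)]

    def revise(self, name: str, new_text: str, moc: MOCReview) -> None:
        # Refuse any rewrite that hasn't passed its technical review.
        if not moc.approved:
            raise PermissionError(f"Revision of {name!r} rejected: MOC not approved")
        self._history[name].append(ProcedureVersion(new_text, moc))

    def current(self, name: str) -> str:
        return self._history[name][-1].text

    def changelog(self, name: str) -> list[ProcedureVersion]:
        # Old procedures and their MOCs stay available for audit and learning.
        return list(self._history[name])


# Example: revising a hypothetical startup procedure after an approved review.
repo = ProcedureRepository()
repo.create("reactor-startup", "1. Verify valves closed ...")
review = MOCReview("Add steam-trap check before startup", "reviewer A",
                   approved=True, reviewed_on=date.today())
repo.revise("reactor-startup",
            "1. Verify valves closed\n2. Check steam traps ...", review)
```

The point of the sketch is the shape of the process, not the code itself: a change can’t land without its review, and the history, including the reviews, stays inspectable.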

Many of the operators expressed that they were unhappy with how the company learned from failures, or rather that it didn’t. Operators across the board expressed a desire to be more informed about failures that happened in other plants. The current approach was for some information to be posted on the intranet, but the intranet often wouldn’t tell you when there was a new posting; you had to constantly seek postings out. Additionally, the reports were rarely produced with a level of detail that allowed operators to learn much. The authors conclude from this that people need to be more involved in the process, with ways to heighten their awareness that an accident could occur in their own plant and their own work, as opposed to just posting reports on the intranet.

In this company there was no structured learning, no way for the organization to learn as a whole; learning occurred mostly in small groups or teams. Many of the operators and managers even said that they just tried to keep things in mind in case they got into the same position later, but relying on people to remember is not a good overall approach, and it doesn’t leave room for the operators to share their knowledge. As the authors put it: “Distributing information about failures should not solely hinge on the intranets and computers—more personal communication is essential here.”

Takeaways

The authors conclude from the study that it is not enough to think of safety as something created by counting errors, or as something inherent in the system, or as something achieved by simply changing small pieces, whether individuals, procedures, or equipment. The system needs to be capable of adjusting to an ever-changing environment, which is likely a better predictor of safety in the future.

The authors provide their own summary:

  • Safety must be first, but there are always uncertainties and goal conflicts that make this very difficult in practice.
  • All incidents must be reported and analyzed, but it can be very difficult for managers and operators alike to agree on what counts as an “incident.” Furthermore, analyzing an incident is not the same as learning from it; for this a whole suite of follow-up activities is necessary.
  • It would be nice to say that operating procedures must be used as specified, and only changed after an MOC, but in reality there is always a gap between written guidance and actual practice. The real challenge for an organization is to be sensitive to this gap, to find out where and why it exists and resist judging the operator for not following the procedures as specified, as reasons for this may lie buried more deeply in the organization or operation.
  • Plants need sufficient operators and managers to operate the plant safely, but definitions of “sufficient” are often negotiable and based on incomplete evidence.
  • Person-to-person safety meetings are needed, and intranet/computer communication regarding safety should be discouraged or used solely as a complementary source of information, for only then are there real opportunities for sharing narratives about risk that people can use for vicarious learning.
  • Operators, engineers, and managers need to constantly adapt to a changing environment; this is the key factor in a resilient organization.

**Who are you?** I’m Thai Wood and I help teams build better software and systems.

Want to work together? You can learn more about working with me here: https://ThaiWood.IO/consulting

Can you help me with this [question, paper, code, architecture, system]? I don’t have all of the answers, but I have some! Hit reply and I’ll get right back to you as soon as I can.

**Did someone awesome share this with you?** That’s great! You can sign up yourself here: https://ResilienceRoundup.com

Want to send me a note? My postal address is 304 S. Jones Blvd #2292, Las Vegas, NV 89107

