Shifting Modes: Creating a Program to Support Sustained Resilience

This is an article by Alex Elman on how he has helped guide his organization from a traditional safety approach to one that fosters learning. It is part of a series of articles on resilience and offers some great tactical and strategic approaches to helping other organizations make the switch as well.

Imagine you work at a company that has a 100% reliable piece of software. This sounds great, but there are tradeoffs. Success, including high reliability, creates pressure to maintain that success as users become accustomed to it: pressure to keep shipping, to ship faster and with more features, and to continue delivering high reliability.

If the 100% reliability continues, then all time, energy, and focus (and consequently the development of expertise) will go toward responding to that pressure. After all, there are no incidents or retros to attend to.

The tradeoff is that when there is eventually an incident, employees will be very out of practice in how to respond. They won’t have had experience collaborating and working together through an incident.

They won’t have had experience understanding the system they support through the lens of surprise and recalibration experiences like incidents.

Without this experience responding to unforeseen disturbances to the system, there will be more incidents.

Two modes of operational safety

There are two modes, or approaches, of operational safety. The traditional one is Prevent and Fix. In this mode, incidents are seen as a sign of poor performance: a bad thing that happened and that “should have” been prevented. This mode focuses on what breaks and how to avoid accidents, and it relies on strict controls.

The second is Learn and Adapt. In a Learn and Adapt approach, incidents are a window through which a greater understanding of the system can be developed. It acknowledges the role of everyday work in creating safety. Choosing Learn and Adapt does not mean giving up on prevention: the insight gained can still inform preventative and response work. It means only that accident avoidance is not the main focus of the organization’s approach to safety.

In Learn and Adapt mode, incident analysis focuses on human factors and works to inform about process, collaboration, and decision making.

Shifting from Prevent and Fix to Learn and Adapt

The shift from a Prevent and Fix mode to a Learn and Adapt mode is not an overnight change. We’re talking about organizational change, after all.

Elman recommends 3 main ways to help drive this change:

  1. Find advocates in the organization. These are people who have a similar vision and goals. They can help create a larger movement as well as model the new desired behaviors.
  2. Communicate broadly. This applies not just to the message of wanting to shift the organization to a Learn and Adapt mode, but also to lessons learned and fixes developed. This goes beyond simply putting information in some sort of retrospective document; it needs to be seen across the organization. Communicating repeatedly and broadly across multiple channels is the only way to make that happen.
  3. Normalize new behaviors. Changing approaches to safety is more than just flipping a switch; it requires new behaviors and new approaches. Normalizing these new behaviors is critical in expanding adoption of the new mode.

Cultural traits of the learn and adapt mode

Elman also provides a few cultural traits that we can cultivate in our organization to help promote and maintain this mode. It’s important to note that this is not cultural in the sense of a “safety culture,” as there is no one particular culture required to create an environment of safety, but cultural as in norms that exist and are seen in the organization.

  • “Opportunity over obligation.” When people see a task as an opportunity instead of an obligation, they approach the work differently. “Opportunity is taken whereas obligation is assigned.” It is the role of leadership to bring attention to these opportunities, to define them clearly, and to make them attractive to work on.
  • “Flexibility over rigidity.” Imposing too much rigidity on how people accomplish their work limits adaptation and suppresses sources of resilience.
  • “Agility over speed.” Shipping things quickly is one thing, but being able to take a new stance, whether as a team or an organization as a whole, in the face of new challenges can be much more valuable.
  • “Trust over suspicion.” In the face of high-consequence events and time pressure, it can feel easy or natural to take a suspicious view of others’ actions. Elman encourages us instead to operate from a position that assumes others are acting in good faith. This helps avoid judgement and blame.

Promoting and nurturing a Learn and Adapt approach to safety

Elman also gives us some guidelines on how we can promote the move to the new mode:

  1. “Normalize stating your assumptions as much as possible.” This provides an opportunity to detect mismatches and then discuss them. It also allows both you and others a chance to recalibrate your mental models when a mismatch is detected. This is similar to some strategies around common ground.
  2. Normalize asking lots of questions. “Curiosity is an important cultural trait that nurtures Learn and Adapt.” Displaying this curiosity helps normalize it and can also be a great way to address errors that you detect, to be “curious instead of corrective.” Being curious can also help uncover what mental models others have developed.
  3. “Normalize increased cooperation between roles that don’t traditionally work together.” The author gives the example of Engineering (or Product) and Customer Service (or Client Success or similar). Customer-facing roles can be the first to notice or be informed about problems. They also have a unique perspective and have likely adapted their own fixes that may not be widely known. The work of making communication between teams smoother and faster needs to happen in advance of an incident, not during one.
  4. “Normalize sharing incident analysis deliverables with everyone in the company.” This comes back to communicating broadly. If something is discovered but just stashed in a doc somewhere where no one knows about it, it can’t be used. Elman quotes the STELLA report:

“Postmortems can point out unrecognized dependencies, mismatches between capacity and demand, miscalibrations about how components will work together, and the brittleness of technical and organizational processes. They can also lead to deeper insights into the technical, organizational, economic, and even political factors that promote those conditions. Postmortems bring together and focus significant expertise on a specific problem for a short period. People attending them learn about the way that their systems work and don’t work. Postmortems do not, in and of themselves, make change happen; instead, they direct a group’s attention to areas of concern that they might not otherwise pay attention to.”

Takeaways

  • A “Prevent and Fix” safety mode is very common. It is one in which incidents are seen as a failure, a sign of bad performance.
    • You can tell an organization is in this mode when the focus is on things like:
      • Strict controls.
      • Accident avoidance.
      • What breaks instead of what works.
  • A “Learn and Adapt” safety mode is one in which incidents are seen as a chance to learn more about the system being supported. In this mode incident analysis focuses on human factors, collaboration, and decision making.
  • Making the switch to a Learn and Adapt mode takes time, but it can be helped by:
    • Finding advocates.
    • Communicating broadly about the shift.
    • Normalizing new behaviors.
  • Success is what creates the groundwork for incidents, so focusing purely on incidents is misguided.
  • Focusing exclusively on shipping more and shipping faster as opposed to learning or responding can cause future incidents to be more difficult since responders will be out of practice.
