Measuring System Resilience with the Resilience Analysis Grid

“What metrics do I use to show or assess resilience?”

That’s almost always the first question I get when I talk to folks who are new to resilience engineering or are introducing it to their organization. Unfortunately, it’s a bit of a misleading question. It already starts off on the wrong foot.

Resilience is not something sitting in your system waiting to be discovered and mined like gold. It is not a property of the system. Resilience is not something you have, it is something that you do.

Even with that in mind, as a leader who is creating a team or initiative around resilience, there is almost always going to be the question of “how can we see if this is working?”

First off, I want to say, that’s a different question, though it often gets reduced to the “how do I measure this/what metrics do I use” question.

Before we dive in, let’s address that: how do you know if your resilience efforts, whatever they are, are working? Your best indicators are rarely going to be quantitative. When I talk with leaders about this, I often start with how they measure other things. For example, how do they measure the result of training and development that they invest in?

While this question can seem like a new one when it comes to resilience, it really isn’t. There are investments and measures that the organization is already using that don’t reduce to a single number, and perhaps don’t have a number at all.

But I know that folks say that they have to have one (or two, or ten). If that’s you, you’ll probably want to check out my analysis of Sidney Dekker’s Capacity Index. Here I’ll be highlighting another approach, this one from Erik Hollnagel: the Resilience Analysis Grid (RAG).

I’ve studied this method as well as four different case studies across multiple domains that have used and assessed this method, so by the time you’re done reading you should have a good sense of where this method shines, where it doesn’t, and when it might be a good fit for you and your system.

As I said, resilience is a verb. Because of this, Hollnagel provides a way to measure the resilience potentials or abilities, the things that your system needs to have in order for resilient performance to be possible.

There are four of these abilities, the ability to:

Respond
Monitor
Learn
Anticipate

These are the things that are assessed with the Resilience Analysis Grid.

Assessing the four abilities

The RAG is implemented as a series of questions that are adapted to the domain and the specific system by the people in it. The point of it is not to give a single, absolute number or point score, but to show a profile of how the system is doing on the four abilities.

You need to ask these questions continuously, not just about one point in time. This way you can see change over time. In this way, the RAG itself is a form of monitoring.

In order for it to be useful, Hollnagel says that you have to be able to rate the answers. He suggests a Likert-like scale:

Excellent
Satisfactory
Acceptable
Unacceptable
Deficient
Missing

He warns against aggregating the results into a single number, though he acknowledges it may be very tempting. The problems with doing so are that the questions aren’t weighted and that the gaps between the different answers vary in size. Instead, you could create a radar chart to help visualize how the system is doing on the abilities.
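To make the idea of a profile concrete, here is a minimal sketch of how rated answers could be recorded and summarized per ability without collapsing them into one score. Everything in it is an assumption for illustration: the names (Answer, profile, median_rating) and the example questions are hypothetical, and using the median as the per-ability value for a radar chart axis is my own choice, not something Hollnagel prescribes. The median only relies on the order of the ratings, so it sidesteps the uneven gaps between them.

```python
from collections import Counter
from dataclasses import dataclass

# The suggested ratings, ordered best to worst.
SCALE = ["Excellent", "Satisfactory", "Acceptable",
         "Unacceptable", "Deficient", "Missing"]
ABILITIES = ["respond", "monitor", "learn", "anticipate"]


@dataclass(frozen=True)
class Answer:
    ability: str   # one of ABILITIES
    question: str  # hypothetical question text, written for your own system
    rating: str    # one of SCALE


def profile(answers):
    """Count ratings per ability; deliberately no averaging or total score."""
    counts = {ability: Counter() for ability in ABILITIES}
    for a in answers:
        if a.ability not in counts or a.rating not in SCALE:
            raise ValueError(f"unexpected answer: {a}")
        counts[a.ability][a.rating] += 1
    return {ability: {r: c[r] for r in SCALE} for ability, c in counts.items()}


def median_rating(answers, ability):
    """Median rating for one ability, using only rank order, not distances."""
    ranks = sorted(SCALE.index(a.rating) for a in answers if a.ability == ability)
    if not ranks:
        return None
    # For an even count, this picks the better of the two middle ratings.
    return SCALE[ranks[(len(ranks) - 1) // 2]]


# One hypothetical assessment round.
round_1 = [
    Answer("respond", "Is there a prepared response for a regional outage?", "Acceptable"),
    Answer("respond", "Is it clear when a response should stop?", "Deficient"),
    Answer("monitor", "Is the indicator list revisited regularly?", "Satisfactory"),
    Answer("learn", "Do we investigate things that went well?", "Missing"),
    Answer("anticipate", "Do we regularly assess future threats?", "Unacceptable"),
]

for ability in ABILITIES:
    print(ability, median_rating(round_1, ability), profile(round_1)[ability])
```

Running the same functions on each later round and comparing the output side by side is one simple way to see change over time. The full counts are worth keeping next to any per-ability summary, since a single "Missing" can matter more than the median suggests.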

I’ll go over the four abilities and some of the issues that your questions should address.

The ability to respond

In order for the system to have the ability to respond, the system has to be able to sense that there is something to respond to. It also needs to be able to decide how to respond, and when to start and stop doing so.

This requires that plans and resources be available, or that the system be able to free up resources.

Questions around this ability might address issues such as:

What events does the system have a response prepared for?
How were those chosen? (Through risk assessment, regulatory demands, expertise, some industry standard?)
At what point is a response activated?
- Is that an absolute threshold or does it depend on other factors?
How fast is a complete response?
- How long can that be sustained?

The ability to monitor

Resilient performance requires monitoring, both of how the system is doing and of what the environment is doing.

This ability allows the system to take advantage of opportunities, not just respond to bad things.

Leading indicators tell you about events that are about to happen. But in order for them to be useful, you have to have some idea how the system functions. Without that you can’t interpret the indicators.

Most systems rely on lagging indicators, such as accident statistics.

The smaller the lag, the more useful the indicator. If there is a long lag between when something happens and when the indicator is noticed and interpreted, it can be too late to intervene. On the other hand, indicators become clearer as more time goes on.

Questions around this ability might address issues such as:

What indicators are monitored?
- How was that list generated?
- How often is it revisited?
  - Who does the revisiting?
  - Does the organization provide support for this revisiting?
How many leading vs lagging indicators are there?
How are indicators confirmed to be valid?
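A couple of these questions can be answered from a simple inventory, if you keep one. The sketch below is only an illustration under my own assumptions: the Indicator record, the example indicators, their leading/lagging labels, and the 180-day review cutoff are all hypothetical, and deciding whether an indicator is actually valid still takes people who understand how the system functions.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Indicator:
    name: str
    kind: str            # "leading" or "lagging"
    last_reviewed: date  # when someone last checked that it is still useful


def leading_vs_lagging(indicators):
    """Return (leading_count, lagging_count) for the inventory."""
    leading = sum(1 for i in indicators if i.kind == "leading")
    lagging = sum(1 for i in indicators if i.kind == "lagging")
    return leading, lagging


def overdue_for_review(indicators, today, max_age_days=180):
    """Names of indicators not revisited within max_age_days (arbitrary cutoff)."""
    return [i.name for i in indicators
            if (today - i.last_reviewed).days > max_age_days]


# Hypothetical inventory for a software team.
inventory = [
    Indicator("incident count per quarter", "lagging", date(2023, 1, 10)),
    Indicator("on-call pages per week", "lagging", date(2023, 9, 1)),
    Indicator("queue depth growth rate", "leading", date(2023, 8, 15)),
]

print(leading_vs_lagging(inventory))
print(overdue_for_review(inventory, today=date(2023, 10, 1)))
```

The counts are a conversation starter rather than a score. A list made up almost entirely of lagging indicators, or one that nobody has revisited in a long time, is a useful prompt for the questions above.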

The ability to learn

Since the environment changes, the system must have the ability to learn.

However, what is easy to learn may not be meaningful to learn.

But compiling extensive accident statistics does not mean that anyone actually learns anything. Counting how often something happens is not learning. Knowing how many accidents have occurred, for instance, says nothing about why they have occurred, nor anything about the many situations when accidents did not occur. Without knowing why something happens, as well as knowing why it does not happen, it is impossible to propose effective ways to improve safety.

By the numbers alone, it makes more sense to learn from success as opposed to failure, since that is what is happening much more often.

Questions around this ability might address issues such as:

Which events are investigated and which are not?
- How is this decision made?
  - Who makes it?
Are there attempts to learn from success or only from failure?
Is there organizational support for learning and analysis?
On what level or levels does the learning take place?

The ability to anticipate

Monitoring looks at the (usually recent) past, whereas anticipation looks to the future, for both good and bad things.

In complex systems, this ability is especially difficult. Only some of the ways that the system functions are known, and the system continues to change.

Questions around this ability might address issues such as:

What kinds of expertise are used to look forward?
How often are future opportunities and threats assessed?
How are expectations about the future shared within the organization?
What sort of model or assumptions about the future exist?
How far ahead does the system look?

Questions to ask

Though I listed some questions in each of the sections above, those are just topics or issues that you’d want your questions to address. That’s one of the difficulties with this method: while there is some guidance, you need to figure out what all of those things mean in the context of your system.

When it comes to developing and customizing the questions, Hollnagel says they:

should refer to concrete relations or characteristics of the system’s performance, to something that the respondents have experience with or something that is described in the system’s documentation

This is difficult for a few reasons. One is that if you have multiple groups or pillars that you want to gather this data from (as I hope you do!), you’d need to devise a separate set of questions for each (e.g. SRE, security, etc.).

As far as making the questions from things in the documentation, this shifts the already indirect assessment closer to the question “How well documented is the system, specifically around the four abilities?” which is pretty far from where we started and may not be the most useful thing to know.

Ultimately this means that you’re really surveying people’s assessment and understanding of how the system works. I don’t think this is a bad thing at all, but it’s important to note. Since resilience is not a property of a system but something that emerges, something that the system does, there isn’t some single truth or measure that you’re uncovering.

What to do?

As you can probably tell, I’m not super enthused with this approach. Why cover it then? Because it’s out there and people have tried using it, more so lately it seems, and I want you to be equipped to understand where it is useful and where it isn’t. It’s another tool in the toolbox for you to have.

What I really like about this is getting people’s sense of how the system is doing on the different abilities. Ideally this is taking place at multiple levels across multiple parts of the organization. This is a great way to get a sense of risks and opportunities available.

Where it can break down, both for us in software and in other industries (as evidenced by the various case studies), is that you can spend a lot of time and energy trying to get the questions “right” with very little feedback on whether the answers are really related to the questions you have. To combat this, you could add a phase where you talk to folks first to help inform question development.

Additionally, the results don’t necessarily tell you where or what good interventions might be. To help offset this, results could be discussed with the various participants, who may have insight into system-level changes that could be made, or who may highlight features that are currently effective and need to be preserved.

Takeaways

Resilience isn’t something you have, it’s something you do.
The resilience potentials or abilities are: respond, monitor, learn, and anticipate.
When examining the system to measure the resilience potentials, remember that the system is more than just the software and hardware; it includes the people as well.
These forms of measurement need to be done continually, not just at one moment in time, in order to be useful.
- Measurements such as these should also be made at different levels and from different viewpoints.
Hollnagel recommends basing the questions of the RAG on what is documented about the system, though this runs a risk of focusing on the wrong things.
Getting a sense of how the system is doing on the four abilities from various perspectives is very valuable and doesn’t have to be quantitative.
