Systems Thinking for Safety

Thanks to everyone who stopped by to say hi at SRECon! It was great seeing everyone!


Systems Thinking for Safety: Ten Principles. A White Paper. Moving towards Safety-II

This paper, by Steven Shorrock, Jörg Leonhardt, Tony Licu, and Christoph Peters for Eurocontrol, introduces many of the ideas we discuss here.

This might seem a bit backward, looking at general principles through the lens of a different domain, but it’s a useful way of seeing how others explain this stuff to different audiences.

This is a whitepaper, intended to be a very light overview, but it is still 40 pages.

It goes over 10 principles and gives a few bits of practical advice for each one along with a quote from someone in the field.

I think this paper is interesting because, even though air traffic control is of course a domain that cares about safety, it is also very much a business, so this isn’t purely academic the way a look at the domain as a whole might be.

I think this paper could be very useful as early reading in a reading group, or for distributing to folks who don’t “get it” yet. It also opens with an “Executive Summary,” so it may also be useful if you’re trying to get buy-in from leadership.

I’m also hoping that this issue itself can serve as a light introduction to these principles for folks just getting started or revisiting this world.

I’d say that if you’ve been a reader for a while and have also read Donella Meadows’s book on systems thinking, there probably isn’t going to be any new material here for you. But even then, I think there is strategy to be learned in how to deliver this stuff.

I’m reminded of a quote I read in the intro to the Joint Cognitive Systems book where Cassirer is quoted:

“It is, … the fundamental principle of cognition that the universal can be perceived only in the particular, while the particular can be thought of only in reference to the universal.”

This is something I see come up all the time and get questions about. It’s all well and good, but how do you apply it? How do you spread that knowledge without just asking people to read a ton of PDFs?

This is one answer, though it’s far from the only one. I’ll go over each principle briefly; many of them will likely be familiar if you’ve been a reader for a while.

In addition to the 10 principles, they suggest a “foundation of system focus,” which is of course necessary: if you aren’t willing to look at the system as a whole, you won’t be able to apply the principles or even see their relevance.

1. Field expert involvement

I’ve talked about this one a lot, so I was glad to see it highlighted first.

“Field experts” is what this paper calls the people at the sharp end.

It might seem obvious to include the people doing the work (in this case, air traffic control), but as you move up through higher levels of process, they tend to be involved less and less. This principle is an encouragement to remedy that.

Of course, people throughout the organization, the “blunt end” included, are experts and specialists in their own work, but they are unlikely to be experts in the work of frontline operators.

The “practical advice” on this principle is pretty much just what the title says: involve frontline operators in things like learning and investigation.

2. Local rationality

This one is probably very familiar; others have said it a lot, and if you’ve been a reader for a while you will have seen it here too.

Local rationality just means that our rationality, the way we think, reason, and make decisions, is shaped by what is local to us. What is local to us includes things like what we need to do (demands), our knowledge, and our context.

At its simplest, local rationality means that people do things because those things made sense to them at the time for some reason. This is true whether we later judge the outcome “good” or “bad.”

This is important because if it made sense to one person, it could make sense to others. Again, this applies whether the outcome is good or bad.

This principle leads us to investigate how some action, strategy, or behavior made sense at the time. Maybe we want to make a behavior less likely, or maybe we want to encourage it! Either way, to truly understand why someone did something, we need to understand how it made sense to them.

This also points us at another important idea: if our rationality is local, then there can’t possibly be just one single viewpoint. As we investigate and try to understand events and work, we need to seek out those multiple perspectives, since one single perspective or truth doesn’t exist.

3. Just culture

We cannot decompose complex systems into their parts and then assess the parts for safety. We’ve discussed this in The Role of Software in Spacecraft Accidents and elsewhere.

This applies not just to the machine parts but also to the human parts of the system (that’s why it’s sociotechnical, after all). This means we can’t say some part of how someone performed is attributable only to them; it must be looked at in the context of the system as a whole.

Examining that whole is part of a just culture. They define it as:

a culture in which front-line operators and others are not punished for actions, omissions or decisions taken by them that are commensurate with their experience and training, but where gross negligence, wilful violations and destructive acts are not tolerated.

4. Demand and pressure

If you’ve read Donella Meadows’s Thinking in Systems: A Primer, this one will seem very familiar.

Demand and pressure are important parts of understanding a system because a system responds, for good or ill, to demand.

Pressure can result from demands, and it can vary in terms of which resource is being pressured; time pressure is one example.

The system (and the people who are part of it) responds to that pressure. This is most often visible in the sacrificing of long-term goals for short-term ones as pressure rises.

It’s possible to consider two types of demand from an organizational perspective.

One is “value demand” and the other is “failure demand”…

This is just a way of saying the organization wants some things to happen and other things not to happen. They are both wants.

When trying to understand a system and how or why it performs in some way, it’s important to examine how the system meets (or doesn’t meet) demand and how it responds to pressure.
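To make the value/failure distinction concrete for a software audience, here’s a minimal sketch of my own (not from the paper) that tags hypothetical incoming tickets as value or failure demand and reports how much of the workload exists only because earlier work fell short:

```python
# Hypothetical illustration of value vs. failure demand (not from the paper).
# Value demand: requests for what the service exists to provide.
# Failure demand: work that exists only because something wasn't done,
# or wasn't done right, the first time.

TICKETS = [
    {"id": 1, "summary": "Provision new staging environment", "kind": "request"},
    {"id": 2, "summary": "Deploy still broken after yesterday's fix", "kind": "rework"},
    {"id": 3, "summary": "Add capacity ahead of launch", "kind": "request"},
    {"id": 4, "summary": "Any update on last week's ticket?", "kind": "chase-up"},
]

FAILURE_KINDS = {"rework", "chase-up"}  # demand caused by earlier shortfalls

failure = [t for t in TICKETS if t["kind"] in FAILURE_KINDS]
print(f"failure demand: {len(failure)}/{len(TICKETS)} tickets "
      f"({100 * len(failure) / len(TICKETS):.0f}% of workload)")
```

Even a rough tally like this can show how much of a team’s demand is really the system feeding back its own earlier failures.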

5. Resources and constraints

We’ve just talked about satisfying demands, but the ability to meet that demand hinges on there being enough resources to do so.

Resources are anything you need to do some part of the work; some are consumable, some durable. They can include things like material, equipment, software, time, or information.

As both demand and resources vary, people also vary how they do the work, adjusting and adapting to the situation. This performance variability is an important part of work within the system. There may of course sometimes be unwanted or negative variability in performance, but most of the time (all the times there aren’t accidents or incidents) this variability provides flexibility in the system, flexibility that must be present in order to meet demands.

They quote David Woods:

People create safety under resource and performance pressure at all levels of socio-technical systems

In order to understand how that safety is created, you need to understand what resources are available and what constraints exist, both during normal operations and during a particular incident.

6. Interactions and flows

Part of viewing a system as a whole is looking at how work flows throughout it and the interactions that take place along the way. It can seem like some work takes place in a way that affects only a small area, but there is always some other area that is affected, even if we can’t see it. Non-obvious interactions are part of what makes complex systems complex.

These interactions can also create goal conflicts which can lead to different parts of the system working at cross purposes, one of the ways that adaptive systems fail.

This points to one of the ways we can affect the system as a whole: by working to change flow, instead of just focusing on a small function. To do so, though, we need to understand the purpose of the work in each part. And we get that by working with the people on the front lines, which is Principle 1, field expert involvement.

7. Trade-offs

In complex systems, work is always underspecified. This means that in most cases, we can’t just list out the steps to do some job. Between goal conflicts, the expertise needed, and changing demands, the work as done will differ from what was prescribed.

This leads to situations where all the available options are suboptimal in some way. We still have to make a choice, a trade-off, anyway. This is a very different view from the more traditional one that says work is about compliance with some prescriptive rule set.

Since no resource is infinite, trade-offs are an inherent part of work within the system. We can’t get away from them. But we can understand more about how they are made.

An important trade-off to understand is the “efficiency-thoroughness trade-off” or ETTO from Erik Hollnagel.

ETTO helps to frame how people and organisations try to optimise performance; people try to be as thorough as necessary, but as efficient as possible.

The authors highlight an important step as it relates to trade-offs, “Get ‘thick’ descriptions,” ones that provide information about context, not just a behavior on its own.

8. Performance variability

Similar to trade-offs is the idea of performance variability. As we discussed, work is always underspecified, so you can’t contain the whole of the work in a list of procedures, no matter how long the list is. This means people are always adapting in some way, even if they’re just making small adjustments.

This is another aspect inherent to work in complex systems, something else we can’t get away from and probably wouldn’t want to even if we could. Without this variability, success would not be possible.

The authors provide some questions to ask:

  1. “Is performance variability within acceptable limits?”
  2. “Is the system operating within acceptable limits?”
  3. “Are adaptations and adjustments leading to drift into unstable or unwanted system state?”

For more on how you can influence the system, especially if the answer to the third question is yes, see Safety II Professionals: How Resilience Engineering Can Transform Safety Practice.

Again, remember that if you do find variability that is undesirable, the best action is to act on the system, not one individual.

What constitutes drift will of course vary across organizations. This means you’re likely well placed to think of what sort of things constitute drift in your domain.
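As a toy illustration of that third question, here’s a sketch of my own (not from the paper) that watches a made-up signal of adaptation, how often a hypothetical pre-change checklist gets skipped, and flags when the rolling rate creeps past an agreed limit:

```python
# Hypothetical drift check (my illustration, not from the paper).
# skipped[i] is 1 if the (made-up) pre-change checklist was skipped
# in week i, 0 otherwise. A slowly rising rate can signal drift.

skipped = [0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1]

WINDOW = 4   # weeks per rolling window
LIMIT = 0.5  # the "acceptable limit" the organization agreed on

for week in range(WINDOW, len(skipped) + 1):
    rate = sum(skipped[week - WINDOW:week]) / WINDOW
    if rate > LIMIT:
        # Adaptations have crept past the agreed limit: possible drift.
        print(f"week {week}: skip rate {rate:.2f} exceeds {LIMIT}")
```

The point isn’t the code; it’s that drift is gradual, so you need some agreed-upon limit and a way of noticing movement toward it, rather than waiting for an incident to reveal it.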

9. Emergence

We’ve talked before about the idea that you can’t decouple parts of the system to assess safety. Emergence explains a bit about why that is.

The authors quote David Woods:

Emergence means the simple entities, because of their interaction, cross adaptation and cumulative change, can produce far more complex behaviors as a collective and produce effects across scale.

The authors cite the Mars Polar Lander crash as an example of emergence, which I’ve covered as part of my analysis of The Role of Software in Spacecraft Accidents.

Emergence is especially evident following implementation of technical systems, where there are often surprises, unexpected adaptations, and unintended consequences. These force a rethink of the system’s implementation and operation. The original design becomes less relevant as it is seen that the system-as-found is not the system-as-imagined.
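For a loose, hands-on illustration of simple local rules producing collective behavior (a sketch of my own, unrelated to the paper or the Mars Polar Lander), consider Conway’s Game of Life: each cell follows one trivial neighbor-counting rule, yet a “glider” emerges that travels across the grid as a whole:

```python
# Minimal Game of Life: each cell only "knows" its 8 neighbors, yet a
# glider emerges that moves across the grid -- behavior that belongs to
# the collective, not to any individual cell.

SIZE = 10
glider = {(1, 2), (2, 3), (3, 1), (3, 2), (3, 3)}  # classic glider pattern

def step(live):
    # Count live neighbors for every cell adjacent to a live cell.
    counts = {}
    for (r, c) in live:
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                if (dr, dc) != (0, 0):
                    key = ((r + dr) % SIZE, (c + dc) % SIZE)
                    counts[key] = counts.get(key, 0) + 1
    # Birth on exactly 3 neighbors; survival on 2 or 3.
    return {cell for cell, n in counts.items()
            if n == 3 or (n == 2 and cell in live)}

cells = glider
for _ in range(4):    # after 4 steps the glider has shifted by (1, 1)
    cells = step(cells)
print(sorted(cells))  # same shape, displaced: an emergent "thing"
```

No cell has a rule that says “travel diagonally,” yet the pattern as a whole does. That is the flavor of behavior that makes assessing parts in isolation insufficient.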

10. Equivalence

This is a very important principle that we can begin to see after examining and understanding the other ones. I also think that this one is likely the hardest to understand and convince others of.

Success and failure come from the same source – ordinary work.

There aren’t two separate types of work, success and failure, good and bad. These are labels and outcomes that become clear to us only with hindsight. As the local rationality principle tells us, had someone known a particular action would lead to a negative outcome, they wouldn’t have chosen it; something made it make sense at the time.

When things go wrong in organisations, our assumption tends to be that something or someone malfunctioned or failed. When things go right, as they do most of the time, we assume the system functions as designed and the people work as imagined.

This principle reveals a lot about how we can effectively investigate and interact with the system. If failure and success are equivalent, both coming from normal work, then in order to understand the system more completely, we must investigate normal work.

Questions?

If you have your own questions, you can join the community call/office hours next week, where I’ll be talking a bit more about this and also taking your questions.

Takeaways

  • Though this paper comes from aviation, rather than making us look at that domain and extract the broadly applicable ideas ourselves, it states its principles broadly and explains them only minimally in terms of aviation.
    • This shows us one way we might explain these principles to people in our industry.
  • The principles are field expert involvement, local rationality, just culture, demand and pressure, resources and constraints, interactions and flows, trade-offs, performance variability, emergence, and equivalence.
  • All of the principles rest on a system focus; without it, it’s hard, if not impossible, to apply them.
  • In order to make meaningful change, we need to act upon the systems, not the individual parts (or individuals themselves).
  • Ultimately, in order to better understand the systems we work on and in, we need to examine normal work.
