Richard Cook and Jens Rasmussen discuss the difficulties that arise when systems move from a loosely coupled state to a very tightly coupled one, and the effects that follow.
They use the example of a hospital, but almost everything about the hospital as a sociotechnical system is recognizable in our own systems and organizations in software.
The expression “going solid” comes from the nuclear power industry and the management of a steam boiler. A boiler behaves one way while it is partially filled with liquid, but once it becomes completely full (“going solid”), it suddenly behaves very differently. It is now in a state that is more dangerous and harder to manage.
While many of our systems today are already tightly coupled, we can still experience this sort of contraction, where things become even more tightly coupled, with similar effects. If the system in question is already tightly coupled, we may also already be in the “solid” state and not realize it. This is one of the problems with a system “going solid”: its behavior has changed, but we may not know that and may keep treating it as if it were in a different, potentially safer state.
When a system, in this example healthcare, is loosely coupled, buffers are available. As new technologies and management methods advance, inefficiencies are reduced. This is generally good, but it comes at the cost of the buffers that helped protect the system from surges of demand or other disturbances.
As a result, situations occur in which activities in one area of the hospital become critically dependent on seemingly insignificant events in seemingly distant areas.
We experience the same thing in software all the time. Something that previously seemed far away or of little consequence is suddenly affecting our system. Examples in our world include cloud provider services, other software in our system that we were unaware of, or even external software packages changing or being removed.
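To make the buffer trade-off concrete, here is a minimal sketch in Python. It is a toy illustration, not something from the paper: the function name `simulate`, the demand figures, and the capacity are all made up for the example. It assumes a service that handles a fixed number of items per time step, holds overflow in a bounded buffer, and drops whatever does not fit.

```python
def simulate(demand, capacity, buffer_size):
    """Return how many work items are dropped at each time step.

    Hypothetical toy model: each step serves up to `capacity` items,
    keeps at most `buffer_size` unserved items waiting for the next
    step, and drops the rest.
    """
    queued = 0
    dropped = []
    for arriving in demand:
        backlog = queued + arriving
        served = min(backlog, capacity)
        waiting = backlog - served
        queued = min(waiting, buffer_size)   # what the buffer can hold
        dropped.append(waiting - queued)     # what spills over
    return dropped


# Made-up demand with a short surge in the middle.
demand = [8, 8, 15, 15, 8, 8, 5, 5]
capacity = 10

loose = simulate(demand, capacity, buffer_size=10)  # slack kept around
tight = simulate(demand, capacity, buffer_size=0)   # slack "optimized away"

print("dropped with buffer:   ", sum(loose))
print("dropped without buffer:", sum(tight))
```

Under these assumed numbers, the same surge that the buffered system absorbs quietly becomes dropped work once the buffer is removed, even though both systems look identical during normal demand.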
Using Rasmussen’s dynamic model of risk allows us to better picture some of the forces at work here. This may sound familiar. Here is an example from the author:
In this model, to avoid accidents or incidents, the organization works to keep the operating point away from the boundaries. While this sounds simple, it is not easy. Where these boundaries are is not known, nor do they stay in a single place.
The boundary can be discovered somewhat through exploration, but because of its unknown, changing location it can be crossed accidentally as well.
As organizations work to keep the operating point inside the boundary to avoid accidents, a “marginal boundary” is created. Inside it, the organization is comfortable working and sees the risk of an incident as acceptably low.
In this model, a “near miss” occurs when the operating point crosses the marginal boundary and this is recognized before an accident happens. The organization then works to move the operating point back inside the margin. It also likely works to keep the operating point as near to the margin as possible for economic reasons. Because of this, resources and sources of resilience that might help move the operating point away from the marginal boundary can sometimes be viewed as waste.
In SRE parlance, we could say that SLOs and SLIs are attempts at making this marginal boundary explicit and more visible.
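As a rough sketch of that idea, the Python below treats an availability SLO as the marginal boundary and reports where the operating point sits relative to it. This is an illustration under stated assumptions, not an established SRE formula or anything from the paper: the function name `margin_report`, the returned field names, and the request counts are hypothetical. Note that the accident boundary itself does not appear in the code, because in the model its location is unknown.

```python
def margin_report(target, total_requests, failed_requests):
    """Describe the operating point relative to an SLO.

    Hypothetical helper: `target` is the SLO as a fraction (0.999 means
    99.9% of requests should succeed). The SLO stands in for the
    marginal boundary; the real accident boundary is not modeled.
    """
    if not 0 < target < 1:
        raise ValueError("target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")

    sli = (total_requests - failed_requests) / total_requests
    allowed_failures = (1 - target) * total_requests
    return {
        "sli": sli,
        # True while the operating point is inside the marginal boundary.
        "inside_margin": sli >= target,
        # Fraction of the error budget left; negative once the margin is crossed.
        "budget_remaining": 1 - failed_requests / allowed_failures,
    }


# Made-up traffic numbers for illustration.
comfortable = margin_report(0.999, 1_000_000, 400)
near_miss = margin_report(0.999, 1_000_000, 1_500)

print(comfortable["inside_margin"], round(comfortable["budget_remaining"], 2))
print(near_miss["inside_margin"], round(near_miss["budget_remaining"], 2))
```

In the second call the operating point has crossed the margin without any accident being recorded, which is the situation the model calls a near miss.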
In this model we can begin to see that some number of incidents may in fact be useful. Cook points out that without them, especially over longer periods, the marginal boundary can begin to move outwards, whereas incidents, especially painful or publicized ones, can cause the opposite effect.
- Safety is a dynamic process, not a static situation.
- It can be modeled using Rasmussen’s dynamic model of risk.
- Tight coupling can provide economic and performance advantages, but can make systems more difficult to manage and assess, which can increase the likelihood of accidents.
- The shift from loose coupling to tight coupling can cause a loss of buffering capability that previously helped the system weather demand changes, like surges.
- Going solid is an expression from the nuclear power industry that refers to a water boiler changing state.
- When the organization works to keep the operating point inside the accident boundary, a marginal boundary is created. Inside it, the risk of an accident is seen as low and the organization is comfortable working.
- Systems often operate near the marginal boundary of safety, so that productivity is maximized while costs are minimized.
- The exact location of the boundary is unknown and changing; it is discovered through exploration and experience.