This is a paper by David Woods focusing on cognitive challenges that problem solvers face when responding to various systems.
Woods picks up from Jens Rasmussen’s work, which explored the “psychology of behaviour in complex systems” and asked questions like:
- What is complexity?
- How can we map the inherent complexities of particular worlds?
- What cognitive demands does a world impose on problem solvers?
- How do people cope with these demands and complexities in the process of performing adequately most of the time?
- How can we design systems and provide tools that expand human abilities to cope with complexity?
It’s those last two questions that I think are especially important and recurrent for us in software. If you’re an SRE type, I’d argue that continuing to answer those last two questions is the core of the work.
Woods focuses mostly on the first three questions, which paves the way for us to understand and answer the last two from within our own domain.
Understanding which situations and challenges can take place in your world allows you to better design or evaluate tools to support those cognitive processes.
“we need[…] to understand human behavior in complex situations. Such an understanding is essential if advances in machine power are to lead to new cognitive tools that enhance problem-solving performance.”
There are four dimensions that can affect complexity and consequently what challenges can arise:
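- “Dynamism,” or how dynamic a world is.
- The number of parts in the given system and how connected they are.
- Uncertainty, or how much evidence has to be gathered and integrated to assess the state of the world.
- Risk, or how costly the possible outcomes of a decision can be.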
Any problem solving world can be mapped to some point along these four dimensions.
This is useful because understanding what sort of world we are talking about, or will face, can help us be aware of hazards and design tools and strategies to fit. The dimensions are not simply additive, though. Although I’ll break them down separately, they relate to each other a lot:
a world that is both dynamic and highly interconnected is not the sum of a world that is only highly dynamic and a world that is only highly interconnected.
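One way to make this mapping concrete is to record a world as a point on the four dimensions and look up likely hazards from it. The sketch below is purely illustrative: the 0-to-1 scores, the `HIGH` threshold, and the names `WorldProfile` and `expected_hazards` are my own assumptions, not anything taken from the paper. The combined dynamic-and-interconnected case is modeled as its own entry rather than as the sum of the two single-dimension entries.

```python
from dataclasses import dataclass

HIGH = 0.7  # assumed cut-off for calling a dimension "high"; not from the paper


@dataclass(frozen=True)
class WorldProfile:
    """A problem-solving world as a point on the four dimensions (0.0 to 1.0)."""
    dynamism: float
    interconnection: float
    uncertainty: float
    risk: float


def expected_hazards(world: WorldProfile) -> list[str]:
    """Return the hazards this post associates with each high dimension."""
    hazards = []
    dynamic = world.dynamism >= HIGH
    connected = world.interconnection >= HIGH
    if dynamic:
        hazards.append("fixation: failing to revise the situation assessment")
    if connected:
        hazards.append("missed side effects of actions")
    if dynamic and connected:
        # Not the sum of the two entries above: a distinct situation emerges.
        hazards.append("disturbance management: cope with effects while diagnosing")
    if world.uncertainty >= HIGH:
        hazards.append("red herrings and data that does not fit")
    if world.risk >= HIGH:
        hazards.append("outcome costs must be weighed in each decision")
    return hazards


# Hypothetical scores for a typical software operations setting.
software_ops = WorldProfile(dynamism=0.9, interconnection=0.9, uncertainty=0.8, risk=0.8)
for hazard in expected_hazards(software_ops):
    print(hazard)
```

The scores themselves would be a judgement call in practice; the point of the sketch is only that the interaction between dimensions needs its own handling.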
A dynamic world is where we live when supporting or designing software systems. Tasks can overlap and new events can happen at any time.
This is where “problem solving incidents unfold in time and are event driven.”
Events can happen at indeterminate times, which means there can be:
- Time pressure
- Overlapping tasks
- Changes in the nature of the problem
- Changes in monitoring requirements
Because of these changes, it’s possible that the problem to be solved can change as an incident goes on. In order to continue to be effective:
“problem solvers should be opportunistic and flexible in order to detect and to adapt to events which require revision of situation assessment and plans.”
This could mean changing things like:
- Understanding of the situation
- Evidence gathering strategy
- Evaluation tactics
- Response strategy
To do otherwise is seen as a fixation error.
This means that “there is no one, single diagnosis task in dynamic worlds; rather there is a continuing need for situation assessment (only one element of which is diagnosis of correctable causes of disturbances).”
Woods tackles the myth that responding to an incident means making a single diagnosis, then generating a plan, then executing it. In actual incidents this doesn’t hold up at all. Real incidents require tracking how the incident develops over time.
The dimension covering parts and their interconnections itself has four components:
- The number of parts
- How complex those parts are
- How connected those parts are
- What kind of interconnections exist
What a problem solver knows about the way the pieces or “sub-worlds” fit together will greatly influence their performance. (Some parts are likely to be complex enough to warrant their own ranking along these dimensions.)
With high interconnectivity, expect actions to have side effects for other parts of the system. A common error that can occur here is missing side effects.
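To illustrate why side effects are easy to miss, here is a minimal sketch that walks a dependency graph to list every part an action on one part could reach. The part names and the `AFFECTS` mapping are hypothetical, and a real system's connections are rarely known this completely.

```python
from collections import deque

# Hypothetical system: each part maps to the parts it directly affects.
AFFECTS = {
    "config-service": ["api", "worker"],
    "api": ["web-frontend"],
    "worker": ["queue"],
    "queue": ["api"],
    "web-frontend": [],
}


def possible_side_effects(graph: dict[str, list[str]], target: str) -> set[str]:
    """Return every part reachable from `target`, excluding `target` itself."""
    seen = {target}
    pending = deque([target])
    while pending:
        part = pending.popleft()
        for neighbour in graph.get(part, []):
            if neighbour not in seen:
                seen.add(neighbour)
                pending.append(neighbour)
    return seen - {target}


print(sorted(possible_side_effects(AFFECTS, "worker")))
# ['api', 'queue', 'web-frontend']
```

An action on `worker` only touches `queue` directly, yet two more parts sit downstream of it. Those indirect paths are the ones a responder is most likely to overlook.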
“Knowledge about the kinds of relationships between parts (e.g. goals, inter-goal constraints, functional decompositions, alternative versions, requirements, post-conditions) and about the state of the parts.”
In this sort of world problem formulation is a very important cognitive task. This is because, if the parts are highly connected, multiple disturbances can exist and interact even if the underlying faults are unrelated.
Deciding whether separate, multiple failures exist requires judgement about how the disturbances relate, which again relies on having knowledge and (perhaps multiple) mental models of how the various parts fit together.
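As a rough illustration of that judgement, the sketch below asks whether two disturbances share any upstream part in a given model of the system. Everything here is hypothetical: the part names, the `AFFECTED_BY` mapping, and the helper functions are assumptions for the example, and the answer is only as good as the mental model encoded in the mapping.

```python
# Hypothetical model: each part maps to the parts that can directly affect it.
AFFECTED_BY = {
    "checkout-latency": ["payments-api", "database"],
    "search-errors": ["search-index", "database"],
    "login-failures": ["auth-service"],
    "payments-api": ["database"],
    "search-index": [],
    "auth-service": [],
    "database": [],
}


def upstream(model, part):
    """Return every part that can affect `part`, directly or indirectly."""
    found = set()
    stack = list(model.get(part, []))
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(model.get(current, []))
    return found


def shared_candidates(model, first, second):
    """Parts that could, under this model, explain both disturbances at once."""
    return upstream(model, first) & upstream(model, second)


print(shared_candidates(AFFECTED_BY, "checkout-latency", "search-errors"))   # {'database'}
print(shared_candidates(AFFECTED_BY, "checkout-latency", "login-failures"))  # set()
```

An empty result does not prove the failures are separate. It only says this particular model contains no common source, which is one reason holding more than one mental model can help.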
When you combine this world with one that is dynamic, the “disturbance management cognitive situation arises.” Here a problem solver not only has to cope with the effects of the disturbances themselves, but must also find, diagnose, and address the fault that is causing them.
Some strategic responses available to them are:
- Attempt to adjust the disturbed process.
- Find and correct (if possible) what produced the disturbance.
- Respond to cope with the effects of the disturbance (because of insufficient time or ability to repair the affected process).
In disturbance management, knowledge about the interconnected parts is not the only thing that matters. Knowledge about how much time various potential fixes could take matters too, e.g. “Will I be able to adjust or repair the process (implying knowledge of why it failed) before undesirable consequences occur.”
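The role of time knowledge can be sketched as a simple filter over candidate responses. This is an assumption-laden toy, not a method from the paper: the `Option` type, the example responses, and the minute estimates are all made up, and in a real incident producing those estimates is the hard part.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    name: str
    kind: str           # "adjust", "repair", or "cope"
    est_minutes: float  # responder's estimate of time to take effect


def feasible_options(options, minutes_until_consequence):
    """Keep options estimated to take effect before consequences occur, fastest first."""
    in_time = [o for o in options if o.est_minutes < minutes_until_consequence]
    return sorted(in_time, key=lambda o: o.est_minutes)


# Hypothetical responses to a hypothetical incident.
options = [
    Option("roll back the last deploy", "repair", 25),
    Option("raise the connection pool limit", "adjust", 5),
    Option("shed low-priority traffic", "cope", 2),
]

for option in feasible_options(options, minutes_until_consequence=10):
    print(f"{option.kind}: {option.name} (~{option.est_minutes} min)")
```

With only ten minutes available, the repair drops out and the responder is left with adjusting or coping, which matches the case in the list above where there is insufficient time to repair the affected process.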
Interconnections also affect the cognitive work of diagnosis. If multiple parts are capable of affecting a single goal or subsystem, then a disturbance there could originate from any one of them.
“Breakdowns in cognitive processing at this level involve:”
- inadequate knowledge (e.g. relationships among units, inadequate mental model of process, knowledge of time responses)
- failures to consider side effects
- requirements and post-conditions
- failures to revise focus (one kind of fixation)
- prospective memory
- poor strategies/use of external guidance for selecting a focus (e.g., working on the cause of one disturbance when the consequences should be treated first, or working on more detailed issues without ensuring that relevant higher order issues are stable)
That may seem like a lot, but as the author points out, these all have a common theme: “when confronted by a world with extensive interconnections between large numbers of parts, problem-solving errors often involve mistaking one factor related to the state of the world as the single explanation for that state of the world.”
That also strikes me as a great way to describe the software systems we build and care for, “extensive interconnections between large numbers of parts.”
Our own tools can cause some of these failures and fixations though, especially premature localization to a single factor: “vulnerability to the premature localization error is high when highly interconnected worlds are simplified.”
I know that I’ve fallen into that trap before, trying to represent highly interconnected worlds in a more simplistic way (cough Grafana dashboard cough) only to accidentally encourage such a localization by its (unintended) users.
Part of the cognitive work of incident response is gathering and integrating evidence about the state of the world. There are several factors that make up these demands:
- How much collection and integration is needed to answer questions that arise about situation assessment.
- The kinds of integration required.
- How the world can change while these processes are happening.
When data is uncertain, extra steps are required to go from obtaining that data to an answer about the world. Further, some data won’t fit: blind corners, red herrings, and the like will exist.
In this world, there will be actions taken solely to gather diagnostic information (as opposed to action to address the problem directly).
When you combine this with a world that is also dynamic, developing strategies for gathering data becomes even more important. In these cases
When risk is present, decision making gets more difficult as the costs of the different possible outcomes of the decisions must be considered as well.
Of course in our world of software systems, we are almost always going to be fairly high on this dimension. As a result, I won’t say much about this here, but it’s important to know how it affects decision making.
Thinking about the worlds that we face (and create) along these lines can help us be aware of the challenges and pitfalls that can occur. The dimensions all affect what a good strategy might look like and can help us decide what a good supporting tool (or a harmful one) can look like.
We can also use this language and point of view to further discuss and design strategies and tools beyond worlds being “simple” or “complex.”
“Analyses of this kind contribute to the development of a cognitive language of description of problem solving situations and strategies.”
- Knowing what cognitive challenges responders can face allows better design and analysis of support tools.
- Problem solving worlds can be measured along four dimensions:
  - dynamism
  - number of parts and their interconnections
  - uncertainty
  - risk
- “Simple” worlds are low on all four dimensions, whereas complex worlds are high on all four.
- Though we can rank a problem solving world along these four dimensions and talk about them separately, they can be interrelated.