This week we have a paper by Gary Klein from Cognition, Technology & Work. If you’d like to read it yourself, you can get it from a two-week trial of DeepDyve (I haven’t used them, just noticed they have the paper), or if you’d like a copy for your individual use, please email me.
Failing to detect a problem can mean losing the opportunity to mitigate or avoid consequences.
Teams have some advantages over individuals for this:
- Teams can see a wider range of cues than a single person.
- They can have a wider range of expertise and perspectives.
- A team is less likely to get stuck on a single interpretation.
- They can reorganize themselves in different ways.
- They can work in parallel.
Teams, though, can be subject to their own issues with problem detection. Tasks must be broken down so multiple people can work on them, which means the work will later have to be reassembled. And of course there are costs to coordination. (For more on coordination see issues 25 and 26.)
Most of the issues that teams encounter with problem detection occur because of these two things: reassembling the work or the cost of coordination. As a result, we can’t really say that teams are better at problem detection than individuals; they are perhaps just a requirement, or a better fit, for some situations.
Ongoing operations require teams. One individual can’t do it forever. Also, if an operation or task is critical enough, some redundancy will be required in case that person becomes unavailable.
Problem solving is not a linear process; it happens in a cycle. Detecting a problem is part of that cycle, and often it may be the initiator of problem solving itself. In the course of trying to diagnose a problem, other problems can be discovered. This is why it’s helpful to think of this as a cycle instead of just a sequence of steps.
The paper focuses only on the initial indication that something is wrong, that is, the steps before identification, not identification itself.
Problem detection in individuals can involve an “escalation of suspicion,” where subtle cues can at first be explained away or ignored, or those subtle cues can cause someone to reframe the problem and see it in a new light.
It’s important to note that this paper is a preliminary investigation. It focused on identifying barriers to problem detection for teams, especially ones that occur because of difficulties of coordination. Any barriers that also occurred in individuals were excluded, so that only those occurring at the team or organizational level remained.
The study looked at 24 incidents from different sources:
- Nine from studies that used the Critical Decision Method
- Five of those were from a study of a Neonatal Intensive Care Unit (NICU)
- Four were studies of military decision making
- Four were from published events
- Apollo 13 launch and recovery
- Challenger (Diane Vaughan’s book specifically)
- USS Enterprise on Bishop Rock (Karlene Roberts' work)
- Failure to anticipate the Pearl Harbor attack (Roberta Wohlstetter’s work)
- Twelve from Charles Perrow’s work on various accidents
“The purpose of this study was to provide a basis for formulating hypotheses, not to test them.”
Klein breaks down the findings into a few categories, based on what they affect:
- the initial stance of alertness
- the recognition of initial cues
- the interpretation of the cues
- the action taken
Phase of alertness
Teams that showed better common ground ended up being better at problem detection.
“Production pressure discourages vigilance for problems”
Klein points out that organizations facing schedule pressure don’t want to be distracted, and as a result end up finding ways that make it harder to move to higher alertness.
“Team members face different consequences as a result of failure, and this conditions their level of alertness”
A study of U.S. Navy ships found that decision makers on ships that were under threat were at a higher alert level than decision makers on other ships.
This seems intuitive to me, but what’s especially interesting is that:
“The officers being threatened tended to react more vigorously to threats, and the officers on other ships were more concerned with deconfliction”
Klein points out that a similar problem can manifest in organizations, where a team under threat has trouble getting others motivated to be “appropriately alert”, especially when one team assigns another to monitor conditions but the monitoring team may not face the same consequences should a threat arise.
Phase of cue recognition
This is the phase where a team may have trouble either detecting a subtle cue or recognizing a pattern, or, if they do, they don’t share that detection with others. This is similar to what I’ve written about previously.
As we learned before, individuals often detect the signals, but the team doesn’t react to them.
“The high cost of sending information filters out important messages”
It takes time and effort for teams to get the information they need. Additionally, if a team sends out a lot of information, others may not give it as much attention.
“A team can fail to notify its members, assuming that they already have the information they need”
This is essentially the fundamental common ground breakdown.
“A team member can assume that he/she is being kept informed, and that the absence of a message must mean nothing has happened”
This can be exacerbated during communications outages.
“There is a disconnect between the people who have critical information and the people who understand the significance of that information”
Klein provides an example of this in the Challenger disaster, during a meeting between Thiokol and the Marshall Space Flight Center.
An engineer from each knew that water had previously been found in a joint. Because the launch was scheduled to coincide with cold weather, knowing this would have been important to engineers located elsewhere. But neither of the engineers in the meeting knew how the joint worked, so they didn’t know the significance of the information they had.
“The primary data gatherers and monitors in an organization are usually the least well trained”
This issue can occur when inexperienced team members don’t have the context or experience to notice that it’s significant that something didn’t happen, so, of course, they don’t report the absence.
Klein gives an example from wildland firefighting.
The U.S. Forest Service was trying to improve team problem detection. They had noticed that firefighters were sometimes getting into dangerous situations because there was no one tracking accumulating risks.
To try to prevent this, the USFS gave firefighters checklists of conditions to watch out for. This didn’t work, so more items were added. Eventually the lists grew to over 50 items across different checklists. On top of that, some of the items would always be present anytime firefighters were needed, since by the very nature of their work, firefighters were always in a dangerous situation of some sort.
This shows that checklists cannot substitute for building expertise in detecting the risky conditions.
“Unskilled primary data gatherers can not only miss critical cues, but also mask those cues”
This occurs with purely human teams and also with human-machine teams.
Here, Klein uses the example of a commuter airplane accident, where different levels of ice on each wing caused them to generate different amounts of lift. The autopilot compensated for this, which masked any cues that the pilots would have had.
Once the autopilot could no longer compensate (a decompensation failure), it was too late to recover.
The FAA responded by banning the use of autopilot in similar conditions.
“Rivalries between bureaucracies disrupt the exchange of data”
“Inconsistencies may not be detected if different team members hold the different data elements and do not compare notes”
Klein provides an example of an army battalion participating in a planning exercise.
“The communication of suspicion is difficult”
This problem can occur if someone perceives a problem but is forced to articulate it in an analytical way.
This occurs in all sorts of work; Klein provides examples from NICU nurses, firefighters, and astronauts.
NICU nurses had trouble getting the doctors in charge to listen to them. They’d perceive something wrong, but struggled to find the words to convey it.
Firefighters in the example faced a similar challenge: a seasoned officer requested permission to withdraw a crew from a roof that he thought was unsafe. But that day he had a supervisor who didn’t know him and as a result didn’t trust his judgement. The officer ended up physically finding the supervisor on scene instead of continuing to converse by radio, and was eventually able to convince him.
Phase of sensemaking
“Multiple patterns make it difficult to converge on a common understanding”
Klein uses the example of the Pearl Harbor attack, where there were some signs that an attack was pending: code book burnings, radio changes, etc.
But there were also patterns that suggested the USSR would be the place attacked. Further, the harbor wasn’t deep enough for the torpedoes that the Japanese were known to use. But the US didn’t know that the torpedoes had been modified to overcome that limitation.
“The team members may not realize that common understanding has been lost”
There isn’t an easy way to make sure everyone has a shared understanding. Communicating and confirming every bit of data every step of the way is impossible and would prevent anything from ever getting done. Some assumptions are necessary.
He quotes Karl Weick:
“The more advanced the technology is thought to be, the more likely are people to discredit anything that does not come through it”
“Teams usually do not appoint a situation awareness specialist to keep track of the big picture”
In various forms of incident response, including in software, this tends to be the role of the Incident Commander.
But for large organizations during regular work cadences, this is especially true. Teams can be competing against one another or have some sort of rivalry that prevents anyone from seeing the big picture.
“Team structure can reduce the level of expertise available to detect problems”
Continuing the idea of having inexperienced team members do all the data gathering: this results in the more experienced team members not being directly exposed to the situation and having to use secondhand information.
“Blunting and repression cascade through a team, strengthening its fixation on an inaccurate situation assessment”
Similar to individuals, the pressures of production can create a culture where problems are ignored in favor of just pressing on.
Phase of action
Most teams have some amount of inertia. They are structured and act in such a way that they can “resist the distractions of calls for problem identification and diagnosis.”
Klein again turns to wildland firefighters, citing a situation where a team would not change its response despite weather and fire conditions changing.
“Trouble assessing the credibility of the different members”
Even when a team does detect a problem, if the person or system who detects it is considered to have low credibility, the rest of the team will decide whether they are going to support or prevent further diagnosis or investigation.
Klein uses the example of a fire in a Philadelphia office building in 1991. The fire alarm went off, which makes this seem like a situation that should have been simple, but of course there were other factors at play, namely that the cost, perceived or otherwise, of reporting a false alarm was considered very high.
At the time of the fire, the building was occupied only by security and cleaning crews. Some oil soaked rags began to burn, setting off the fire alarm.
The security team in the lobby got the alarm, but may not have had a set procedure for how to react. Further, for alarms that involved multiple stories, typically someone would go to investigate themselves prior to calling the fire department, due to a large number of false alarms. It’s possible that the guards didn’t know that there were only a small number of detectors in the building, making it less likely that an alarm would be triggered accidentally.
Upon investigating the 22nd floor, a guard became stuck in the elevator when smoke clouded the sensor, preventing the door from closing. Fortunately, they were able to radio for help and have the elevator manually recalled.
At the same time, someone passing by had seen the fire, actually seen the flames coming out the windows of the building, and went to a payphone to call it in. Emergency services dispatch, expecting a call from the security team had there been a fire, did not immediately send a crew. Apparently, false reports from payphones were common and were not taken seriously unless they could be corroborated by another source.
In this example we can see the credibility of both humans and systems coming into play. This is something I’ve seen happen a lot in software teams. Most of us have experienced an issue where we’ll believe one system but not another, or even go so far as to discount some information because it doesn’t come from the system we expect.
If this procedure with the office building and the fire sounds strange to you, I can assure you that, at least in relatively recent years in some cities, this is still the procedure. In my experience, it is very common.
- Barriers to problem detection in teams are different from those of individuals and are emergent properties of the dynamics of teamwork.
- Most of the barriers occur either because of the costs of coordination or because of the need to reassemble previously decomposed tasks.
- The shape of the organization and its competition for resources can affect how teams work together, or don’t.
- Inexperienced team members, human or machine, can mask signals if they’re expected to do all or most of the data gathering.
- Rarely do organizations, especially large ones, have someone specifically designated to see the big picture.
- Team structures often mean that inexperienced members are doing the information gathering, causing more experienced members to rely on secondhand info, preventing them from forming their own mental models about what’s going on.
- Teams and individuals evaluate the credibility of a source, human or otherwise, to decide whether to proceed with investigation.
Subscribe to Resilience Roundup