Resilience Roundup - Issue #6

Continuing our discussion of human error from last week, we’ll go back in time a bit to 1990 and hear from Jens Rasmussen, and then hear from C. Michael Holloway, formerly of the NASA Langley Formal Methods team, about how software can learn from other disciplines.

Total time to read: ~10 minutes


Human error and the problem of causality in analysis of accidents

Here Jens Rasmussen starts by giving some background on the ways in which “human error” has become an acceptable answer to the question “What went wrong?” and then explains why it’s not a very good or useful answer.

He explains that previously, industrial systems were fairly easy to model accurately by various groups of professionals. This meant that hazards could be proven empirically by pointing to some bit of math or engineering.

In 1990, he was pointing out that we’re no longer in that realm:

The hazards involved in operation of large-scale systems lead to reliability and safety requirements that cannot be proven empirically.

He shows that even the concept of what an event is can be difficult to pin down. If we include a lot of detail, then that “event” is less likely to occur again, as future events are unlikely to have all the exact same parts.

He goes on to say that typically we investigate accidents by working backwards through a sort of causal tree. This thing happened because of the previous thing, and so on, as we continue to work backward and branch.

Explaining accidents through cause will actually change how you analyze the accident in question. Similar to what we spoke of last week, Rasmussen explains that typically one will examine the series of changes and will stop when something is found that matches what we expect to find.

What we use as the “stop-rule” is very rarely well defined and is usually different for each person and their purpose in looking at the accident.

He explains that how much we break down the details in an accident entirely depends on who we expect to read the analysis and what the analyst themselves expects to find. He uses the example of the statement “The short-circuit caused the fire in the house” to show how varied the explanations and analysis of this could be.

If you know enough about electrical systems to wonder why there wasn’t a breaker, then this answer isn’t really useful or acceptable to you. If you know a lot about building materials and wonder what materials were flammable, then you’d probably ask different questions and seek different information.

He also points out that analysis of accidents can be made for different audiences and for different reasons, for example to:

  • Place blame/responsibility

  • Explain the accident

  • Improve the system

If you’ve ever done a postmortem because policy said you had to, or because someone demanded answers, this is probably very familiar to you.

So this can further influence what your “stop-rule” is. If you’re looking to place blame, you likely stop when you find a human doing something that you don’t expect. If you’re looking to improve the system, you likely stop when you find something you know there is a “cure” for (and perhaps even call that the root cause).

Further, he explains, we can’t really define human error very well, especially in the face of the resource limitations that exist in unexpected situations and the way humans learn and adapt. And what about times when we rely on the “usual cues” while working, but there is some change or fault in the system? Now our “usual cues” no longer work.

A better lens, he suggests, is to realize that “it can be expected that no more information will be used than is necessary for discrimination among the perceived alternatives…in the particular situation”. So when we’re outside of those previously experienced situations and make a choice, we’re testing a sort of hypothesis. If it turns out negatively, then we typically label that “error”.

the smoothness and speed characterizing high professional skill, together with a large repertoire of heuristic know-how rules, will evolve through an adaptation process in which ‘errors’ are unavoidable side effects of the exploration of the boundaries of the envelope of acceptable performance.

So often, we’ll then hear a solution that is probably something like: Well, let’s just make those boundaries as far away as we possibly can. Then things will be “safe,” right?

Rasmussen is already prepared for this. He points out that: “it appears to be essential that actors maintain ‘contact’ with hazards in such a way that they will be familiar with the boundary to loss of control and will learn to recover”. So if we make the boundaries really far away, it’ll be hard to sense where those boundaries are. And then when they are really far away, crossing them is more likely to be much more catastrophic and permanent.

If this seems strange, think about driving and how you learned to drive and improved your skill. You likely got time to experiment, in ways and situations that were fixable. Probably a parking lot, where you learned how much you could push the gas, how hard you should push the brake, how sharply you could turn and still retain control. You were exploring the boundaries.

Later on, you got better and were able to perceive more things and take in more inputs from the road: its condition, weather effects, other drivers. With those you were able to fine-tune your behavior. As this happened, it’s likely that you had some trouble: maybe you hit a parking stop, lost traction, had a close call, or maybe something more serious.

But this helped you learn. Rasmussen explains this too:

Some errors, therefore, have a function in maintaining a skill at its proper level, and they cannot be considered a separate category of events in a causal chain because they are integral parts of a feed-back loop.

He closes by suggesting that in light of these things, we’ll likely have to reconsider the things that we attribute to operator error and think about things on another level beyond just making humans more reliable.

A lot of the software industry isn’t great at this, falling back to that mode of thinking we talked about previously: “if only we could get rid of those pesky, unreliable humans.”

This paper is also interesting because it has a discussion section at the end from various members of The Royal Society that published it. Rasmussen then engages in a discussion with people from a hospital and a university discussing manufacturing, risk assessment, and patient care.

From Bridges and Rockets, Lessons for Software Systems

C. Michael Holloway directly relates some learnings that we in software can take from other disasters. How could I pass up a paper whose abstract says, “software engineers can learn lessons from failures in traditional engineering disciplines”?

Specifically, he relates learnings from the Tacoma Narrows Bridge failure and the Challenger disaster to software. He reviews each, goes over the lessons, and then covers some specific applications.

I won’t get into the specifics of the disasters so much here, nor does Holloway; you can read about that elsewhere.

First, he looks at the Tacoma Narrows Bridge, a suspension bridge completed in 1940 as the alternative to taking a ferry across Puget Sound. The bridge was designed by one of the world’s top authorities on bridge design. Mostly what you need to know is that the designer relied on “deflection theory” to justify using shallow girders instead of deep trusses to stiffen the bridge against the wind. This led to his design being picked over the Washington Department of Highways’ design, at a cost of about $5,000,000 less.

Since it was expected to have light traffic, it was only a two-lane bridge, which was very narrow compared to others at the time, so it ended up being a very flexible bridge. And as I’m sure you can imagine, flexible is not really the adjective you want describing your bridge. It was said to move so much that people were getting seasick crossing it, and it was nicknamed “Galloping Gertie”. Eventually, 40 mile an hour winds were able to break the cables and allow it to start twisting. The movement got so bad that the ropes tore and the deck broke, and not long after the rest of the deck fell into the water. Fortunately, no one was killed, though sadly a dog was lost.

Both of the accidents that Holloway talks about were investigated. In the case of the bridge, the Federal Works Agency picked three engineers to produce a report (one, Theodore von Kármán, would go on to be one of the founders of JPL).

The report stated:

“the Tacoma Narrows Bridge was well designed and built to resist safely all static forces, including wind, usually considered in the design of similar structures. …It was not realized that the aerodynamic forces which had proven disastrous in the past to much lighter and shorter flexible suspension bridges would affect a structure of such magnitude as the Tacoma Narrows Bridge”

Holloway points out that they had indeed followed the modern techniques, but it happened that these techniques were flawed. One person, however, did see this coming: Theodore L. Condron. He was an engineer who was advising the financing company on whether or not to approve the loan needed to construct the bridge. He was worried about the narrowness of the bridge, which is what ultimately caused the problem. He was so worried about it that he compared it to every other suspension bridge that had been built recently and pointed out that its width-to-length ratio was much narrower than that of any other bridge that had successfully been built.

He went to Berkeley to investigate some models and was essentially told that it would be okay. Of course, we now know that they did not account for deflection in both directions, but because he couldn’t disprove the deflection theory and couldn’t find evidence to support his concerns, he eventually gave in. Even though he gave in, he still suggested widening the bridge to 52 feet, a change that may have prevented the collapse.

Holloway goes on to draw some relevant lessons for us:

Lesson 1: Relying heavily on theory, without adequate confirming data, is unwise.

This bridge was the first actual test of deflection theory.

Lesson 2: Going well beyond existing experience is unwise.

He suggests that incremental steps should have been taken instead, specifically in narrowing the width. Small change size being a good thing is likely something familiar to you.

Lesson 3: In studying existing experience, more than just the recent past should be included.

It turns out that a professor who was studying suspension bridges, and who narrowly escaped the Tacoma Narrows Bridge as it collapsed, looked back on other bridge disasters that involved wind. Nine of the 10 that he found occurred before 1865.

University of Washington’s Professor Farquharson would later write that it “came as such a shock to the engineering profession that it is surprising to most to learn that failure under the action of the wind was not without precedent.”

Lesson 4: When safety is concerned, misgivings on the part of competent engineers should be given strong consideration, even if the engineers can not fully substantiate these misgivings.

This is supported by the fact that Condron’s concerns, that the bridge design wouldn’t work, turned out to be correct.

Next, Holloway goes on to talk about Challenger. Again, I’m not going to talk too much about the details of the disaster here, since they are documented very well in lots of other places. But as a reminder, the Challenger disaster occurred on January 28th, 1986, when, 73 seconds after liftoff, the shuttle exploded during its 10th flight. This disaster was also investigated, of course.

The report “concluded that the cause of the Challenger accident was the failure of the pressure seal in the aft field joint of the right Solid Rocket Motor.” Holloway explains the failure was due to a faulty design that was unacceptably sensitive to things like temperature, physical dimension, and reusability. Holloway tells us that this reinforces 3 of the 4 lessons that we learned from the bridge and adds a new one.

Lesson 2: Going well beyond existing experience is unwise. The SRB joints, even though they were initially based on a solid design, deviated very far from their initial basis.

Lesson 3: In studying existing experience, more than just the recent past should be included.

Holloway specifically compares the attitudes around Challenger to those around the Apollo 1 fire: “The attitude of great confidence in accomplishments and the concern about meeting the planned schedules are especially apparent.”

Lesson 4: When safety is concerned, misgivings on the part of competent engineers should be given strong consideration, even if the engineers can not fully substantiate these misgivings. This applied strongly here: the night before the launch, the engineers at the company that made some of the parts were against launching. And again, these engineers who were against the launch were not able to prove that it was unsafe. Holloway points out that the burden of proof on those who were for and against launch should not be equal.

Finally, he tells us that Challenger teaches us a new lesson, Lesson 5: Relying heavily on data, without an adequate explanatory theory, is unwise.

He specifically cites joints in the solid rocket booster that were originally thought to become tighter during launch, but during tests, for the first few milliseconds right after ignition, they actually moved away from each other. Several tests were done and the data apparently eventually satisfied everyone that this was okay, but there was no real understanding of why these parts behaved differently from what the designers had expected.

To close, we’re left with three applications that he feels are very specific to software systems.

Application 1: The verification and validation of a software system should not be based on a single method, or a single style of methods. From lessons 1 and 5. “Every testing method has limitations; every formal method has limitations, too. Testers and formalists should be cooperating friends, not competing foes.”

Application 2: The tendency to embrace the latest fad should be overcome. From lesson 2. I’m looking at you, JavaScript (though this applies across languages).

Application 3: The introduction of software control into safety-critical systems should be done cautiously. From lessons 2 and 4. He’s not against using software in safety-critical systems and actually finds that it can be successful, but suggests caution, warning: “Software can be used in safety-critical systems. But its use ought to be guided by successful past experiences, and not by ambitious future dreams.”