The Role of Software in Spacecraft Accidents

This is a paper by Nancy Leveson about a number of spaceflight losses from the 1990s. She goes over the loss of the Ariane 5 launcher, the Mars Climate Orbiter, the Mars Polar Lander, the loss of a Milstar satellite, and the loss of contact with SOHO (the Solar and Heliospheric Observatory).

I won’t cover the specifics of each accident here, as I don’t believe they provide much additional illumination for someone outside aerospace (or at least they didn’t for me). Suffice it to say that all but one resulted in a complete loss of the craft.

As the title suggests, she goes through how each of these losses involved software. This is not an exercise in blaming software, but a look at the systemic problems and multiple contributing causes that accompanied the increasing use of software in these missions. Leveson’s ideas and analyses are drawn from her own experience and the publicly available accident reports. Because of this, some of the causes must be inferred or simply taken at face value. That said, Leveson highlights from the very beginning that she is in the business of identifying systemic causes, not just narrow technical ones.

She gives an overview of the accidents, which is interesting background, but I felt there was much to be gained even without knowing many of the details of the missions.

Leveson reinforces many ideas that we’ve covered before, including that for complex systems, in this case spacecraft, safety is an emergent property of the whole system, not a component-level problem.

One thing I drew from this was how I might want an accident report to read. Too often these reports stop at either human error or some cursory contributor, offering that as an explanation without digging further. For example, in a few places for some missions the report says that certain forms of software testing didn’t occur, but there’s no word at all on why, or whether it was even investigated. Additionally, many of these accident reports call for more diligence or accountability, which are nice ideas but are not really actions in and of themselves.

There are times when Leveson seems to take advantage of the benefit of hindsight, suggesting courses of action that make sense now that we know the outcomes. Even so, many of the courses of action she suggests seem reasonable to me (as someone who isn’t in aerospace) regardless. Further, the limits of the individual accident reports are very clear here: we’re left with only the information and explanations they provide. So often I found myself saying, “OK…but why?”

One of the lessons that comes through clearly is that software cannot be treated like hardware. The same approaches that helped make hardware components safe not only fail to fit software, but may actually impede software safety.

An example she provides is redundancy. In hardware, redundancy can be an obviously good idea, because the likelihood of two components failing at the same time or in the same way is especially low, but we cannot say the same about software. Often, redundancy adds complexity without adding much value to resilience or safety.
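To make that concrete, here is a minimal sketch of my own (not code from the paper): two “redundant” channels running identical software will fail identically on the same bad input, loosely echoing the way Ariane 5’s paired inertial reference units both tripped on the same out-of-range value. The function names and numbers below are purely illustrative.

```python
# Illustrative sketch only: two "redundant" channels running identical software
# fail identically on the same bad input, so duplication adds no protection
# against a shared design error. Names and values are hypothetical.

def convert_to_16_bit(value: float) -> int:
    """Stand-in for a conversion routine with a limited validated input range."""
    if not -32768 <= value <= 32767:
        raise OverflowError(f"{value} is outside the 16-bit signed range")
    return int(value)

def run_channel(name: str, sensor_value: float) -> None:
    try:
        convert_to_16_bit(sensor_value)
        print(f"{name}: ok")
    except OverflowError as err:
        print(f"{name}: failed ({err})")

# Both channels see the same out-of-range input, so both fail the same way.
for channel in ("channel-A", "channel-B"):
    run_channel(channel, 65_000.0)
```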

So what does Leveson recommend to improve safety?

Throughout the paper she continually advocates for what seems to me to be a DevOps approach to software safety, at one point even citing the problem of a world where development and operations are very far apart.

Her recommendations include things like improved communication: not just that channels exist, but that they are available to others. Also, testing must serve the purpose of validating real-world usage, not be simply a hurdle to jump, a box to check, or papers to sign off on. Testing should also simulate suboptimal conditions where possible.

Further, requirements should specify what the software must not do, not just list the things it needs to accomplish. Without knowing what it must not do, it can be difficult or impossible for a development team to ensure the behavior is safe.
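One way to picture this is my own sketch below (not from the paper): a “must not” requirement written as an explicit, testable check rather than something left implied. The scenario and threshold are hypothetical, loosely inspired by a premature-engine-shutdown style hazard.

```python
# Illustrative sketch only: encoding a "must not" requirement as an explicit,
# testable constraint. The scenario and threshold are hypothetical.

TOUCHDOWN_ALTITUDE_LIMIT_M = 40.0  # illustrative value, not a real spec

def engine_shutdown_allowed(touchdown_signal: bool, altitude_m: float) -> bool:
    """Requirement: the descent engine must NOT be shut down on a touchdown
    signal while the craft is still well above the surface (e.g. a sensor
    transient during leg deployment)."""
    return touchdown_signal and altitude_m <= TOUCHDOWN_ALTITUDE_LIMIT_M

# A transient touchdown signal at high altitude must not shut the engine down.
assert engine_shutdown_allowed(touchdown_signal=True, altitude_m=1500.0) is False
# A touchdown signal near the surface may.
assert engine_shutdown_allowed(touchdown_signal=True, altitude_m=0.5) is True
```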

Leveson explains that, really, none of the software truly failed. In each accident, the software executed, and in most cases it did so according to its specifications. But its behavior, in conjunction with other factors, led to disastrous outcomes.

Many of the disastrous outcomes were attributable to “interactive complexity and tight coupling”.

Leveson points out that often, with software, we’re now able to build systems that have a level of complexity that surpasses our ability to control or understand.

As we’ve discussed before and as Leveson reinforces, in the case of “system accidents,” the system could not be made safer by addressing individual components.

On top of testing, Leveson also discusses problems she sees in management of these projects. In some cases, different teams may have observed a sign that something was amiss, but had no way of alerting the right person. In one case, some abnormalities were noticed before launch, but the only communication channels that were used (or perhaps available?) involved calling an individual who happened to be away and leaving a voicemail.

For these sorts of failures, Leveson suggests that there should be clearly defined areas of responsibility. Otherwise “diffusion of responsibility” can occur, where no one person or team is truly responsible for taking care of something.

There are cases where software was reused across very different hardware and mission parameters. Without someone responsible for making sure the hardware and software actually fit together, and without an expert who might know the problems with reusing it, “consistency” seems to have been accepted as a justification for reuse despite the mismatch.
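As a rough sketch of my own (not something the paper prescribes), reused software could declare the envelope it was validated against, so a mismatch with a new vehicle or mission profile fails loudly at integration time rather than in flight. All the names and numbers here are hypothetical.

```python
# Illustrative sketch only: a reused module declaring the envelope it was
# validated against, so a mismatch with a new mission profile is caught
# during integration instead of in flight. Names and numbers are made up.

from dataclasses import dataclass

@dataclass(frozen=True)
class DesignEnvelope:
    max_horizontal_velocity: float  # same units assumed on both sides
    max_acceleration: float

# Envelope the legacy software was originally validated for (made-up numbers).
LEGACY_ENVELOPE = DesignEnvelope(max_horizontal_velocity=300.0, max_acceleration=40.0)

def check_reuse_fit(new_mission: DesignEnvelope) -> list:
    """Return a list of ways the new mission exceeds the validated envelope."""
    problems = []
    if new_mission.max_horizontal_velocity > LEGACY_ENVELOPE.max_horizontal_velocity:
        problems.append("horizontal velocity exceeds validated range")
    if new_mission.max_acceleration > LEGACY_ENVELOPE.max_acceleration:
        problems.append("acceleration exceeds validated range")
    return problems

# A faster vehicle reusing the old software should fail this check loudly.
print(check_reuse_fit(DesignEnvelope(max_horizontal_velocity=500.0, max_acceleration=60.0)))
```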

Additionally, some systems, including those Leveson covers, are built to such a level of complexity that all possible interactions among components cannot be anticipated and thus cannot be preemptively guarded against.

“software itself is pure design and thus all errors are design errors and appropriate techniques for handling design errors must be used.”

Takeaways

  • Software can often contribute to a catastrophic outcome without truly “failing”
  • Knowing what software must not do is critical to designing safer software
  • Redundancy does not necessarily make software safer, and in some cases can have the opposite effect.
  • Clearly defining roles can help prevent key responsibilities from falling through the cracks.
  • Software cannot be treated like hardware in order to gain safety
  • Software reuse, simply for the sake of reuse, or for “consistency,” can contribute to errors when the software doesn’t match the situation and/or hardware.