Software Challenges in Achieving Space Safety

Welcome to the first issue of the new Resilience Roundup! If you haven’t already, you’ll see a few other emails that’ll help you log in to the new site. This is where you’ll be able to get access to the new stuff (some still in development), like the podcast and a group discussion call for any questions you might have.


Software Challenges in Achieving Space Safety

This is a paper by Nancy Leveson that follows up on her earlier look at spacecraft and software, The role of software in spacecraft accidents. If you haven’t read the first one, I’d suggest you check it out first.

Hardware reliability techniques (such as redundancy) don’t work for systems that are largely made up of software. Leveson (using the same spacecraft accidents as before) cites Ariane 5, where problems were addressed with redundancy, but that redundancy introduced complexity, and an issue in switching over to the redundant system contributed to the failure.

Another approach borrowed from hardware that is sometimes used in software is having different teams design the same thing while working separately. The idea here is that it’s unlikely both teams will create a component with the same vulnerabilities. But with software this independence doesn’t hold up: software designed from the same requirements by different people is likely to have common failure modes or common design errors.
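To make that concrete, here’s a quick sketch (the requirement and numbers are invented, not from the paper) of two implementations written separately from the same underspecified requirement. Because the gap is in the requirement itself, both versions fail the same way, and comparing their outputs doesn’t catch anything:

```python
# Invented requirement: "convert the velocity reading to a 16-bit signed int."
# It never says what should happen when the value doesn't fit.

def to_int16_team_a(value: float) -> int:
    # Team A: truncate to 16 bits, as the requirement seems to imply.
    return ((int(value) + 2**15) % 2**16) - 2**15

def to_int16_team_b(value: float) -> int:
    # Team B, written independently, makes the same unstated assumption
    # that the reading always fits in 16 bits.
    v = int(value) & 0xFFFF
    return v - 0x10000 if v >= 0x8000 else v

# An out-of-range reading corrupts both "independent" versions identically,
# so voting between them adds nothing.
print(to_int16_team_a(40000.0), to_int16_team_b(40000.0))  # both print -25536
```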

Leveson echoes the idea that in simple systems, you can decompose them, check their individual components for safety and be fairly sure the composite system is safe.

But in complex systems, where not all interactions can be predicted, this isn’t the case. Leveson calls the result a “component interaction accident,” citing the Mars Polar Lander as an example.

Each piece of the system did what it was “supposed to” do. The software spec said that a certain type of vibration should be interpreted as the lander touching down. The deployment of the landing legs produced exactly such a signal, which led to the engine shutting down early and the lander impacting the surface.
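Here’s a minimal sketch of that interaction (in Python, with invented names and numbers, not the actual flight code). Each function is “correct” against its own spec; the hazard only appears when you put them together:

```python
def touchdown_sensed(leg_vibration_detected: bool) -> bool:
    # Spec: "interpret this type of vibration as the lander touching down."
    # It says nothing about the leg-deployment transient, which produces
    # the same signature.
    return leg_vibration_detected

def descent_engine_command(altitude_m: float, leg_vibration_detected: bool) -> str:
    # Spec: "shut the engine down once touchdown is sensed."
    if touchdown_sensed(leg_vibration_detected):
        return "ENGINE_OFF"
    return "ENGINE_ON"

# The legs deploy while still well above the surface; the jolt looks like
# touchdown, the engine cuts out, and the lander falls the rest of the way.
print(descent_engine_command(altitude_m=40.0, leg_vibration_detected=True))  # ENGINE_OFF
```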

Sometimes it’s suggested that a complete model be made of the software and its states and then analyzed for safety, but that doesn’t really work either. Making a complete model of how software functions is rarely feasible. For example, Leveson built a model for the FAA of the Traffic Collision Avoidance System (TCAS) software specification and estimated that it had 10^49 possible states it could be in. And that software is less complex than what’s in most spacecraft.
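For a sense of scale, here’s a back-of-envelope sketch (the variable count is made up, picked only so the total lands near the figure Leveson reports):

```python
# Why "just model every state" doesn't scale: states multiply, they don't add.

n_state_vars = 163        # hypothetical count of independent state variables
values_per_var = 2        # even if each one is only a boolean

total_states = values_per_var ** n_state_vars
print(f"{total_states:.2e} possible states")   # ~1.17e+49
```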

Specifications and Justifications

In the Mars Polar Lander case, Leveson points out that the docs didn’t specify the failure modes that a given requirement was protecting against. In the same vein as specifying what the software must not do, giving the people writing the code a clearer picture of why something is being asked for is pretty much always a good idea.
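As a hypothetical example (not from the paper or the actual flight software), a requirement could carry its justification right next to the behavior it constrains, including what must not happen:

```python
MIN_ENGINE_CUTOFF_ALTITUDE_M = 12.0   # invented threshold, for illustration only

def should_cut_engine(touchdown_signal: bool, altitude_m: float) -> bool:
    """Cut the descent engine only when touchdown is sensed near the surface.

    Must NOT: shut the engine down on a touchdown signal while the lander
    is still well above the surface.
    Why: spurious touchdown signals (e.g. the leg-deployment transient)
    would otherwise cause a premature shutdown and a hard impact.
    """
    return touchdown_signal and altitude_m <= MIN_ENGINE_CUTOFF_ALTITUDE_M
```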

Many of the suggestions are around specs and process. For example, the author mentions the Ariane 501 accident report’s recommendation to treat documentation of design justifications as being as important as the code. The justifications should also be kept consistent with the code, that is, changed at the same time. I think this is a great idea, but in practice it can be really hard to implement. Code can drift: what “justification” do you need for fixing a bug, for example? Eventually small changes add up. I’m not saying you shouldn’t try this out, I just think it’s not as straightforward as Leveson or the accident report seem to imply.

Solutions

Leveson offers up some ideas for solutions as well as processes to move forward. If safety analysis is used during the early parts of the design phase, the cost doesn’t rise. That’s not the case if safety is an afterthought; it has to be designed in concert with many of the architectural decisions.

Leveson cites estimates that 70-80% of the decisions that affect safety are made in the early stages of a project’s design.

Leveson suggests STAMP (Systems-Theoretic Accident Model and Process) as a solution. While I don’t agree that it’ll solve all the problems I’ve laid out, I do think developing new (or by now, less new) processes and approaches is a good idea.

In the STAMP view (based on control theory and systems theory), accidents are seen as a control problem, a lack of constraint enforcement. Thus, safety is created by enforcing constraints on the behavior of components.

For example, in the Mars Polar Lander case, you’d say that the safety constraint that was violated was that the spacecraft must not impact the planet above a certain velocity.
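In code terms, that framing looks something like the sketch below (invented names and limits): the constraint is stated explicitly, and the controller is what’s responsible for enforcing it, rather than safety being a property of any single component.

```python
MAX_IMPACT_VELOCITY_MPS = 2.4    # invented value for the constraint

def descent_controller(altitude_m: float, velocity_mps: float,
                       touchdown_signal: bool) -> str:
    # The controller enforces the constraint "do not impact the surface above
    # MAX_IMPACT_VELOCITY_MPS": it may only cut the engine when doing so can
    # no longer violate that constraint.
    near_surface = altitude_m <= 0.5
    slow_enough = velocity_mps <= MAX_IMPACT_VELOCITY_MPS
    if touchdown_signal and near_surface and slow_enough:
        return "ENGINE_OFF"
    return "ENGINE_ON"
```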

Hazard analysis, then, is based on control diagrams. As the system is developed, STPA (System-Theoretic Process Analysis) is applied.

In this view, there are only two general ways that safety constraints can fail to be enforced:

  1. The controller issues inadequate or inappropriate control actions, including inadequate handling of failures or disturbances in the physical process, or
  2. The control actions are inadequately executed by the actuators.

You can, of course, then refine those points down to something like:

  1. The controller issues an unsafe control action.
  2. The controller does not issue a control action needed to enforce safety.
  3. The controller issues the necessary control action but not at the right time (too early or too late).
  4. The control action is stopped too soon or applied too long.

This might help with understanding a control diagram and with implementing some of the desired behavior in code, but I’m certainly not convinced that this is going to suddenly make software safe.
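That said, the four categories do make a handy checklist when you’re looking at a specific control action. Here’s a rough sketch (my framing, not the paper’s) of walking a “shut down the descent engine” action through them:

```python
from enum import Enum, auto

class UnsafeControlAction(Enum):
    UNSAFE_ACTION_PROVIDED = auto()    # 1. an unsafe control action is issued
    REQUIRED_ACTION_MISSING = auto()   # 2. a needed control action is never issued
    WRONG_TIMING = auto()              # 3. right action, wrong time (too early/late)
    STOPPED_TOO_SOON_OR_LATE = auto()  # 4. action stopped too soon or applied too long

# Reviewing the "shut down the descent engine" control action against each category.
review = {
    UnsafeControlAction.UNSAFE_ACTION_PROVIDED:
        "Shutdown issued while the lander is still well above the surface.",
    UnsafeControlAction.REQUIRED_ACTION_MISSING:
        "Engine never shut down after a genuine touchdown.",
    UnsafeControlAction.WRONG_TIMING:
        "Shutdown issued before touchdown is actually confirmed.",
    UnsafeControlAction.STOPPED_TOO_SOON_OR_LATE:
        "Braking thrust ended before the lander had slowed enough.",
}

for category, hazard in review.items():
    print(f"{category.name}: {hazard}")
```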

The upside of this form of analysis is that it at least considers more than individual system components and reflects the reality that system-level interactions can lead to failure. It has also been adopted by the US Missile Defense Agency, where it helped reveal more hazards than other methods. Again, this is a step in the right direction, but it’s not the be-all and end-all.

Takeaways

  • Traditional hardware engineering approaches don’t work for systems that have large software components.
  • Specifications about the behavior of software should specify what the software must not do, along with what it should.
  • Safety analysis should be brought into the design process as soon as possible.
  • Modeling all possible states of the software isn’t feasible in large software systems.
  • Other analysis methods, such as STAMP, may help with some of these issues by revealing further hazards.
