The FAA outage: On public incident reports and seeking second stories

By now, you’ve likely heard about the outage of the Notice to Air Missions (NOTAM) system, which is used to deliver messages to pilots, and the resulting US-wide ground stop. If you’ve been following along, you’ve read a few different accounts and “explanations” as the story developed.

I’ve gotten lots of requests to write about this (and other recent aviation incidents). I typically don’t write about an incident until an in-depth investigation has taken place, but given the number of requests I wanted to address this one.

Here we are a bit later, and it may feel like the story has fully developed, that now we have all the facts and understand what really happened.

Here’s the thing about public “incident reports,” especially for large, very public incidents like this one: it’s much more likely that your current understanding is not much more accurate than it was before. It’s also very likely that what you may be reading as an “incident report” isn’t one.

As Richard Cook explained in Being Bumpable, we should look beyond the first story to the second story, the one with more detail on how the different factors interacted and how that looked to those involved in the event.

Analysis of decision making in complex domains often assumes a sharp distinction between narrow technical features and other organizational and institutional issues. In action, however, decision making by experts…engages organizational and technical issues together and simultaneously

Even if all the factors that have been reported are 100% accurate, we’re still missing a lot of the context. So for these reasons, I’m going to say (and thus speculate) a bit less than you might imagine or see elsewhere.

I will say (hopefully as a reminder) that it’s important to focus on the system as a whole, not one individual, and that we have the benefit (and obstacle) of hindsight. It’s very easy for us to connect some dots and think such a view or understanding would have been obvious, but that’s not how the event played out for those inside it.

One thing we often miss (whether inside an organization or out) is context. Different parts or contributors are going to seem surprising and/or important. Some will seem important purely because they are surprising.

When we see some surprising part of the system, that learning about the way things actually work is important, but that doesn’t necessarily make it an important factor in the event. This is part of why it’s important to have multiple, diverse viewpoints to help understand the event.

For example, there have been efforts in the last few years to change or modernize the NOTAM system that have failed. It had previously been described as running on “failing vintage hardware” (PDF). That may be surprising, but time will tell if and how much of a factor it may have been in this incident.

That said, in the wake of incidents a window often opens for changes that were previously deemed organizationally unacceptable. It’ll be interesting to see what changes are made in response and what new vulnerabilities those changes might create.

So as of the moment I’m writing this, here is what’s “understood”:

  • This incident was triggered by an accidental file deletion.
  • The work being performed was to address an issue synchronizing a database with its backup.
  • The work was being performed by contractors.
  • Those who were involved in the incident have had their access suspended during the investigation.

Here are some questions I would still be curious to learn the answers to:

  • Was the file operation something routine or something rare?
  • Was it seen as dangerous?
    • If so, by whom?
  • If this wasn’t something that felt routine or normal, how do operators learn about the options they have to manipulate the system?
  • Do they have an opportunity to explore or practice these things?
  • Have other similar events or near misses taken place (that perhaps weren’t written about)?
  • How was this detected and by whom first? What tools or processes did they use to help make sense of the situation?
  • Specifically what makes this situation difficult?

These are just a small sample of the sorts of questions I have, and ones I imagine many readers have as well.

So how can we learn from public incident write-ups? That depends a lot on what type of write-up it is. As John Allspaw points out, write-ups serve different purposes and are created for different audiences. We are just one audience, and our purpose (learning from the event) is quite different from many of the other purposes that write-ups and reviews are created for.

You may recall that the notion of error, though usually unproductive for learning from an incident, has other utility. So our first step is to try to determine the audience and purpose of what we’re reading.

Clues here include:

  • Time – how long has it been since the event? As a general rule, the faster after an event that an account comes forward, the less accurate it is likely to be and the more likely it is to be a first story.
  • Artifacts – does the report include anything that might have been used internally to understand the issue? This includes things like graphs, charts, logs, etc… The more of these you see the more likely it is that you can extract something meaningful.
  • Jargon – does the write-up preserve and explain internal codenames or expert jargon that was surely used during the incident? This is understandably one of the rarer items, but if present it provides a strong indicator that you’re getting (at least somewhat of) a peek at how things unfolded.
  • Narrative – does the write-up tell a story that acknowledges and addresses what it was like for the experts to be “inside the tunnel” of the incident or does it recount a simple connecting of the dots?

If the write-up fails these tests, reading it from the perspective of wanting to learn from the incident itself is unlikely to be fruitful. This can be the case for things like regulatory required notices, apologies to customers/shareholders, etc… That doesn’t mean that we can’t extract some useful information though. When faced with these we can ask questions similar to those I listed above (though we may get no answers):

  • Does this seem to resemble other incidents? Either those that we know about or those that seem to be alluded to?
  • Does it seem like some topics or areas are explicitly avoided or glossed over?
  • If it includes additional artifacts beyond text, do they appear to be produced just for this public report or were they perhaps used internally?

If you’re looking for examples of either type and want to see more public-facing incident reports, I strongly recommend you check out the Verica Open Incident Database and their recently released report.

Takeaways

  • The NOTAM outage that led to a nationwide ground stop of air traffic is still not fully understood.
  • As more investigation takes place, more details and accounts are likely to emerge, some of which will conflict with what we thought we knew, and accounts may even conflict with each other.
  • At this stage, none of the write-ups about this event are really “incident reports” that contain enough depth and detail that we can learn from.
  • Most public accounts of incidents serve different purposes and thus have different information in them than internal incident reports.
    • This means that the benefit we can extract from them from a learning perspective is potentially limited.
  • Many (or even most) public write-ups are going to be first stories, simplified “explanations” that obscure much of the detail and reality of the work. In order to effectively learn from an incident, we need to seek the second story, the story that has more detail about the event and the context it took place in.
    • With public incident write-ups, we may never get access to this second story.
