For this week, I have an article about on-call and how it’s done at NASA. Many of the conclusions here may not be that surprising to those who have been on-call for any length of time, but I think there is a lot to learn from how NASA makes the system work.
In “Shift Changes, Updates, and the On-Call Architecture in Space Shuttle Mission Control,” Emily Patterson and David Woods looked at how the shifts in NASA’s mission control synchronized with each other during a space shuttle mission, and how specific experts were brought in through an on-call type setup and brought up to speed.
This is pretty applicable to software, in that we often have the same sort of shift architecture, where experts will be brought in and expected to be brought up to speed very quickly.
The paper looks at how they function from two perspectives. One, since the mission control center is staffed 24/7 across three shifts, how is it that incoming controllers are able to update themselves with what’s going on when they return to a shift? Two, during abnormal events, how can the needed experts be brought in and brought up to speed quickly?
In the case of routine operations where it is time for a new controller to come on shift, an hour overlap is dedicated to this handoff. I think that this is important and something often lacking in software organizations. Here the organization clearly acknowledges that there is a significant cognitive and time burden in catching up with everything that has happened. That’s not something I see a lot in software teams and organizations, where there seems to be an underlying assumption that not a lot of time should be dedicated to it. Obviously this depends on the organization and can vary based on production pressure.
Once a shift change starts, updates go from the bottom up throughout the mission control organization. First the incoming and outgoing controllers sit next to each other and will have a conversation. Unlike many other conversations that happen in mission control, this conversation takes place face to face and is not on a voice loop (See more about voice loops in TK). At the same time, the “back room” staff are going through the same thing.
Once they are done with this, the incoming front room and back room controllers update each other on a voice loop. This allows them to make sure that they’ve understood the same things and have shared priorities. Because anyone can listen on this voice loop, including the outgoing staff, the outgoing staff provide an additional check here.
Following that, the controllers then are able to update the incoming flight director on a special voice loop providing an additional check.
I mentioned earlier that there is an hour window dedicated to this handoff. That doesn’t mean that the conversation takes an hour; it’s an hour for all processes to complete. So for example, the face-to-face conversation between the incoming and outgoing controllers may only be about 10 minutes. Additionally, this conversation is seen as very important, to such a degree that it is culturally unacceptable to interrupt it.
Before even beginning a conversation with the controller who is already there, the incoming controller will look through the logbook and available mission data logs. This helps them update their mental model and be better able to predict what subjects are important to talk about. This has several advantages, including lowering the cognitive burden of the outgoing controller.
I found this really interesting. In most organizations that I’ve encountered, handoffs are typically a high burden for the person with the experience, the outgoing person, and are often just an info dump for the incoming person.
In this setup, this is flipped. It is up to the incoming person to brush up on the current state of the mission and read the logs. But also, when the authors looked at who was asking questions and who was speaking for more of the time, they found that it was the incoming controller, the new person, who was doing the question asking. This allows more than just getting questions answered; it also signals to the person who is already there whether or not the incoming person is well calibrated to the current concerns. Further, this creates a space where miscalibration can be detected. For instance, if an incoming controller asks many questions that seem unimportant to the outgoing controller, it indicates a mismatch in where they believe they should focus. The questions also provided an opportunity to bring in other perspectives. An incoming controller might ask for more detail about what led to a decision being made or ask about other courses of action.
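Taken together, the handoff described above follows a fixed order: the incoming controller reviews the logs, then the face-to-face briefing happens, then the front and back rooms cross-check on a voice loop, and finally the flight director is updated. A software team wanting to mirror that order could track it with something like the sketch below. The step names, the `Handoff` class, and its methods are all hypothetical, invented for illustration rather than taken from the paper.

```python
# Hypothetical step names paraphrasing the order described in the article.
HANDOFF_STEPS = [
    "incoming reviews logbook and data",
    "face-to-face briefing with outgoing controller",
    "front room and back room sync on voice loop",
    "update incoming flight director",
]


class Handoff:
    """Tracks handoff steps and refuses to complete them out of order."""

    def __init__(self, steps=HANDOFF_STEPS):
        self.steps = list(steps)
        self.completed = 0  # number of steps finished so far

    def next_step(self):
        """Return the next pending step, or None when the handoff is done."""
        if self.completed < len(self.steps):
            return self.steps[self.completed]
        return None

    def complete(self, step):
        """Mark `step` as done; raise ValueError if it is not the next step."""
        if step != self.next_step():
            raise ValueError(f"expected {self.next_step()!r}, got {step!r}")
        self.completed += 1

    def is_finished(self):
        return self.completed == len(self.steps)


handoff = Handoff()
handoff.complete("incoming reviews logbook and data")
print(handoff.next_step())
```

The only design choice worth noting is that the log review comes first and cannot be skipped, which is the part of the NASA process that shifts the burden onto the incoming person.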
On average, the controllers did not engage in a briefing until about 20 minutes after that incoming controller had arrived. Additionally, when they did have a briefing it was usually relatively fast, about 10-15 minutes or so.
So what did they do with the rest of the time? They would come in, sit next to the person who they would be replacing on shift and begin to listen to the voice loops that were available. They would also look at the data that was on screen and read through previous logs and any other documentation that was available.
It’s the existence of these logs that further allows this system to work. A controller who was interviewed went as far as to say, “we couldn’t function without logs.” Without this historical record that can be studied, the burden on the outgoing controller would likely be much higher and the potential for things to be missed higher still. I think this is another lesson that we can learn in software. Oftentimes our logs are very noisy and seen only as a reference for debugging, not something that would be read and reviewed in normal cases. But in many teams, there is room to create a similar “flight log” of what is going on, one that humans can more easily read to get an idea of the general state of the team, mission, or system. In some cases, chat log scrollback is already serving this purpose.
When the authors examined what the controllers actually talked about, they found that the updates that were observed between the incoming and outgoing controllers were not about specific data points. Instead, they would discuss activities that occurred, decisions that were made, and their view or analysis of the data.
A lesson that we can learn from this is that this system seems to work only because of the effort put in during slower periods by the incoming controllers to develop shared understanding. If there was not purposeful effort made to create this shared understanding by which they could ask these different types of questions, then most of this system would likely break down.
The shared understanding allowed the updates to primarily focus on deviations from what was expected or from what was planned previously. How in-depth these briefings were varied based on how far events had deviated from the expected plan. Again, without the shared understanding this wouldn’t be possible.
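As a rough sketch of what such a “flight log” might look like for a software team, here is a small Python example. Everything in it is an assumption made for illustration: the `LogEntry` fields, the three entry kinds, the `shift_summary` helper, and the sample entries are hypothetical and do not come from the paper or from NASA’s actual logbooks.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum


class Kind(Enum):
    # Hypothetical categories for a human-readable team log.
    ACTIVITY = "activity"
    DECISION = "decision"
    ANALYSIS = "analysis"


@dataclass(frozen=True)
class LogEntry:
    at: datetime
    author: str
    kind: Kind
    note: str


def shift_summary(entries, since):
    """Return readable lines for entries at or after `since`, oldest first."""
    recent = sorted((e for e in entries if e.at >= since), key=lambda e: e.at)
    return [f"{e.at:%H:%M} [{e.kind.value}] {e.author}: {e.note}" for e in recent]


# Made-up entries, purely for illustration.
now = datetime(2024, 1, 1, 16, 0)
log = [
    LogEntry(now - timedelta(hours=10), "alex", Kind.ACTIVITY, "Rotated API certificates"),
    LogEntry(now - timedelta(hours=3), "sam", Kind.DECISION, "Postponed the database migration"),
    LogEntry(now - timedelta(hours=1), "sam", Kind.ANALYSIS, "Queue depth rising, likely a slow consumer"),
]

# What an incoming on-call engineer might read before asking any questions.
summary = shift_summary(log, since=now - timedelta(hours=8))
for line in summary:
    print(line)
```

The point of the sketch is that entries are written for people rather than for debuggers, so an incoming person can read the last shift in a minute or two before the handoff conversation starts.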
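To make the “focus on deviations” idea concrete for a software handoff, here is a minimal sketch that compares a plan both people already share against what actually happened and reports only the differences. The `deviations` function, the step names, and the statuses are hypothetical examples, not anything described in the paper.

```python
def deviations(planned, actual):
    """Return (step, planned_state, actual_state) for every step that differs.

    Steps missing from one side show up with None on that side.
    """
    # Keep the plan's order, then append anything that happened unplanned.
    steps = list(planned) + [s for s in actual if s not in planned]
    return [
        (step, planned.get(step), actual.get(step))
        for step in steps
        if planned.get(step) != actual.get(step)
    ]


# Made-up plan and outcome for one shift.
planned = {"deploy v2": "done", "db migration": "done", "cache warmup": "done"}
actual = {"deploy v2": "done", "db migration": "postponed", "hotfix rollout": "done"}

briefing = deviations(planned, actual)
for step, want, got in briefing:
    print(f"{step}: planned={want}, actual={got}")
```

Note that this only works if both people already know the plan, which is the shared understanding the paper emphasizes. The more entries the list contains, the longer the briefing would need to be.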
This points to what many of us have experienced during our on-call shifts: effective on-call does not mean sitting comfortably at home hoping that nothing goes wrong. Many of us, if we are aware that a high-consequence process change is already going on, will put effort into staying abreast of it in the event that we are needed.
I want to be clear here: I’m not advocating that those who are on-call necessarily spend more time on top of a full work week to stay up to date. Instead, I am suggesting that software organizations might be more effective if they understood that being on-call takes time investment during these periods. Obviously, in many cases pressure to maximize production at minimal staffing levels leads to a situation where this is not done, and perhaps quite the opposite is done.
The authors also cover a case where an incoming controller was not briefed about a change that had taken place essentially in private. Previously, they had a plan to not close vent doors, but a private phone call ended up reversing the decision. So when the need and order came down, that controller was surprised that this was the case and was unprepared for it.
This is an important note in the paper because, as the authors point out, it tells us that coping with the added workload of bringing someone up to speed by postponing updates will actually cause further breakdowns later. So it may feel like an effective strategy in the moment, in that someone does not have to do updates then and is instead perhaps focusing on the “real work,” but in the longer term they are more likely to fail.
NASA Johnson Space Center has already implemented this organizational solution during missions where the staffing is reduced unless a problem occurs. They have made being on-call an official responsibility that requires investment, although less than if all the controllers are continuously staffed.
- Handovers take time
- Getting people up to speed takes purposeful effort, which can be made easier with good systems but not eliminated
- Reducing surprise through effective handoffs can help responders be more prepared
- While having staff on-call can reduce staffing needs, they still need effort and investment
**Who are you?** I’m Thai Wood and I help teams build better software and systems.
Want to work together? You can learn more about working with me here: https://ThaiWood.IO/consulting
Can you help me with this [question, paper, code, architecture, system]? I don’t have all of the answers, but I have some! Hit reply and I’ll get right back to you as soon as I can.
**Did someone awesome share this with you?** That’s great! You can sign up yourself here: https://ResilienceRoundup.com