⭐Trade-offs Under Pressure: Heuristics and Observations of Teams Resolving Internet Service Outages

This is the largest work I’ve analyzed yet and also the one most closely connected to software. If you’re interested in getting started or learning more about how incidents can be analyzed, this is a great place to start.


This is John Allspaw’s master’s thesis. It focuses on the cognitive difficulties of responding to incidents when robustness features fail. There has been a lot of research, how-tos, and guides written about how to make things more robust, but very little looks at what happens after that: What happens when those things don’t work? What happens to the people? How can we support them?

The paper as a whole asks the questions “How do engineers think about the outage they’re facing?” and “What makes them good at resolving issues when they do not have a set procedure for the scenarios they find themselves in?”

Since this is a master’s thesis, it goes over previous work, some background and then the research that was conducted. I’ll go over some of the background that is useful for us to know in software and then talk about the research and its findings.

Unlike some papers, there’s a lot we can learn both from how the research was done and from the findings themselves. In this case, you might even find the research process itself more applicable if you’re currently trying to learn more from incidents or otherwise investigate them.

Historically, the question asked by much of the existing research is “What is needed for the design of systems that prevents or limits catastrophic failure?” I think we all recognize this and can name a number of books/sub-disciplines/approaches that attempt to answer this question.

But in contrast this research asks (as does resilience engineering):

“When our preventative designs fail us, what are ways that teams of operators successfully resolve those catastrophes.”

Ultimately, Allspaw tells us the thesis question is:

How do judgement heuristics influence Internet engineers’ team coordination to resolve escalating large-scale outage scenarios?

The approach

The research uses a case study approach, based on process tracing as well as “semi-structured interviews.” They used cued recall in the interviews, which is where they show the person being interviewed something like a line from the IRC log and then ask them about it.

They do note that one downside to this approach is that you have to avoid “justification bias,” where people may change their explanation of why they did something. In this case they avoided it by not telling participants in advance what they’d be asked about, so that people couldn’t pre-formulate reasons (inadvertently or otherwise).

So they go through and review the logs with the person and then ask open-ended questions at the relevant points they’re curious about. Other cued recall interviews you may have seen involved video, which is the most common medium, but here they used what they had, which I recommend you do too. They had IRC logs, syslog, and the access logs for the graphs, so that’s what they used.

They interviewed 8 engineers 6 weeks after the event. On average, these interviews lasted 51 minutes and the audio was recorded and transcribed by a service and then proofread. The 8 engineers were split between 5 infrastructure engineers and 3 product team engineers with various tenure levels and career experience.

During the interviews they focused on a timeline of the events and the dashboard logs.

The event

This case study focuses on a single incident that started on December 4, 2014 at 1:06 PM Eastern Time. At that time it was noticed that the Etsy homepage for signed-in users was the generic version instead of the usual personalized one. The generic page is a fallback for when personalization has failed for some reason.

At this moment the homepage and personalization team were in a conference room for a lunch and learn, and were able to notice within about 6 minutes that the feature wasn’t working.

This is where the chat logs start, first in the #sysop channel to start diagnosing; the conversation later moves over to #warroom, where hypotheses begin to emerge and graphs begin to be shared.

1:18 PM

We’re now about 12 minutes in, when it’s noticed that the sidebar API is experiencing a spike in requests. At this point it’s suggested that the sidebar be turned off, since they’re also noticing 400 responses from the sidebar. These stop once it’s turned off.

But this is sort of confusing: where did those errors come from, and what do they have to do with personalization?

1:28 PM

Now, 22 minutes in, a responder notices that the gift card API call errors match the graph pattern of sidebar errors.

1:32 PM

27 minutes in, it’s noticed that the API request for data about a single shop is causing the errors because the given shop ID doesn’t exist. Upon investigation it’s revealed that the shop used to exist but is “closed,” and the user ID associated with the shop belongs to an Etsy employee.

1:38 PM

Six minutes later, now about 33 minutes in, it’s discovered that the shop was linked from the blog: the Etsy employee who owned the shop had posted an article to the blog around the time the errors started. They get someone from the content team to unpublish the post.

At this point a hypothesis is generated for this sequence of events that goes something like this:

  1. The post is published.
  2. The API populates the sidebar (which includes the shop from the blog post).
  3. The call for the shop data returns a 400 since the shop is closed.
  4. Because there is a 400 response, it doesn’t get cached, and thus the request must go to the database.
  5. This happens for every single user who is currently signed in and visiting the homepage.
  6. This causes requests to queue, which would normally be fine.
  7. But the latency eventually reaches 3.5 seconds, which triggers a timeout.
  8. Thus a generic fallback is shown.

The team leaves the “more from the blog” module turned off overnight just to be sure this is the case, and also makes some changes so that 400 responses get cached.
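
To make that caching change concrete, here’s a minimal sketch, in Python rather than Etsy’s actual code, of what caching error responses with a short TTL can look like. The ShopCache class and its TTL values are my own invention for illustration; the point is just that once a 400 result is cached, even briefly, a closed shop stops sending every signed-in homepage request to the database.

    import time

    class ShopCache:
        """Illustrative only: cache shop lookups, including error results,
        so a closed or missing shop doesn't send every request to the database."""

        def __init__(self, ok_ttl=300, error_ttl=60):
            self.ok_ttl = ok_ttl        # seconds to keep successful lookups
            self.error_ttl = error_ttl  # short "negative cache" for 4xx results
            self._store = {}            # shop_id -> (expires_at, result)

        def get_shop(self, shop_id, fetch):
            entry = self._store.get(shop_id)
            if entry and entry[0] > time.time():
                return entry[1]                   # cache hit, no database call

            result = fetch(shop_id)               # e.g. {"status": 400, "body": None}
            ttl = self.ok_ttl if result["status"] == 200 else self.error_ttl
            self._store[shop_id] = (time.time() + ttl, result)
            return result

    # Usage: the first lookup of the closed shop hits the database and returns
    # a 400, but requests within the next minute are served from the cache.
    def closed_shop(shop_id):
        return {"status": 400, "body": None}

    cache = ShopCache()
    cache.get_shop(12345, closed_shop)
    cache.get_shop(12345, closed_shop)  # no second database call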

Allspaw gives a caveat here about constructing narratives, and I shall too, since I’ve constructed a narrative for you at one further remove. There is much more detail in the paper, but much of it is there to explain what 400 errors mean, how caching matters, and the role of CDNs in serving traffic, which isn’t exactly what we’re focusing on here.

Doing the analysis

As I said in the beginning, much of what we can learn from this paper goes beyond the case study or the conclusions about heuristics (though those are useful too); it also gives us some insight into how such an analysis can be performed.

I’ve spoken to a lot of folks, and a fairly common problem I hear about is how difficult it is to move from not doing any sort of real analysis to having done one. The usual advice I give is to start small, and I’ll echo that here. Though we’re using this approach as a bit of a case study ourselves, that doesn’t mean we have to start with all of these elements; rather, they provide a cohesive picture of what incident analysis can look like and what some of the phases might be.

Also remember that this was done for a master’s thesis; few if any of us at tech companies will be held to such a standard, nor is such a standard needed to yield insight.

Parsing out “episodes”

In order to investigate the heuristics used at various stages of the incident, the events were broken down into “episodes” that could then be examined and categorized.

This was done by looking through the IRC data and then validating it with interviews. You’ll see this pattern a lot in this sort of incident analysis: look through the data, categorize or notice things, then validate them with the people who were there. It’s a good pattern and one that you can follow too (a rough sketch of what a first categorization pass might look like follows the list below).

Ultimately 3 categories were found:

  • Coordinative activities
    • “supporting or sequencing tasks in order to produce an outcome”
  • Diagnostic activities
    • “hypothesis generation and relayed observations”
  • Disturbance management activities
    • “state of response to the event was discussed or requested”
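
As a very rough illustration of that “look through the data, categorize, validate” pattern, here is a toy first pass that tags chat lines with those three categories using keyword matching. This is my own sketch, not the coding scheme used in the thesis; the keyword lists and the ChatLine shape are invented, and in practice the categorization is qualitative, with the resulting episodes validated with the responders themselves.

    from dataclasses import dataclass

    # Hypothetical keyword lists, only for illustration; the thesis categorized
    # episodes qualitatively, not with keyword matching.
    CATEGORIES = {
        "coordinative": ["can you", "i'll take", "who is", "deploying", "pushing"],
        "diagnostic": ["graph", "400", "errors", "looks like", "maybe it's"],
        "disturbance management": ["status", "any better", "turned off", "recovering"],
    }

    @dataclass
    class ChatLine:
        timestamp: str
        author: str
        text: str

    def rough_category(line: ChatLine) -> str:
        """First-pass label for an IRC line; a human still reviews the clusters
        and validates the resulting "episodes" with the people who were there."""
        lowered = line.text.lower()
        for category, keywords in CATEGORIES.items():
            if any(keyword in lowered for keyword in keywords):
                return category
        return "uncategorized"

    # Example: tag a log export, then eyeball the clusters to carve out episodes.
    log = [ChatLine("13:18", "eng1", "seeing a spike of 400 errors from the sidebar API")]
    print([(line.timestamp, rough_category(line)) for line in log])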

The examination of the chat data and logs focused on 8 of the 20 engineers who responded to the incident, because the conversation of the other 12 engineers didn’t seem to significantly influence anything they were trying to study.

Notice that there were 20 people in the channel, but they only looked at the ones that seemed to represent or reveal things about what they wanted to know. Of course, they had to read it to figure that out, but you can do the same thing! Just because there are 20 people in an incident channel doesn’t mean you have to interview all of them!

Looking for heuristics

At this stage, the focus was on the diagnostic and coordinative categories; because there was a lot of data, they had to constrain their search. Again, you can do the same thing. The data wasn’t categorized for its own sake; it helped decide where to focus. Just because data is there doesn’t mean you have to use it in every analysis, and you can always do another analysis later to explore different questions.

The point here is that your resources (time, energy, focus) are finite, as are those of the responders. It’s not just OK but necessary to strike a balance in order to perform a successful analysis. You could spend forever doing an analysis, but if you never finish, how can those insights ever be used?

From this filtering, 5 coordinative episodes were found and analyzed. They were all focused on making changes in prod to try to fix the issue itself or limit the damage it was causing. I won’t go into the specific “episodes” that were found because I don’t think it’s needed to understand the results or to get an idea of the sort of insights you can gain from analysis, but if you’re interested there are lots of graphs and more detailed descriptions in the paper itself.

The heuristics found

Four heuristics were found: 3 diagnostic and 1 coordinative. When faced with an anomaly:

  1. First, look for a recent change.

This was the most used rule of thumb found, but interestingly it was also the least explicit in the chat logs. It was very consistently mentioned by interview participants, though: everyone said that their first instinct, before gathering any other diagnostic data at all, was to check whether a change had been made. I think this is probably an unsurprising result to many of us, which is what makes it so interesting: the process could reveal this sort of bedrock guideline that many of us use but that the engineers didn’t talk much about in IRC.

  2. If nothing correlates with a change, widen the search.

This is where the engineers said they’d widen their search, and also where uncertainty was high.

“I don’t know. That means, I don’t know, you’re just starting from nothing. You have no… it’s harder to have any kind of working theory.”

  3. Look for patterns that match (or don’t match) previous incidents and causes so you can rule them in or out.
    • Or do the same thing, but focus on recent events.

This is called “convergent searching”. I think most of us have experienced this as well, even if we haven’t realized we’re using this method. Something about an incident will seem very similar to one we responded to, perhaps long ago, and it’ll guide our diagnosis.
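
One way to picture convergent searching is comparing the shape of what you’re seeing now against what previous incidents looked like. This toy sketch is entirely my own, not something from the thesis: it ranks a few made-up past incidents by how well their error-rate shapes correlate with the current one, which is roughly the “rule in or rule out” move the engineers described doing in their heads.

    from statistics import correlation  # Python 3.10+

    # Hypothetical error-rate "signatures" from past incidents, with a note on
    # what the cause turned out to be. None of this is data from the thesis.
    past_incidents = {
        "2014-08 cache stampede": [2, 2, 3, 9, 25, 60, 80, 85],
        "2014-10 bad config push": [2, 40, 41, 40, 39, 41, 40, 42],
    }
    current = [1, 2, 2, 8, 22, 55, 78, 90]  # what we're seeing right now

    # A strong match is a candidate to rule in, a weak one is easier to rule out.
    for name, series in sorted(
        past_incidents.items(),
        key=lambda item: correlation(item[1], current),
        reverse=True,
    ):
        print(f"{name}: r = {correlation(series, current):.2f}")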

  4. Don’t wait for tests to pass during outages.

Unlike other periods of work, during this event code was deployed based on consensus and was “push[ed] on through.” I think this is one of those points that many of us sort of understand even if we weren’t expecting it, but from the analysis alone it wasn’t really clear why this was.

It’s a bit of a paradox on the surface, too: given that new code changes might cause other problems, you might expect that folks would be sure to wait for automated tests to pass. On the other hand, as the author suggests, perhaps time pressure is driving it?

In order to help figure out what was happening, a small survey was sent out to all the Etsy engineers:

1. Before I deploy ANYTHING during an outage/degradation scenario, I will ask others if I should wait for tests to finish, or push on through. (yes/no)

2. When doing a CONFIG deploy during an outage/degradation scenario, on a scale of 1-5, where:
    a. 1=I NEVER wait for tests to finish, I rely on feedback from others on my code change before deploying.
    b. 5=I ALWAYS wait for tests to finish, I don't care how much time pressure there is.

Of the 32 engineers who answered the survey, 29 said yes, they do ask others whether they should wait, and more people rated their behavior as a 2 on the 1-5 scale than any other rating.

They also followed up with one of the survey respondents to get more context around the decision.

I don’t want to focus too much on the decisions this data represents, but rather on how there were lots of ways of answering questions about what was seen. They had chat logs that hinted at this heuristic, then they sent a tiny survey, then they followed up with one person.

It might be too much in your organization’s current process to do all of that (or you may not have the time!), but I think in many cases some of these things could be done to gain some insight.

Once you know more about how people are working, including things like what heuristics they might be using, you can then turn around and examine the tools they have available. Are those tools supporting that work?

For example, with heuristic 1, looking for a recent change, the graphs they used internally drew a vertical line on each graph for deploys. So in that case, yes, the tooling was supporting the work. But in a case where it wasn’t, you now have a good place to start improving the tooling.
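
As a sketch of what that kind of tooling support can look like, here is a hypothetical example using matplotlib and made-up data (not Etsy’s actual dashboards) that overlays deploy times on an error-rate graph so heuristic 1 can be applied at a glance.

    from datetime import datetime, timedelta
    import matplotlib.pyplot as plt

    # Made-up data standing in for a real metrics dashboard: errors per minute
    # plus the timestamps of recent deploys.
    start = datetime(2014, 12, 4, 13, 0)
    minutes = [start + timedelta(minutes=i) for i in range(40)]
    errors = [5] * 15 + [5 + 4 * i for i in range(25)]
    deploys = [datetime(2014, 12, 4, 13, 4), datetime(2014, 12, 4, 13, 14)]

    fig, ax = plt.subplots()
    ax.plot(minutes, errors, label="API 400s / min")
    for deploy in deploys:
        # The vertical line is what lets a responder check "was there a recent
        # change?" without leaving the graph they're already looking at.
        ax.axvline(deploy, linestyle="--", color="red", alpha=0.6)
    ax.set_xlabel("time")
    ax.set_ylabel("errors per minute")
    ax.legend()
    plt.show()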

Takeaways

  • Insights can be surfaced from incident analysis through a variety of methods like analyzing chat messages, interviews, surveys or combinations of tools and approaches.
    • These findings can then be used to evaluate supporting tools and processes for improvements.
  • Activity that limits the extent of the disturbance was a big part of the response, as opposed to acting on that disturbance directly.
    • This makes sense when we consider that in the face of anomalies, how to act on the disturbance itself is likely unclear.
  • Instead of focusing on how to build systems that prevent failure, this research analyzes how operators successfully navigate failure and restore services.
  • 4 heuristics were discovered in how the engineers responded to anomalies:
    1. First, look for a recent change
    2. If you found nothing before, widen the search.
    3. Look for patterns that match previous incidents or causes and rule them in or out.
    4. During an incident, rely on other engineers’ consensus about when to push changes, as opposed to waiting for tests to pass.
  • This research presents an in-depth, case-study-style analysis of a single incident; while that’s valuable, it’s not the only way to do an analysis, nor does it mean that all analyses need to be this rigorous to produce useful and actionable information.