Welcome back! This week I’m featuring post-incident reviews from a variety of organizations.
A chance to learn from a meeting of researchers and tech companies coming together. I recently read this through for the second time, and I strongly recommend it. One thing that stuck with me this time is the line it draws between two types of surprise that responders experience: situational surprise and fundamental surprise.
David Woods again brings his systems safety experience, this time to NASA. He explores not just a specific incident, but also recommendations for the organization as a whole.
Review of the System Failure That Led to the Tax Day Outage
Another opportunity to see how other teams and agencies do post-incident review, this time from the Office of the Treasury on their outage during Tax Day. A very relatable finding here: “While the response team’s substantial efforts allowed the IRS to resume tax processing operations the same day, improvements are needed to help mitigate or prevent outages.”
adaptivecapacitylabs/Resilience-Engineering-Resources
A suggestion from John Allspaw on things to read. I’m already starting to go through these, but if you want to “read ahead” here’s a chance. As he says in the README: “This is a collection of readings, talks, and other bits regarding the field of Resilience Engineering.”