I’m one of the operators of Wonderland, Jimdo’s in-house PaaS for microservices.
Two weeks ago, on September 5, I did something embarrassing at work.
We were debugging a broken deployment of our central API service. This API is nothing less than the entry point for managing all container-based services running on our platform, including most of our own system services (by virtue of dogfooding).
In an attempt to fix the problem we were experiencing – our API service failed to scale to a certain number of replicas – I deleted what I believed to be a duplicate instance of the corresponding ECS service in the AWS Management Console…
That turned out to be a mistake.
Instead of performing this action in our AWS staging account, as I intended to do, I accidentally deleted the ECS service in our production account. Worse, I did not delete some duplicate; it was the real thing.
To make a long story short, this blunder caused our API to be down for 31 minutes, mainly because it took us a long time to figure out how to redeploy the broken API service.
Guess what I did immediately after resolving the incident and telling our users the good news?
I started writing a postmortem. Not because I had to, but because I know that postmortems are the ultimate tool to learn from incidents.
So, what is a postmortem?
A postmortem is a written record of an incident. Among other things, it documents the incident’s impact, what caused it, the actions taken to mitigate or fix it, and how to prevent it from happening again.
I will tell you more about the ingredients of a good postmortem later in this article. For now, I want you to understand that fixing the underlying issue(s) of an incident is important – but not enough. We also need a formalized process to learn from these incidents.
That’s what postmortems are for. Postmortems help us understand why incidents happen and how we might better prepare our systems for the future. We share this knowledge with other teams or, better yet, make postmortems public so that more people can benefit from them – and see that we actually care.
Failure isn’t a disaster, it’s a learning opportunity.
Similar to Chaos Engineering, conducting postmortems requires a fundamental shift in the mindset of managers and employees, if not whole companies. And like Chaos Engineering, postmortems have the potential to make a company more resilient as a whole.
We need to embrace failure for postmortems to work. This includes having blameless postmortems and no finger-pointing when looking for root causes. (By the way, human error is never a root cause, but more on that in the second part of this article.)
When to write a postmortem? That’s up to you. A good rule of thumb is to write one if an incident has an immediate impact on users (downtime or data loss) or if a stakeholder asks for it. Putting together a decent postmortem often takes a couple of hours, but the effort is usually worth it! Just remember to start the work as soon as possible, with events still fresh in mind.
Inspired by another infrastructure team at Jimdo, we started using the Example Postmortem from the SRE book as a template for the postmortems we do for Wonderland.
If you follow the link, you’ll find a Markdown version of said template. I created it from the PDF book for two reasons: First, I wanted to share our postmortems as part of our standard Wonderland documentation, so that our users (other Jimdo teams) can easily find them. Second, all of our development is based on GitHub and we’re used to writing and reviewing Markdown files. In other words, my goal was to reduce barriers to reading, writing, and publishing our postmortems.
As far as I can tell, this initiative was a success. I’ve been astonished at how quickly my teammates adopted the new template. Ever since I published the first postmortem in this manner, they’ve been eager to do the same whenever a new incident occurs. I’m aware that this is, for the most part, a matter of having the right mindset. However, it certainly doesn’t hurt to make the process more pleasant for everyone involved.
Publishing more postmortems ultimately means being more transparent about failures, which in turn builds trust in our team and makes our platform more reliable. We’re an internal service provider after all.
Your first postmortem
Now I encourage you to use the template as a foundation for your next – or perhaps first – postmortem. Give it a read. If there wasn’t an incident in the last few days (I hope so!), think of the last time you had to deal with an outage. Then go through the different sections and try to fill in the blanks. Be consistent. Use the active voice throughout the document. Settle on a time zone and format. As with most templates, feel free to customize it to your own needs.
Here’s a summary of what each section in the template postmortem is about, including short examples:
- Title – Name of the postmortem, e.g. “Deployer API Outage Postmortem”
- Date – When the incident happened, e.g. “2016-09-05”
- Authors – List of people who wrote the postmortem (GitHub handles work fine)
- Status – Current status of the postmortem, e.g. “Complete, action items in progress”
- Summary – A one-sentence summary of the incident, usually something like “Service X was down for N minutes due to Y”
- Impact – The incident’s impact on customers and, if known, revenue or reputation, e.g. “Users were unable to do X while service Y was unavailable from 09:29 to 10:00 UTC”
- Root Causes – A list of causes that have contributed to the incident (I’ll cover root causes in part 2 of this article)
- Trigger – The event that triggered the outage, e.g. “Merging pull request X which started the rollout of broken software Y”
- Resolution – The action(s) that mitigated and resolved the outage, e.g. “Disabling feature X helped to mitigate the problem. Rolling back to version Y resolved it.”
- Detection – How the problem was noticed, e.g. “Pingdom detected that service X was down and paged on-call via PagerDuty”
- Action Items – A list of actions taken (with links to GitHub issues) to mitigate or resolve the incident, and to prevent it from recurring
- Lessons Learned – What went well? What went wrong? And what was sheer luck?
- Timeline – A detailed timeline of the events related to the incident
- Supporting Information – Additional graphs, screenshots, command output, etc.
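If you would rather generate the skeleton than copy it by hand, the sections above can be sketched as a small Python script. This is only an illustration under my own assumptions, not the SRE book's template: the `Postmortem` class, the `downtime_minutes` helper, the Markdown layout, and the `@example-author` handle are all hypothetical. The title, date, and times come from the examples in the list.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime

# Body sections in the order listed above.
SECTIONS = [
    "Summary", "Impact", "Root Causes", "Trigger", "Resolution",
    "Detection", "Action Items", "Lessons Learned", "Timeline",
    "Supporting Information",
]


@dataclass
class Postmortem:
    """Hypothetical container for the template fields (not an official schema)."""
    title: str
    date: str  # one fixed format throughout, e.g. "2016-09-05"
    authors: list[str]
    status: str = "Draft"
    sections: dict[str, str] = field(default_factory=dict)

    def to_markdown(self) -> str:
        """Render a Markdown skeleton; sections without text are marked TODO."""
        lines = [
            f"# {self.title}",
            "",
            f"- Date: {self.date}",
            f"- Authors: {', '.join(self.authors)}",
            f"- Status: {self.status}",
            "",
        ]
        for name in SECTIONS:
            lines += [f"## {name}", "", self.sections.get(name, "TODO"), ""]
        return "\n".join(lines)


def downtime_minutes(start: str, end: str) -> int:
    """Whole minutes between two HH:MM times on the same day, same time zone."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


minutes = downtime_minutes("09:29", "10:00")
pm = Postmortem(
    title="Deployer API Outage Postmortem",
    date="2016-09-05",
    authors=["@example-author"],  # hypothetical GitHub handle
    sections={"Summary": f"The API was down for {minutes} minutes."},
)
print(pm.to_markdown())
```

Anything you have not written yet shows up as `TODO`, which makes it easy for reviewers to spot the gaps in a first draft.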
Postmortems are a collaborative effort thriving on feedback. Make sure to share first drafts internally with your team. Once the review is complete, share the postmortem with as many people as possible.
Always keep in mind: Failure isn’t a disaster, it’s a learning opportunity for the whole company.
That’s the end of part 1. In part 2, I’m going to dive into the wonderful world of root causes, probably the most important – and most difficult – element in conducting a postmortem.