On Finding Root Causes
In my previous article I introduced you to postmortems – what they are, why you should conduct them, and how to get started writing your own postmortem documents.
As a quick recap, a postmortem is a written record of an incident documenting its impact, what caused it, the actions taken to mitigate or fix it, and how to prevent it from happening again. In a broader sense, postmortems are a great tool for a company or organization to learn from failure.
One of the big questions a postmortem has to address is: What has caused the incident? – What’s the reason the system failed the way it did?
At first glance, finding the root cause – the initiating cause that led to an outage or degradation in performance – seems to be the rational thing to do. For system owners, knowing who or what is responsible for an incident appears to be a desirable goal. After all, how else would they implement appropriate countermeasures?
In reality, however, trying to attribute an incident to a root cause in hindsight is not only impossible – it is fundamentally wrong.
This and most of the lessons that follow have their origin in Richard Cook’s paper “How Complex Systems Fail”. I already devoted a two-part series to his seminal work, but there’s so much more to learn from it, especially when it comes to postmortems. I think you’ll agree.
There is no single root cause
In complex systems, such as web systems, there is no root cause. Single-point failures alone are not enough to trigger an incident. Instead, incidents require multiple contributors, each necessary but only jointly sufficient. It is the combination of these causes – often small and innocuous failures – that is the prerequisite for an incident.
As a consequence, we can’t isolate a single root cause.
One reason we tend to look for a single, simple cause of an outcome is that the failure is too complex to keep in our heads. Thus we oversimplify without really understanding the failure’s nature and then blame particular, local forces or events for outcomes.
One of the things I like about the postmortem template I mentioned last time is that it says “Root Causes”, not “Root Cause”. For me, that’s a testament to the fact that you need to look deeper if you only have a single root cause.
But even “Root Causes” might not be the best term, as Andy Fleener has pointed out to me on Twitter:
I definitely prefer “Contributing Conditions” over Root Causes though. Cause can only be constructed with the benefit of hindsight
That’s a good point that made me consider modifying our own postmortem template as well.
Hidden biases in our thinking
Hindsight bias continues to be the main obstacle to incident investigation. This cognitive bias, also known as the knew-it-all-along effect, describes the tendency of people to overestimate their ability to have predicted an event after it has occurred, even though there was little or no objective basis for predicting it.
Indeed, hindsight bias makes it impossible to accurately assess human performance after an incident.
A related but distinct cognitive error is outcome bias, the tendency to judge a decision by its eventual outcome rather than by the quality of the decision at the time it was made. It’s important to understand that every outcome – successful or not – is the result of a gamble. The overall complexity of our web systems always poses unknowns. We can’t eliminate uncertainty.
After an incident has occurred, a postmortem might find that the system has a history of “almost incidents” and that operators should have recognized the degradation in system performance before it was too late. That’s an oversimplified view though. System operations are dynamic. Failing components and human beings are being replaced all the time. Attribution is not that simple.
We therefore need to be cautious of hindsight bias and its friends, and never ignore other driving forces, especially production pressure, when looking for root causes after an incident has occurred.
Human error is never a root cause
It’s the easiest thing in the world to point the finger at others when things go wrong. And unfortunately, many companies still blame people for mistakes when they should really blame – and fix – their broken processes.
Blameless postmortems only work if we assume that everyone involved in an incident had good intentions. This ties in with the Retrospective Prime Directive (a postmortem is a special form of a retrospective), which says:
Regardless of what we discover, we understand and truly believe that everyone did the best job they could, given what they knew at the time, their skills and abilities, the resources available, and the situation at hand.
Human error is NOT a root cause.
Instead, we should look for flaws in systems and processes – the causes contributing to failure – and implement measures so that the same issues don’t happen again. This requires systems thinking, which focuses on cyclical rather than linear cause and effect: viewing the system as a whole to find out how it drifted into failure, at both a technical and an organizational level.
Here’s one of my favorite passages on the topic, taken from the SRE book:
When postmortems shift from allocating blame to investigating the systematic reasons why an individual or team had incomplete or incorrect information, effective prevention plans can be put in place. You can’t “fix” people, but you can fix systems and processes to better support people making the right choices when designing and maintaining complex systems.
Does this mean that operators are off the hook? No, not at all. They’re the ones with the most knowledge surrounding the incident. For example, they know first-hand how the system failed in surprising ways. Hence, they’re responsible for finding ways to make the system more resilient – including writing a postmortem.
Practical example
Let’s wrap this up with an example from an actual postmortem.
Last time I told you a story about a recent outage at Jimdo. In a nutshell: To fix a broken deployment of our API service, I wanted to delete the corresponding ECS service in our AWS staging account. Unfortunately, I actually removed the service in our production account, causing our API to be down for half an hour. Oops!
In the postmortem that followed, we identified two root causes:
- The tool we’re using to log into AWS accounts made it difficult to figure out which account you’re operating in. This contributed to deleting the ECS service in the wrong account. (We subsequently improved the tool to show more helpful environment information; see the first sketch after this list.)
- The component that deploys services to our PaaS didn’t notice that the underlying ECS service was gone and still tried to update the (non-existent) service – which failed. This made it impossible to re-deploy the API service without manually removing all remaining pieces of the original service. (We’ve since made the operation idempotent; see the second sketch after this list.)
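To illustrate the first fix, here’s a minimal sketch of the kind of guard that makes the current account obvious before doing anything destructive. It assumes boto3 and a hypothetical mapping of account IDs to names – it’s not Jimdo’s actual tooling:

```python
# Minimal account guard sketch (hypothetical; not Jimdo's internal tool).
import sys
import boto3

# Hypothetical mapping of AWS account IDs to human-readable names.
KNOWN_ACCOUNTS = {
    "111111111111": "staging",
    "222222222222": "production",
}

def confirm_account(expected: str) -> None:
    """Abort unless the current credentials belong to the expected account."""
    account_id = boto3.client("sts").get_caller_identity()["Account"]
    name = KNOWN_ACCOUNTS.get(account_id, "unknown")
    print(f"Operating in AWS account {account_id} ({name})")
    if name != expected:
        sys.exit(f"Refusing to continue: expected '{expected}', got '{name}'.")

if __name__ == "__main__":
    confirm_account("staging")  # e.g. guard a destructive cleanup script
```

Making the environment explicit before every destructive command is a small change, but it removes exactly the ambiguity that contributed to the incident.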
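And here’s a sketch of what the second fix could look like: an idempotent deploy step that recreates the ECS service if it has disappeared instead of failing on the update. The cluster, service, and task definition names are illustrative, and this is an assumption about the approach, not our actual deploy code:

```python
# Hedged sketch of an idempotent ECS deploy step using boto3.
import boto3

ecs = boto3.client("ecs")

def deploy(cluster: str, service: str, task_definition: str, desired_count: int = 2) -> None:
    """Create the ECS service if it's missing, otherwise update it in place."""
    response = ecs.describe_services(cluster=cluster, services=[service])
    # Deleted services can linger with status INACTIVE (or show up under
    # "failures"), so only ACTIVE services count as existing.
    active = [s for s in response["services"] if s["status"] == "ACTIVE"]

    if active:
        # Service is present: roll out the new task definition.
        ecs.update_service(
            cluster=cluster, service=service, taskDefinition=task_definition
        )
    else:
        # Service is gone (e.g. deleted by hand): recreate it instead of failing.
        ecs.create_service(
            cluster=cluster,
            serviceName=service,
            taskDefinition=task_definition,
            desiredCount=desired_count,
        )

deploy(cluster="production", service="api", task_definition="api-service:42")
```

With a step like this, manually removing the leftover pieces of the original service is no longer a prerequisite for getting the API back up.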
I’m sure that if we had looked closer, we would have found more root causes contributing to the failure, but we stopped here, did our homework, and moved on.
One more time: You can’t fix people, but you can fix systems and processes to better support them.
Keep this in mind when you write your next postmortem.