Yin and yang – failure and success.

The other day, when I was listening to the Beats, Rye & Types podcast, I noticed this sharp statement, which I had to jot down immediately:

With the traditional methods of dealing with failure… you can get to a certain threshold of safety – and you hit sort of a plateau beyond which you cannot go. In order for you to go beyond that, you have to start approaching safety in this new way, basically focussing on learning, removing blame, building healthy organizations.

It was none other than Dave Zwieback, the author of Beyond Blame: Learning From Failure and Success, who said this on the show. He’s one of the best-known proponents of using blameless postmortems – or learning reviews, as he prefers to call them – to address fragility within complex systems and organizations. The interview with Dave made me want to read his book, which turned out to be an excellent decision. Despite being a short read (69 pages of content), it’s a masterpiece packed with valuable insights – some of them entirely new to me.

Learning through postmortems

Learning from the past is very difficult, yet it’s precisely what we need to do for our organizations to succeed. If we look a bit more closely, there’s no shortage of opportunities to learn from the past – server outages are a prime example – provided that we stop the finger-pointing and “go beyond blame and punishment”, as Dave puts it.

With the perfect vision afforded by hindsight, we can spend a lot of time ruminating on what we could or should have done. That is counterproductive; the past is past. We need to acknowledge and learn from our mistakes, and move forward, focusing instead on what we will do now and in the future.

Fixing the technical issues that manifest during outages is important, but not sufficient. We also need a structured process to truly learn from these events. That’s what postmortems are for. They help us understand why incidents happen and how to prepare both our company and the systems we run for the future. After all, “outages are symptoms of trouble somewhere deeper in our organization”.

Indeed, I can say without exaggeration that conducting postmortems has been one of the most rewarding – though initially uncomfortable – experiences of my engineering career. Postmortem conversations can be tense when stakes are high. It’s also hard work to “mentally transport ourselves to the past” while we’re under the influence of cognitive biases (we always are). Hindsight bias, in particular, remains an obstacle to incident investigation, making it impossible to assess human performance accurately after the fact.

“If only Mike didn’t troubleshoot the router,” for example, is not describing what actually happened, and instead of learning from the past, we’re engaging in a kind of lazy (but very comforting) wishful thinking. “Mike could have asked for help,” or “Mike should have done more testing in the lab,” or “Mike didn’t do the right thing,” are all counterfactuals, and are all evidence of hindsight bias.

All actions are gambles

A closely related cognitive error is outcome bias: the tendency to judge a decision by its eventual outcome, which of course isn’t known at decision time. The same routine command we’ve used hundreds of times in the past – one that might have saved the day more than once – suddenly crashes the server because the system has drifted into failure, unnoticed by anyone. Now we’re no longer the hero; we’re just some careless cowboy administrator. That’s outcome bias at work.

It’s important to understand that every outcome, successful or not, is the result of a gamble. In his influential paper How Complex Systems Fail, Richard Cook observes the following:

[In complex systems] all practitioner actions are actually gambles, that is, acts that take place in the face of uncertain outcomes. The degree of uncertainty may change from moment to moment. That practitioner actions are gambles appears clear after accidents; in general, post hoc analysis regards these gambles as poor ones. But the converse: that successful outcomes are also the result of gambles; is not widely appreciated.

Why is that so hard to comprehend? Beyond Blame has one answer:

It’s a reflection of our misunderstanding of how complex systems function… Failure is a normal part of complex systems, yet it’s always so surprising when they fail. Why aren’t we more surprised when they function? … In complex systems failure is absolutely normal and expected. Malfunction is as ‘normal’ as ‘regular’ functioning.
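To make the "all actions are gambles" idea a little more concrete, here is a minimal Python sketch. It is only an illustration, not anything taken from the book or from Cook's paper: the failure probability, the function names, and the verdict labels are all assumptions of mine. The operator makes exactly the same decision on every run; only the hidden state of the system differs, yet a reviewer who judges by outcome hands out very different verdicts.

```python
import random

# Hypothetical figure: the chance that the system has silently drifted
# into a state where the routine command triggers an outage.
FAILURE_PROBABILITY = 0.01


def run_routine_command(rng: random.Random) -> bool:
    """Return True if the command succeeds, False if it causes an outage.

    The operator's decision is identical on every call; the result
    depends only on chance (the hidden state of the system).
    """
    return rng.random() >= FAILURE_PROBABILITY


def outcome_biased_verdict(succeeded: bool) -> str:
    """Judge the operator purely by the outcome, as outcome bias does."""
    return "hero" if succeeded else "careless cowboy"


def simulate(runs: int, seed: int = 42) -> dict[str, int]:
    """Count the verdicts an outcome-biased reviewer would hand out."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    verdicts = {"hero": 0, "careless cowboy": 0}
    for _ in range(runs):
        succeeded = run_routine_command(rng)
        verdicts[outcome_biased_verdict(succeeded)] += 1
    return verdicts


if __name__ == "__main__":
    print(simulate(10_000))
```

Every "hero" and every "careless cowboy" in the tally took the same gamble with the same information. The only thing that changed was how the dice fell, which is why judging the decision by its outcome tells us so little about the decision itself.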

We get most things right

Accepting that malfunction and “normal operation” are part of complex systems is one thing; actually learning from both failures and successes is another. I agree with Dave that postmortems tend to place too much focus on the former while neglecting the latter:

We also want to learn not just what went wrong… but what went right – what usually goes right. We’re typically overly focused on failures, forgetting that the same systems – including the people working in them – produce both positive and negative outcomes. Mostly positive, in fact – we certainly don’t have outages every hour or even every day!

Given the tech industry’s obsession with failure, I find this encouraging. We get most things right. Something to remember.

To wrap this up, here’s what we should do to prevent future incidents as best we can, even though some will inevitably occur:

Learning from both failures and successes. Feeding these learnings as signals back into the system, which will change and adapt to this new information. That’s why air travel has become as safe as it is over time – every time there is an accident or near-accident, it’s investigated, and the results are fed back into the system. This system includes air craft, traffic control, weather, engineers, and so on.

I couldn’t have said it better myself.

Further reading

I’ve been studying complex systems, postmortems, root cause analysis, etc. for some time now. Feel free to look into the following articles if any of this is of interest to you:

  1. How Complex Web Systems Fail - Part 1 and Part 2
  2. The Myth of the Root Cause: How Complex Web Systems Fail
  3. Writing Your First Postmortem
  4. On Finding Root Causes

Photo credits: Flickr

