Last time I introduced you to Chaos Engineering, a discipline based on the idea that proactively triggering failures in a system is the best way to prepare for disaster.
My goal was to show you how to get started running your own chaos experiments, mostly on a technical level. I left out some things by necessity. For example, I only briefly mentioned that it would be a good idea to introduce a company to the concept of Chaos Engineering by starting small, rather than wreaking havoc on production from the get-go. I took it as given that chaos experiments are endorsed by everyone. Of course, that’s not the case.
Chaos Engineering, or engineering in general, doesn’t happen in a vacuum. There are people involved. Chaos Engineering isn’t merely a technical solution to the problem of building resilient systems. For it to work at all, it requires a fundamental shift in the mindset of managers and employees, if not whole companies.
Failure is a hard sell
For many companies, failure is unacceptable. Management considers outages, big or small, to be intolerable. They believe that systems should never fail, and they spend a lot of money trying to make this dream a reality (while never getting there). Most of their managers, especially those in non-technical departments, will argue that chaos experiments offer little benefit and carry substantial risk.
In these companies, people are more likely to cover up a mistake than deal with it openly because they’re afraid of being punished for failure. This blame culture is seriously toxic. Not only does it hurt morale, it also hinders collaboration and innovation, depriving teams of the opportunity to learn from mistakes together.
And it’s not just managers. Engineers who try to avoid emergencies altogether find it just as difficult to stand by as their systems break in ways they couldn’t have imagined. After all, a deployment pipeline running automated tests ought to be enough, right?
Introducing Chaos Engineering – forcing systems to fail – seems to be impossible in such circumstances. Indeed, it would take more than a technological shift. It would need a change in culture.
Learning to embrace failure
Luckily, there are other companies, like Etsy or Stripe, that have learned to embrace failure. They realize that failure is unavoidable, whether we’re developing software or managing people or doing something else entirely. They understand that the key to building resilient systems is to accept that failure is a part of life. Unsurprisingly, these companies are the first to adopt Chaos Engineering or similar practices, and are therefore better equipped to manage outages and other surprises when they do happen.
I know this because I’m privileged to be working in a place where failure isn’t a disaster, but a learning experience. I wouldn’t go so far as to say that failure is part of our company culture – not yet – but engineers are trusted to do the right thing, with plenty of room for (chaos) experiments. Besides trying to anticipate where something might break in the future, we also hold blameless postmortems on outages and accidents as a means to learn from past events.
In fact, more than once I was on call when something went wrong and I eventually had to escalate to my boss. In the end, we always fixed problems by working together as a team. The inevitable postmortems that followed helped us understand how accidents actually happened – without blaming individuals – and how we might better prepare for the future.
How to make the case for chaos experiments
Again, I know I’m in a privileged position. You might not be so lucky. The challenge is to convince people – the right people – of the value of Chaos Engineering. You want them to see how much there is to learn from failure and that the benefits, such as increased confidence in the system, outweigh the costs.
If you’re having a hard time selling the idea of Chaos Engineering, here are a few things that may or may not work for you:
- Look out for outages and other failures. They can give you the ammunition you need to get the approval for some smaller chaos experiments.
- Try to find a sponsor, someone in power who understands the value of resilience testing and is willing to promote it throughout the company.
- Suggest experimenting in a non-production environment first (as mentioned before). There are downsides to this: some behaviors can be seen only in production, and repairing a broken test environment is often a lot of work. Still, you can expect to get less pushback.
- Make your systems fault-tolerant to the best of your ability before running chaos experiments. I hope this is obvious.
- Mitigate risk as much as possible and communicate your efforts. Make sure to carefully review experiments, monitor them closely, and have experts at hand who can revert changes should things go wrong.
- Keep stakeholders in the loop. Announce experiments early enough. They can be risky and disruptive. You don’t want to lose trust before having a real chance to earn it.
- Share insights with other teams. Help them get started with their own experiments. More teams are likely to join once they see the results.
- If successful, tell everyone how Chaos Engineering was instrumental in improving resilience. Advertise it as much as you can!
I hope this helps.
Building resilient systems is an ongoing, iterative process. Making a compelling case for Chaos Engineering is a good first step in changing the prevailing mindset that failure is a bad thing.
This piece was, to a great extent, inspired by these ACM articles:
- Resilience Engineering: Learning to Embrace Failure – a discussion with Jesse Robbins, Kripa Krishnan, John Allspaw, and Tom Limoncelli
- The Antifragile Organization by Ariel Tseitlin
- Weathering the Unexpected by Kripa Krishnan
- Fault Injection in Production by John Allspaw
I strongly recommend reading all of them if you want to dig deeper.