Chaos Engineering, and fault injection in particular, is all the rage. Breaking things on purpose, rather than “by accident”, is what the cool kids do these days. It’s what they like to speak about at meetups and conferences or proudly promote on Twitter and their blogs. We’re starting to see job titles with the word “Chaos” in them, not unlike “DevOps” a couple of years ago (both meaningless, of course). Sarcasm aside, it’s evident that Chaos Engineering has become a technology trend, with more and more companies adopting it. While it might not have gone mainstream yet, we’re getting there for sure!
I’ve certainly contributed to the hype myself by publishing articles, giving presentations, and developing tools in the Chaos Engineering space. I even had a brief stint at a startup trying to sell “Failure as a service” – or “Resilience as a service”, one of the two – to enterprises. All in all, I think it’s fair to say that I’m an avid proponent of the practice. I believe that every Site Reliability Engineer should, at the very least, be familiar with the basics of proactive failure testing as a means to create better systems. (There’s an “R” in SRE for a reason.)
But here’s the rub: Being an advocate of something doesn’t mean you should close your eyes to its downsides and limitations. In fact, the most skilled engineers are well aware of the pros and cons of their favorite tool or method, and consider them carefully. I have the impression, however, that most discussions so far have focused almost exclusively on the advantages of Chaos Engineering – occasionally in a hyperbolic manner – without asking a lot of hard questions.
I, too, am guilty of this, as are many other Practitioners of Chaos (we definitely have the coolest names). This article is my attempt to take off those rose-colored glasses, at least for a moment, and put things into perspective. Believe it or not, Chaos Engineering does have its gotchas and limitations.
A Means to an End
First of all, it should be clear that Chaos Engineering is a means to an end, not an end in itself. Experimenting on a distributed system is of great worth, but what matters, in the end, is the production service you aim to improve in the first place. Breaking things is a ton of fun, I can attest to that, but unless you feed the results back – by fixing flaws, tweaking runbooks, training people – your chaos experiments are rarely more than a time killer. And no, updating your mental models isn’t enough (that is, unless you never forget anything and everyone has access to your brain). At the very least, write down any observations you make and follow up soon, if required.
Fault injection on its own won’t make your infrastructure more robust; people will. It should be obvious, but it’s not. Last year I reviewed an early draft of a book on Chaos Engineering. I was surprised to learn that there was no mention of any steps beyond the experimentation phase, as if that was the ultimate goal.
(A quick note: I’m well aware that fault injection and Chaos Engineering are not the same. When the latter comes up in practice, however, people almost always talk about inducing faults into distributed systems as an opportunity to learn, so cut me some slack here.)
One Step Forward, Two Steps Back
Even if you do the work, who’s to say that being able to uncover weaknesses will automatically lead to positive outcomes, like improved customer experience? As software developers know, identifying a bug and fixing it are two different challenges. Indeed, your optimization efforts in one area might increase brittleness in other areas, as David Woods points out:
[Expanding] a system’s ability to handle some additional perturbations, increases the system’s vulnerability in other ways to other kinds of events. This is a fundamental trade-off for complex adaptive systems where becoming more optimal with respect to some variations, constraints, and disturbances increases brittleness in the face of variations, constraints, and disturbances that fall outside this set.
One Among Many
Chaos Engineering is not a remedy for all of your reliability concerns, and it never will be. It’s merely one of many approaches used to gain confidence in system correctness (typically in the face of perturbation). Consider it necessary but not sufficient. And by no means is it – or should it be – the only way to learn from failure. As John Allspaw notes in his article on fault injection in production:
[GameDay] exercises aren’t meant to discover how engineering teams handle working under time pressure with escalating and sometimes disorienting scenarios. That needs to come from the postmortems of actual incidents, not from handling faults that have been planned and designed for.
Proactive failure testing and post-incident reviews go hand in hand. As we will see next, it’s a mistake to assume that doing enough of the former makes up for neglecting the latter (and arguably vice versa). Besides, neither of the two methods is a substitute for, say, proper monitoring and unit testing. All these practices complement each other.
Too Brittle or Too Reliable
It goes without saying that no amount of Chaos Engineering will fix a Big Ball of Mud (when duct tape is holding the architecture together). Try to design for failure at all levels of your system. Address known issues before inviting more chaos into brittle infrastructure.
At the same time, don’t overdo it. Solve the business problem at hand. You’re not Google. Well, actually, Google is a bad example because they do know that systems can be too reliable:
If a system gets too reliable, then the team who runs it feels like they need to keep it that reliable, even though there are potential failure modes that are very expensive to mitigate.
Chaos Engineering must make good economic sense (remember: it’s a means to an end).
Systems Will Continue to Fail
I’ve argued before that negative visualization – imagining what could go wrong to prepare for disruption – is an essential part of Chaos Engineering. However, it’s also one of its major limitations, as Allspaw points out in his article mentioned above:
The faults and failure modes are contrived. They reflect the fault designer’s imagination and therefore can’t be comprehensive enough to guarantee the system’s safety. While any increase in confidence in the system’s resiliency is positive, it’s still just that: an increase, not a completion of perfect confidence. Any complex system can (and will) fail in surprising ways, no matter how many different types of faults you inject and recover from.
It may sound overly pessimistic, but while Chaos Engineering surely is a net plus, impermanence makes sure that all complex systems will fail no matter how hard we try to avoid it (which is exactly why postmortems are so important). The Holy Grail of Automation – introducing faults automatically instead of manually – won’t change that fact a bit. Don’t fool yourself and set realistic expectations.
Now I can almost hear you scream: “But there’s Lineage-driven fault injection! We don’t need to dream up any failure modes!” I hear you. LDFI is, without a doubt, a great achievement and I’m looking forward to seeing more implementations of it in the wild. However, everyone who has read the paper knows that LDFI comes with its own list of requirements and limitations, which is probably the reason why we haven’t heard much about it outside of academia so far.
The Paradox of Automated Fault Injection
Speaking of automation, there’s an interesting paradox worth highlighting. Again, quoting from Allspaw:
If the faults that are injected (even at random) are handled in a transparent and graceful way, then they can go unnoticed. You would think this was the goal: for failures not to matter whatsoever when they occur. This masking of failures, however, can result in the very complacency that they are intended (at least should be intended) to decrease. In other words, when you’ve got randomly generated and/or continual fault injection and recovery happening successfully, care must be taken to raise the detailed awareness that this is happening – when, how, where, etc. Otherwise, the failures themselves become another component that increases complexity in the system while still having limitations to their functionality (because they are still contrived and therefore insufficient).
We like to talk at length about how visibility is key to detecting issues during chaos experiments – and maybe aborting experiments to prevent further harm – but let’s not forget building awareness of what’s going on in the first place.
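To make the abort part concrete, here’s a minimal sketch of what such a guardrail could look like. Everything in it is hypothetical – `inject_fault`, `revert_fault`, and `error_rate` stand in for whatever your chaos tooling and monitoring actually provide; the 5% threshold is an arbitrary example:

```python
# Hypothetical guardrail for a chaos experiment: abort as soon as a
# steady-state metric (here, an error rate) breaches a threshold.
# `inject_fault`, `revert_fault`, and `error_rate` are illustrative
# callables, not a real chaos tool's API.

def run_experiment(inject_fault, revert_fault, error_rate,
                   threshold=0.05, steps=10):
    """Run a fault-injection experiment with an automatic abort condition.

    Returns True if the steady state held for all steps, False if the
    experiment had to be aborted.
    """
    inject_fault()
    try:
        for step in range(steps):
            rate = error_rate()
            # Surface what is happening instead of letting the fault
            # pass silently - this is the awareness-raising part.
            print(f"step {step}: error rate {rate:.2%}")
            if rate > threshold:
                print("steady state violated - aborting experiment")
                return False
    finally:
        revert_fault()  # always clean up the injected fault
    return True
```

The point of the `finally` block is that reverting the fault and noticing the fault are separate concerns: even an experiment that passes cleanly should leave a visible trail.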
I recently found a paper on resilience that, among other things, tries to illustrate that antifragility does not exist, at least not in the sense of systems being universally antifragile. Here’s the relevant passage:
In general, for systems subjected to variability, noise, shocks and other random perturbations, it is possible to develop strategies that, on average, benefit from variability, but not any variability. Such strategies are designed to profit from the variability of particular stressors. Simultaneously, they are vulnerable to other stressors.
Whether you believe in antifragility or not (I have my doubts), the variability argument is spot-on. No system can withstand all turbulent conditions, but only specific ones – some of them thanks to Chaos Engineering. The somewhat worn-out vaccination analogy drives the point home: We inject something harmful into a system to build an immunity to it, where “it” refers to the particular “disease” we want to fight. While we might be better off overall, nobody and nothing ends up being invincible this way.
We can also look at the problem of variability in terms of state space, as Caitie McCaffrey has done for us:
Fault-injection testing forces [failures] to occur and allows engineers to observe and measure how the system under test behaves. If failures do not occur, this does not guarantee that a system is correct since the entire state space of failures has not been exercised.
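McCaffrey’s point can be made concrete with a quick back-of-the-envelope count. The numbers below are purely illustrative – 20 components and a campaign limited to single and double faults are assumptions, not anyone’s real test plan:

```python
# Back-of-the-envelope state-space math (illustrative numbers):
# with n independent components that can each be healthy or failed,
# there are 2**n failure states, but a test campaign visits only a few.

from math import comb

n = 20                      # components (services, links, disks, ...)
total_states = 2 ** n       # every healthy/failed combination
single_faults = comb(n, 1)  # experiments injecting exactly one fault
double_faults = comb(n, 2)  # experiments injecting two faults at once

covered = 1 + single_faults + double_faults  # incl. the no-fault state
print(f"{covered} of {total_states} states "
      f"({covered / total_states:.3%}) covered")
```

Even this generous campaign – every single and every pairwise fault – exercises a fraction of a percent of the state space, and real systems have far more than two states per component.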
You Can’t Have a Rollback Button
Last but not least, let’s look at a seemingly innocuous implementation detail.
The rollback button is a lie. That’s not only true for application deployments but also for fault injection, as both face the same fundamental problem: state. Yes, you might be able to revert the direct impact of non-destructive faults, which can be as simple as ceasing to generate CPU/memory/disk load or deleting traffic rules created to simulate network conditions. But no, you can’t roll back what has been inflicted on the rest of the system – the targeted application and everything that interacts with it. A prime example is corrupt or incorrect data stored in a database/queue/cache due to a program crash.
One might argue that it is the very goal of chaos experiments to reveal such weaknesses, and that’s exactly right. However, having a rollback/revert button promising to quickly get back to safety is, strictly speaking, a scam. That isn’t to say we should do away with these safeguards entirely. It just means we should stop implying that all, or even most, actions can be quickly reversed, which may cause engineers to take more risk than necessary.
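A toy example makes the state problem tangible. This is not a real chaos tool – the “crash” is a deliberately planted exception, and the cache is a plain dictionary – but the mechanics are the same:

```python
# Toy illustration (not a real chaos tool): reverting a fault stops the
# *cause*, but state written while the fault was active stays behind.

cache = {}

def write_order(order_id, amount, crash_mid_write=False):
    """Write an order in two steps; a crash mid-write leaves a partial record."""
    cache[order_id] = {"amount": amount, "status": "pending"}
    if crash_mid_write:
        raise RuntimeError("simulated crash before status update")
    cache[order_id]["status"] = "confirmed"

# Fault injected: the process "crashes" between the two writes.
try:
    write_order("A1", 42, crash_mid_write=True)
except RuntimeError:
    pass

# "Rollback": we simply stop injecting crashes. New writes succeed...
write_order("A2", 7)

# ...but the record corrupted during the experiment is still there.
assert cache["A1"]["status"] == "pending"    # stale, never confirmed
assert cache["A2"]["status"] == "confirmed"
```

Hitting a revert button here only stops future crashes; order A1 stays half-written until someone repairs the data by hand – which is precisely the kind of weakness a good experiment is meant to surface, and precisely what no rollback button can undo.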
And we each forget, every day, how much care we need to take when using our seemingly benign tools; they are so useful and so sharp.
– Michael Harris, The End of Absence
Chaos Engineering has many benefits – I can’t imagine a tech world without it – but all that glitters is not gold. Please stop pretending that it is.