Being on-call can be a nerve-wracking experience. Something might break any minute. It’s unpredictable. Yet, you’re in charge when a server explodes (and they like to explode in the middle of the night). Knowing this might freak you out. At least that’s what happened to me during my first couple of on-call shifts after joining Jimdo. As a bloody beginner, I was afraid of what might happen. I dreaded the moment the buzzing alert sound would go off. Would I be able to handle the situation with the little training I had? All I wanted was to survive whatever on-call shift I was on. It stressed me out.
Before moving to the cloud, every engineer who participated in our on-call rotation had to deal with a relatively large number of incidents over time – most of them minor in severity but high on the annoyance scale (lighttpd backend overloaded anyone?). Due to the nature of our architecture, we often experienced cascading failures – failures that spread from one part of the system to another, leading to a good deal of alerts for us to handle. Once a server was in production, we tried to touch it as little as possible – unless something was broken, of course. And things did break all the time.
That was three years ago. A lot has changed since then. We finished migrating our core infrastructure from bare metal to the cloud. AWS is doing an amazing job when it comes to reliability. This and the fact that we had to redesign our software for the cloud – and yes, we no longer use lighttpd – led us to where we are today: outages are rare, our customers are happier, and we get a lot more sleep.
So all is good now, right?
When failure is the rule
It is beyond question that having too many alerts – and fighting too many fires at the same time – is a real problem, as it might lead to alert fatigue, sleep deprivation, and a good amount of frustration. In other words, plenty of stressful situations that used to keep us busy and distracted us from our actual work.
However, the fact that we were dealing with failure on a regular basis also had some advantages:
- Everyone knew, for the most part, what to do when shit hit the fan. We all had experience with the most common failure modes. Like a well-oiled machine, we followed pretty much the same procedures when it came to incident communication, documentation, escalation, etc.
- Due to the sheer number of alerts, we knew that our monitoring and alerting systems worked in principle (leaving aside the fact that some alerts were questionable). These days, alerts are so rare that I sometimes wonder if PagerDuty is broken.
- Last, and most importantly, we never believed that our system was particularly resilient to failures. On the contrary, we were always wary of what might happen next.
When failure is the exception
On the other end of the spectrum, when failure is the exception rather than the rule, you risk losing every single advantage I outlined above. As John Allspaw put it in his article, Fault Injection in Production:
If a system has a period of little or no degradation, there is a real risk of it drifting toward failure on multiple levels, because engineers can become convinced – falsely – that the system is experiencing no problems because it is inherently safe.
Just because there are no problems now doesn’t mean it’s going to stay that way. Sooner or later, any complex system will fail. It’s inevitable. That’s why you should never get too comfortable.
There’s a word for this very condition: complacency. Merriam-Webster defines complacency as “self-satisfaction especially when accompanied by unawareness of actual dangers or deficiencies”.
Complacency is the enemy of resilience. The longer you wait for disaster to strike in production – merely hoping that everything will be okay – the less likely you are to handle emergencies well, both at a technical and organizational level.
How to fight complacency
Ultimately, our goal is to be aware of actual dangers and to prepare for them accordingly. Learning from outages after the fact is important (and that’s what postmortems are for), but it shouldn’t be the only method for acquiring operational knowledge. Allspaw agrees, as he continues:
[…] building resilient systems requires experience with failure, and that we want to anticipate and confirm our expectations surrounding failure more often, not less often. Shying away from the effects of failure in a misguided attempt to reduce risk will result in poor designs, stale recovery skills, and a false sense of safety.
This observation may come as a surprise to those who believe that systems should never fail, but waiting for things to break in production is not an option. Instead, we should proactively inject failures in a controlled manner in order to prepare for the worst and gain confidence in our systems. This is the core idea behind Chaos Engineering and related practices like GameDay exercises. Again, Allspaw summed it up best:
[…] failure-inducing exercises can serve as “vaccines” to improve the safety of a system—a small amount of failure injected to help the system learn to recover. It also keeps a concern about failure alive in the culture of engineering teams, and it keeps complacency at bay.
These days, we’re running GameDay exercises periodically at Jimdo – making sure we never get too comfortable. We know that, once again, embracing failure is key to building resilient systems.
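To make the idea of controlled failure injection a bit more concrete, here is a minimal sketch of the kind of chaos-monkey-style experiment a GameDay might start from. This is an illustration, not Jimdo's actual tooling: the instance IDs, the `terminate` callback, and the `protected` allowlist are all hypothetical stand-ins for whatever your infrastructure provides.

```python
import random

def pick_victim(instances, seed=None):
    """Pick one instance at random to fail during a GameDay.

    Returns None if the candidate pool is empty. A seed can be
    passed to make the experiment reproducible in a rehearsal.
    """
    if not instances:
        return None
    rng = random.Random(seed)
    return rng.choice(instances)

def run_gameday(instances, terminate, protected=()):
    """Inject a single controlled failure.

    Filters out protected instances (e.g. a lone database primary),
    picks one victim at random, and hands it to the terminate
    callback -- which could kill a process, drop it from a load
    balancer, or stop a cloud instance, depending on the exercise.
    Returns the victim so the team can watch recovery happen.
    """
    candidates = [i for i in instances if i not in protected]
    victim = pick_victim(candidates)
    if victim is not None:
        terminate(victim)
    return victim

# Example: a dry run that only records which instance would die.
killed = []
victim = run_gameday(
    ["i-aaa", "i-bbb", "i-ccc"],   # hypothetical instance IDs
    terminate=killed.append,       # stand-in for a real kill action
    protected=("i-aaa",),          # never touch this one
)
print(f"GameDay victim: {victim}, terminated: {killed}")
```

The point of keeping the "terminate" action pluggable is that the same small harness can drive a harmless dry run first, then a real exercise once everyone knows what to watch for, which is exactly the controlled, small-dose failure the "vaccine" metaphor describes.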