I started reading the Site Reliability Engineering book, commonly referred to as “the SRE book”. In this collection of essays, members of Google’s SRE team write about how they run their production systems. The book contains a short chapter on a topic near and dear to my engineering heart: simplicity and the role it plays in SRE. In this article, I’m going to summarize the lessons from that chapter and add some thoughts of my own.
Stability vs. agility
SRE is about keeping the agility and stability of production systems in balance. Running a reliable service is easy if the service is frozen. Running a reliable service that also keeps evolving is hard. The challenge is to keep changing an application without breaking the parts that already run correctly.
Reliable processes can increase developer agility. Reliable production rollouts, for example, make it easier to link changes to bugs. This in turn allows developers to focus on more important things, such as functionality and performance of their software.
This separation of concerns is, in fact, one of the reasons why cluster managers like Kubernetes exist. Developers can talk to the Kubernetes API to deploy their application containers in a simple and reliable way without having to worry about cluster nodes, host OS/kernel, or underlying hardware.
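As a sketch of what that separation looks like in practice, here is a minimal Kubernetes Deployment manifest. All names, the image, and the replica count are placeholders of my own, not from the book; the point is that this short declaration is the whole interface – node selection, scheduling, and restarts are the cluster's concern, not the developer's:

```yaml
# Hypothetical minimal Deployment – names and image are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app
spec:
  replicas: 3                  # Kubernetes keeps three pods running
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      containers:
        - name: example-app
          image: registry.example.com/example-app:1.0.0
          ports:
            - containerPort: 8080
```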
Accidental vs. essential complexity
Software should behave predictably and accomplish its goals without too many surprises (that is, outages in production). The number of surprises directly correlates with the amount of unnecessary complexity found in a project. It’s therefore crucial to think about accidental complexity and essential complexity:
Accidental complexity relates to problems which engineers create and can fix, [whereas] essential complexity is caused by the problem to be solved, and nothing can remove it
– Fred Brooks in his seminal “No Silver Bullet” essay
SRE teams should push back when accidental complexity is introduced into the systems they're in charge of, and they should keep working to eliminate existing accidental complexity over time.
In practice, it can be hard to tell essential complexity from accidental complexity. In my experience, you can avoid much unnecessary complexity by:
- Constantly asking yourself what it is you want to accomplish
- Thinking deeply about the problem domain
- Saying “no” by default
- Outsourcing work to service providers, if possible
(I’m aware that these points need more elaboration, most likely in the form of another article.)
Code is a liability
It’s poor practice to comment out unused code, or worse, to gate it behind a feature flag. Code that serves no purpose is a major source of distraction and confusion. Today’s version control systems make it trivial to revert any change; there’s no reason to keep dead code and other bloat around. Less code means less complexity, which means fewer bugs, which means fewer unexpected outcomes in production. Deleting many lines of code – sometimes hundreds or thousands – is indeed very satisfying.
At the same time, we should think twice before adding new features. Once again, this comes down to essential vs. accidental complexity and saying “no” to many things in order to focus on the core problem.
Writing clear, minimal APIs is key to managing simplicity in software systems. Smaller APIs with fewer methods and arguments are not only easier to understand and test, they also allow us to put more effort into comprehending the actual problem we set out to solve.
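To make this concrete, here is a small, hypothetical example of my own (not from the book): an API consisting of one function with two arguments. Because the surface is so small, exercising it end to end takes only a few lines.

```python
# Hypothetical minimal API: one function, two arguments, one clear job.
def retry(operation, attempts=3):
    """Call `operation` until it succeeds or `attempts` runs out."""
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except Exception as error:  # broad catch is fine for this sketch
            last_error = error
    raise last_error

# A tiny API is also trivial to test: simulate two transient failures.
calls = []

def flaky():
    calls.append(1)
    if len(calls) < 3:
        raise RuntimeError("transient failure")
    return "ok"

result = retry(flaky, attempts=5)  # succeeds on the third call
```

A wider API – timeouts, backoff policies, jitter, logging hooks – might be justified eventually, but each extra argument is surface area that has to be understood, tested, and kept stable.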
As you can see, there’s a pattern involved here: less is more.
Modularity and decoupling
Many concepts of object-oriented programming (OOP) also apply to the design of distributed systems. For instance, both involve breaking problems up into small, manageable components. No surprise then that both benefit from loose coupling – the ability to update parts of a system in isolation – as an effective method for increasing developer agility and system stability. A decoupled system, in particular, reduces the probability of unintended consequences.
The idea of modularity also extends to APIs. API versioning allows developers to upgrade individual components to a newer version in a controlled way, rather than forcing all teams to upgrade in “big bang” fashion. What appears to be an extra burden at first actually allows introducing new features and deprecating old ones in a safe manner.
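One way to picture this is two versions of the same endpoint living side by side while callers migrate at their own pace. This is an illustrative sketch with made-up names, not a prescribed pattern from the book:

```python
# Hypothetical versioned API: v1 keeps its old contract while v2 evolves.
USERS = {42: {"name": "Ada", "email": "ada@example.com"}}

def get_user_v1(user_id):
    """Original contract: returns a plain name string."""
    return USERS[user_id]["name"]

def get_user_v2(user_id):
    """New contract: returns a structured record. v1 stays untouched,
    so existing callers keep working during the migration window."""
    user = USERS[user_id]
    return {"id": user_id, "name": user["name"], "email": user["email"]}

legacy = get_user_v1(42)
modern = get_user_v2(42)
```

Once telemetry shows no remaining v1 callers, the old version can be deprecated and deleted – which brings us back to code being a liability.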
Similar to how all code should have a purpose, any part of a distributed system – microservices, binaries, etc. – should be responsible for solving one particular well-defined problem. Those parts are connected by clearly defined interfaces (API, CLI, etc). This, in fact, is the Unix philosophy at work.
Nobody likes to review pull requests that are too long. It’s hard to measure the impact of many code changes at a time, even more so if the changes are unrelated. The same is true for releases. It’s difficult to understand the impact (e.g. on performance) when deploying dozens of unrelated changes to a system at the same time.
Simple releases performed in smaller batches of changes are easier to measure – and to revert, if necessary.
I want to end this article, quite appropriately, with one of my favorite quotes:
Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.
– Kernighan and Plauger in “The Elements of Programming Style”
The more complex a system, the more difficult it is to build a mental model of the system, and the harder it becomes to operate and debug it. As Edsger W. Dijkstra put it, “Simplicity is prerequisite for reliability”. It’s important to put a lot of thought into simplifying our designs while still providing the required functionality. Systems tend to become more complex over time – the earlier you start simplifying as outlined here, the better off you will be.