
Toil.

I’ve stumbled upon this word a few times already, mostly in the context of automation or system administration. I knew that it means something negative, something that should be avoided. But to be honest, it was only through writing this article – a useful technique for closing knowledge gaps, by the way – that I figured out what the word actually means when used in the world of engineering.

Here’s what I’ve learned.

What exactly is toil?

For research, I’ve consulted two of my favorite technical books. The Cloud book mentions the term a couple of times in regard to automation and lists ways to assess and limit the amount of toil. The SRE book, the primary source for this article, even has an entire chapter on toil and how to eliminate it.

The SRE book defines toil as follows:

In SRE, we want to spend time on long-term engineering project work instead of operational work. Because the term operational work may be misinterpreted, we use a specific word: toil. […] Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.

More precisely, the following work can be considered toil:

  • Work such as manually running a script (even if the script itself automates some task)
  • Work that is performed over and over again
  • Work that could just as well be accomplished by a machine (human judgment isn’t essential)
  • Work that is interrupt-driven and reactive, like pager alerts, rather than strategy-driven and proactive
  • Work that does not permanently improve your service once completed
  • Work that scales up linearly with service size, traffic volume, or user count
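
To make the last point concrete, here is a rough sketch of how manual effort grows with the number of runs, and how quickly a one-off automation effort would pay for itself. The function names and all numbers (15 minutes per run, 8 hours to automate) are made up for illustration; they are not taken from the SRE book.

```python
import math


def manual_hours(runs: int, minutes_per_run: float) -> float:
    """Total hours spent when every run is done by hand (grows linearly with runs)."""
    return runs * minutes_per_run / 60


def break_even_runs(automation_hours: float, minutes_per_run: float) -> int:
    """Number of manual runs whose total cost reaches the one-off automation cost."""
    if minutes_per_run <= 0:
        raise ValueError("minutes_per_run must be positive")
    return math.ceil(automation_hours * 60 / minutes_per_run)


# Hypothetical numbers: a task that takes 15 minutes by hand.
print(manual_hours(100, 15))   # doubling the runs ...
print(manual_hours(200, 15))   # ... doubles the hours
print(break_even_runs(8, 15))  # runs needed before 8 hours of automation pay off
```

The sketch ignores maintenance of the automation itself, so the real break-even point is likely somewhat later.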

Why is toil a bad thing?

Toil is not a bad thing per se. There’ll always be some amount of toil – unavoidable grunt work – that you need to take care of as an engineer. You might even look forward to these quick wins from time to time. That’s fine. It becomes a problem, however, when toil makes up the majority of your work.

Here are some of the reasons why too much toil is harmful, again taken from the SRE book:

  • Too much toil leads to burnout, boredom, and discontent.
  • Manual work and firefighting will prevail at the expense of shipping new features.
  • Too much toil creates confusion about what SRE actually entails.
  • Other (development) teams may start expecting SREs to take on even more toil.
  • Current or future teammates are more likely to look for another job, especially if they were promised project work.
  • Your career will stagnate if you spend too little time on long-term engineering projects.

I think the last aspect is particularly important. But what exactly qualifies as engineering work?

Engineering work defined

Engineering work is novel and intrinsically requires human judgment. It produces a permanent improvement in your service, and is guided by a strategy. It is frequently creative and innovative, taking a design-driven approach to solving a problem – the more generalized, the better. Engineering work helps your team or the SRE organization handle a larger service, or more services, with the same level of staffing.

Examples include:

  • Creating automation scripts, tools, or frameworks
  • Adding service features for scalability and reliability
  • Modifying infrastructure code to make it more robust
  • Configuring and tuning production systems (servers, load balancers, etc.)
  • Setting up monitoring and alerting
  • Writing documentation that has long-term value
  • Consulting on architecture, design, and putting systems into production

The 50/50 rule

Without a doubt, knowing what work falls into which category – toil or long-term engineering work – is useful. Another thing you need to know is how to split your valuable time between the two.

Fortunately, the SRE book has a pragmatic answer for this too:

Our SRE organization has an advertised goal of keeping operational work (i.e., toil) below 50% of each SRE’s time. At least 50% of each SRE’s time should be spent on engineering project work that will either reduce future toil or add service features. Feature development typically focuses on improving reliability, performance, or utilization, which often reduces toil as a second-order effect.
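
One way to check yourself against this goal is to log your hours for a week and compute the toil share. The following is a minimal sketch, assuming you tag each logged entry as toil or engineering work yourself. The `Entry` type, the `toil_share` function, and the sample week are my own invention, not something the SRE book prescribes.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    description: str
    hours: float
    is_toil: bool  # tagged by hand, based on the criteria for toil


def toil_share(entries: list[Entry]) -> float:
    """Fraction of logged hours that were spent on toil (0.0 if nothing was logged)."""
    total = sum(e.hours for e in entries)
    if total == 0:
        return 0.0
    return sum(e.hours for e in entries if e.is_toil) / total


# Hypothetical week: 18 hours of toil out of 40 logged hours.
week = [
    Entry("handle pager alerts", 6, True),
    Entry("run deployment script by hand", 8, True),
    Entry("restart stuck jobs", 4, True),
    Entry("write automation for deployments", 22, False),
]

share = toil_share(week)
print(f"toil share: {share:.0%}")
print("within the 50% goal" if share < 0.5 else "too much toil")
```

A single week is noisy, so averaging over a month or a quarter probably gives a fairer picture.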

From now on, I’m going to strive to spend, on average, at least 50% of my time on engineering work. Furthermore, it’s time for me to stop using the phrase “monkey work” because, as we learned today, there is a much better, more empathetic word.

Photo credits: Flickr