Mathias Lafeldt – https://sharpend.io/

Giving Rust Another Shot in 2020 (2020-07-23) – https://sharpend.io/giving-rust-another-shot-in-2020

<p>These days I turn to <a href="/recreational-programming-with-serverless/">recreational programming</a> more often than usual. Solving fun problems makes me feel less confused and helpless about the pandemic. It’s a coping mechanism and a good one at that.</p>
<p>One thing has been an especially surprising source of joy for me: <em>Rust</em>. Surprising because I’ve been known to have an on-off relationship with the programming language since its 1.0 release. I quit Rust more than once – both professionally and personally – mainly out of frustration with a subpar development experience and general lack of libraries.</p>
<p>Looking back, I’m thrilled that I gave Rust another chance this year.</p>
<h2 id="that-was-then">That was then</h2>
<p>Back in 2017, <a href="/every-day-we-must-sweep/">I wrote</a> that “Rust lets you control almost everything, but as a beginner, it’s also hard to compile <em>anything</em>. As a result, Rust continues to make me look bad.” I was disappointed. Again.</p>
<p>At the time, I was working for a chaos engineering startup that chose Rust for simulating common server problems such as high CPU load, increased network latency, or random application crashes. A reasonable choice if your goal is to inject low-level faults in a predictable way, right? After all, Rust is said to be <a href="https://thenewstack.io/microsoft-rust-is-the-industrys-best-chance-at-safe-systems-programming/">our best chance at safe systems programming</a>. That may very well be true.</p>
<p>However, back then, I didn’t think Rust was the best option for the client side of the product. The main reason was productivity – or rather the lack thereof.</p>
<p>In 2017, my day-to-day Rust experience boiled down to the following:</p>
<ul>
<li>Writing code in Vim with bare-bones syntax highlighting</li>
<li>No code completion and no “go to definition” (the Rust Language Server was still in its infancy)</li>
<li>Effectively no code formatting, since <code class="language-plaintext highlighter-rouge">rustfmt</code> just gave up whenever a line was too long (seriously)</li>
<li>Docs were hard to find or non-existent</li>
<li>The same goes for high-quality code examples</li>
<li>No official Rust Docker images, homegrown CI pipelines only</li>
<li>Compile times were noticeably slow, resulting in long <a href="/fast-feedback-is-everything/">feedback loops</a></li>
<li>Speaking of feedback, Rust’s borrow checker made it difficult to understand why it rejected a piece of code and how to fix it</li>
</ul>
<p>All these drawbacks add up, but what really bothered me was the immature crate ecosystem. To give you one example, when we implemented the ability to inject faults into containers, we couldn’t find any usable Rust client library to talk to the Docker API, let alone Kubernetes. As a result, we were forced to reinvent the wheel several times (and sometimes ended up with a flat tire, in typical startup fashion).</p>
<p>In the face of these obstacles and despite my appreciation for its language design, I often found it challenging to get real work done with Rust.</p>
<p>At least, that was then.</p>
<p><img src="/assets/images/rust-gegenueber.jpg" alt="" /></p>
<h2 id="this-is-now">This is now</h2>
<p>Today, three years later, I’m happy to report that <a href="https://blog.rust-lang.org/2020/05/15/five-years-of-rust.html">Rust has changed a lot</a>, and for the better! That became apparent to me when I recently had some extra time between freelance contracts and decided to <a href="/7-things-i-learned-from-porting-a-c-crypto-library-to-rust/">port a crypto library from C to Rust</a> (because that’s what I do to get out of the cloud consulting bubble once in a while).</p>
<p>I realized that the release of the <a href="https://blog.rust-lang.org/2018/12/06/Rust-1.31-and-rust-2018.html">2018 edition</a> is particularly noteworthy. It made the language much more productive in several ways: simpler syntax, a smarter borrow-checker, an improved module system, enhanced tooling (both <code class="language-plaintext highlighter-rouge">rustfmt</code> and Clippy, Rust’s linter, hitting 1.0), and a host of other features. Add to that IDE support, incremental compilation, excellent documentation, and immensely helpful error messages, and what you now get is nothing short of remarkable.</p>
<p>Thanks to Rust’s soaring popularity – it has been <a href="https://stackoverflow.blog/2020/01/20/what-is-rust-and-why-is-it-so-popular/">Stack Overflow’s most loved language</a> for four years in a row – finding a particular crate or tool has become easier (though still not necessarily easy). There are more and more <a href="https://github.com/rust-unofficial/awesome-rust">open source projects</a> to explore. There’s a growing list of third-party cargo subcommands (<a href="https://github.com/rust-secure-code/cargo-geiger">cargo-geiger</a> and <a href="https://github.com/gnzlbg/cargo-asm">cargo-asm</a> being two of my favorites). There’s even <a href="https://github.com/google/evcxr">a decent REPL</a>. CI/CD <a href="https://actions-rs.github.io/">is a breeze</a>, and so is <a href="https://github.com/rust-embedded/cross">cross compilation</a> to dozens of platforms. In short, there’s a lot to love here.</p>
<p>Given these advances, it’s no wonder that learning Rust is also quite different now. The tonari blog <a href="https://blog.tonari.no/why-we-love-rust">captured it well</a>:</p>
<blockquote>
<p>A year after switching over to Rust, we onboarded our fourth engineer to the team, who didn’t have much prior experience in either Rust or systems engineering. While the learning curve is undeniable (borrow checker, my old friend), we’ve found that Rust is incredibly empowering for those new to lower-level programming.</p>
</blockquote>
<p><em>Empowering</em> is the perfect word to describe Rust in 2020. What used to be a rough adventure with many pitfalls has turned into something beautiful, something that can lift your spirit. At least, that’s what it did for me.</p>
<p>I encourage you to immerse yourself in a little Rust project today. You might be in for a pleasant surprise.</p>
<p>Update: <a href="https://www.reddit.com/r/rust/comments/hwdaxo/giving_rust_another_shot_in_2020/">Reddit thread</a></p>
7 Things I Learned From Porting a C Crypto Library to Rust (2020-07-01) – https://sharpend.io/7-things-i-learned-from-porting-a-c-crypto-library-to-rust

<p>Rust has always been the programming language that reminds me the most of <a href="/what-i-learned-from-hacking-video-games/">my game hacking days</a>, and for good reasons. Rust is a natural fit for embedded systems like video game consoles – or rather <a href="https://gitlab.com/flio/rustation-ng">emulators</a> thereof. The compiler supports <a href="https://forge.rust-lang.org/release/platform-support.html">a high number of platforms</a> and lets you drop down to C or assembly if necessary. Plus, the language lends itself well to implementing (reverse-engineered) cryptographic algorithms and other low-level shenanigans.</p>
<p>Given these benefits, it’s no surprise that I keep coming back to Rust. This time, I decided to revisit <a href="https://github.com/mlafeldt/cb2util/pull/13">a pull request</a> from five years ago, which has been lingering in my mind ever since. The ambitious goal of the pull request is to port cb2util, one of my old crypto tools for PlayStation 2, from C to pure Rust.</p>
<p>Among other things, cb2util allows you to decrypt all cheat codes for the notorious <a href="https://en.wikipedia.org/wiki/Code_Breaker">CodeBreaker</a> cheat device, making it possible to scrutinize them in their raw form and use them with other devices.</p>
<p>To give you an idea, the following code might make your character invincible by constantly writing the byte value 99 (0x63) to memory address 0x0096F5B8:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Infinite health
0096F5B8 00000063
</code></pre></div></div>
<p>The very same code but encrypted:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Infinite health
81E1C95B 9764DA87
</code></pre></div></div>
<p>Back in 2006, it took me weeks to reverse-engineer the proprietary encryption scheme from MIPS R5900 assembly and <a href="https://github.com/mlafeldt/cb2util/blob/v1.9/cb2_crypto.c">convert everything to C</a> in a piecemeal fashion. Naturally, I learned a ton in the process. Trying to port those same crypto routines from C to Rust now, 14 years later, sounded like another promising learning opportunity.</p>
<p>Luckily, I wasn’t disappointed. In fact, I went so far as to extract and publish everything that I did as a Rust crate. It’s called, well, <a href="https://github.com/mlafeldt/codebreaker-rs">codebreaker</a>.</p>
<p>Here are seven things I picked up while working on this little side project – some of them specific to cryptography, others more general in nature.</p>
<h2 id="soft-migration-via-ffi">Soft migration via FFI</h2>
<p>Rust provides a <a href="https://blog.rust-lang.org/2015/04/24/Rust-Once-Run-Everywhere.html">foreign function interface</a> (FFI) to talk to other programming languages with ease. This interface allowed me to <a href="https://github.com/mlafeldt/cb2util/pull/13/commits/1618f7e6a120c1fb8d49f1dd5d0ebc39861655ff">gradually port</a> one C function at a time until there was no foreign function call left. Since the crypto code involves a state machine, I also had to <a href="https://github.com/mlafeldt/cb2util/pull/13/commits/03f63639f793d800720bc69fd43ea3a182b9ffc8">temporarily export</a> a few global C variables in order to access them from Rust. A large set of existing <a href="https://github.com/mlafeldt/cb2util/tree/v1.9/test">integration tests</a> ensured that I didn’t break anything along the way, which was very helpful (thanks, past me!).</p>
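<p>To illustrate the idea, here’s a minimal sketch of such a bridge (the function name and logic are made up for illustration, not taken from cb2util): a routine already ported to Rust is exported with the C ABI so the remaining C code can keep calling it until the migration is complete.</p>

```rust
// Hypothetical example: a per-byte nibble swap exported with the C ABI.
// `#[no_mangle]` keeps the symbol name stable so the C linker can find it.
#[no_mangle]
pub extern "C" fn swap_nibbles(value: u32) -> u32 {
    ((value << 4) & 0xf0f0_f0f0) | ((value >> 4) & 0x0f0f_0f0f)
}

fn main() {
    // The same function stays callable from Rust while C callers migrate.
    assert_eq!(swap_nibbles(0x0000_00ab), 0x0000_00ba);
}
```

<p>On the C side, the function only needs a matching prototype (<code class="language-plaintext highlighter-rouge">uint32_t swap_nibbles(uint32_t);</code>), and the linker wires the two languages together.</p>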
<h2 id="wrapping-operations">Wrapping operations</h2>
<p>In C, unsigned integer arithmetic is defined to be modulo a power of two, which is almost always what you want when it comes to cryptography. To achieve the same in Rust without running into any overflow errors, you have to use dedicated wrapping operations for modular addition (<code class="language-plaintext highlighter-rouge">wrapping_add</code>), subtraction (<code class="language-plaintext highlighter-rouge">wrapping_sub</code>), multiplication (<code class="language-plaintext highlighter-rouge">wrapping_mul</code>), etc.</p>
<p>Here’s an example for <a href="https://doc.rust-lang.org/std/primitive.u32.html">32-bit unsigned integers</a>:</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">assert_eq!</span><span class="p">(</span><span class="mi">200u32</span><span class="nf">.wrapping_add</span><span class="p">(</span><span class="mi">55</span><span class="p">),</span> <span class="mi">255</span><span class="p">);</span>
<span class="nd">assert_eq!</span><span class="p">(</span><span class="mi">200u32</span><span class="nf">.wrapping_add</span><span class="p">(</span><span class="nn">u32</span><span class="p">::</span><span class="n">MAX</span><span class="p">),</span> <span class="mi">199</span><span class="p">);</span>
</code></pre></div></div>
<h2 id="transmuting-types">Transmuting types</h2>
<p>Due to the nature of the encryption scheme, it’s sometimes necessary to read or write the same data in two different ways. For example, I defined an <a href="https://github.com/mlafeldt/codebreaker-rs/blob/0.2.1/src/rc4.rs">RC4</a> key as a fixed-size array <code class="language-plaintext highlighter-rouge">[u32; 5]</code>, which is convenient most of the time. That is, until the key itself needs to be encrypted as a slice of type <code class="language-plaintext highlighter-rouge">&[u8]</code>.</p>
<p>Doing that in C via pointer casting is easy enough, but it took me a while to figure out how to implement it in Rust. I found one (albeit unsafe) solution in the <a href="https://github.com/BurntSushi/byteorder/blob/c0143a7a07a900d186dde205e08134fedfa5a9e3/src/io.rs#L1589-L1599">byteorder</a> crate for transmuting types – reinterpreting the bits of a value of one type as another type:</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">unsafe</span> <span class="k">fn</span> <span class="n">slice_to_u8_mut</span><span class="o"><</span><span class="n">T</span><span class="p">:</span> <span class="nb">Copy</span><span class="o">></span><span class="p">(</span><span class="n">slice</span><span class="p">:</span> <span class="o">&</span><span class="k">mut</span> <span class="p">[</span><span class="n">T</span><span class="p">])</span> <span class="k">-></span> <span class="o">&</span><span class="k">mut</span> <span class="p">[</span><span class="nb">u8</span><span class="p">]</span> <span class="p">{</span>
<span class="k">use</span> <span class="nn">std</span><span class="p">::</span><span class="nn">mem</span><span class="p">::</span><span class="n">size_of</span><span class="p">;</span>
<span class="k">use</span> <span class="nn">std</span><span class="p">::</span><span class="n">slice</span><span class="p">;</span>
<span class="k">let</span> <span class="n">len</span> <span class="o">=</span> <span class="nn">size_of</span><span class="p">::</span><span class="o"><</span><span class="n">T</span><span class="o">></span><span class="p">()</span> <span class="o">*</span> <span class="n">slice</span><span class="nf">.len</span><span class="p">();</span>
<span class="nn">slice</span><span class="p">::</span><span class="nf">from_raw_parts_mut</span><span class="p">(</span><span class="n">slice</span><span class="nf">.as_mut_ptr</span><span class="p">()</span> <span class="k">as</span> <span class="o">*</span><span class="k">mut</span> <span class="nb">u8</span><span class="p">,</span> <span class="n">len</span><span class="p">)</span>
<span class="p">}</span>
</code></pre></div></div>
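<p>For completeness, here’s how the helper might be used on an RC4-style key (a self-contained sketch with made-up values; note that the resulting byte order is the platform’s native endianness):</p>

```rust
use std::slice;

// The helper from above, reproduced so the example is self-contained.
unsafe fn slice_to_u8_mut<T: Copy>(slice: &mut [T]) -> &mut [u8] {
    use std::mem::size_of;
    let len = size_of::<T>() * slice.len();
    slice::from_raw_parts_mut(slice.as_mut_ptr() as *mut u8, len)
}

fn main() {
    let mut key: [u32; 5] = [0x1122_3344, 0, 0, 0, 0];
    let bytes = unsafe { slice_to_u8_mut(&mut key) };
    // 5 x u32 = 20 bytes; the first word appears in native byte order.
    assert_eq!(bytes.len(), 20);
    assert_eq!(&bytes[..4], &0x1122_3344u32.to_ne_bytes());
}
```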
<p>Update: Thanks to Reddit user DannoHung, I switched to using <a href="https://crates.io/crates/bytemuck">bytemuck</a> for transmuting. See <a href="https://github.com/mlafeldt/codebreaker-rs/pull/2">this pull request</a> for details.</p>
<h2 id="rsa-with-num-bigint">RSA with num-bigint</h2>
<p>In addition to RC4, I needed a simple RSA implementation for small 64-bit keys (the size of one cheat code). I used to rely on <a href="https://github.com/mlafeldt/libbig_int">libbig_int</a>, a portable C library to calculate integers of arbitrary length. Now I needed something similar for Rust. That something turned out to be the superb <a href="https://crates.io/crates/num-bigint">num-bigint</a> crate.</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">fn</span> <span class="nf">rsa_crypt</span><span class="p">(</span><span class="n">addr</span><span class="p">:</span> <span class="o">&</span><span class="k">mut</span> <span class="nb">u32</span><span class="p">,</span> <span class="n">val</span><span class="p">:</span> <span class="o">&</span><span class="k">mut</span> <span class="nb">u32</span><span class="p">,</span> <span class="n">rsakey</span><span class="p">:</span> <span class="nb">u64</span><span class="p">,</span> <span class="n">modulus</span><span class="p">:</span> <span class="nb">u64</span><span class="p">)</span> <span class="p">{</span>
<span class="k">let</span> <span class="n">code</span> <span class="o">=</span> <span class="nn">BigUint</span><span class="p">::</span><span class="nf">from_slice</span><span class="p">(</span><span class="o">&</span><span class="p">[</span><span class="o">*</span><span class="n">val</span><span class="p">,</span> <span class="o">*</span><span class="n">addr</span><span class="p">]);</span>
<span class="k">let</span> <span class="n">m</span> <span class="o">=</span> <span class="nn">BigUint</span><span class="p">::</span><span class="nf">from</span><span class="p">(</span><span class="n">modulus</span><span class="p">);</span>
<span class="k">if</span> <span class="n">code</span> <span class="o"><</span> <span class="n">m</span> <span class="p">{</span>
<span class="k">let</span> <span class="n">digits</span> <span class="o">=</span> <span class="n">code</span><span class="nf">.modpow</span><span class="p">(</span><span class="o">&</span><span class="nn">BigUint</span><span class="p">::</span><span class="nf">from</span><span class="p">(</span><span class="n">rsakey</span><span class="p">),</span> <span class="o">&</span><span class="n">m</span><span class="p">)</span><span class="nf">.to_u32_digits</span><span class="p">();</span>
<span class="o">*</span><span class="n">addr</span> <span class="o">=</span> <span class="n">digits</span><span class="p">[</span><span class="mi">1</span><span class="p">];</span>
<span class="o">*</span><span class="n">val</span> <span class="o">=</span> <span class="n">digits</span><span class="p">[</span><span class="mi">0</span><span class="p">];</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<h2 id="no-stdlib">No stdlib</h2>
<p>Speaking of num-bigint, I studied its sources intensively and <a href="https://github.com/mlafeldt/codebreaker-rs/pull/1">learned how to write code</a> that can also be used for embedded environments such as game consoles (another CodeBreaker clone, anyone?). Said code must <em>not</em> depend on Rust’s standard library. This exercise showed me how straightforward conditional compilation and optional dependencies are in Rust.</p>
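<p>The common pattern is a Cargo feature that gates the standard library – a sketch of the usual setup, assuming a feature named <code class="language-plaintext highlighter-rouge">std</code> as in num-bigint (the exact names in codebreaker-rs may differ):</p>

```
# Cargo.toml: std is opt-in, but enabled by default
[features]
default = ["std"]
std = []
```

<p>The crate root then carries <code class="language-plaintext highlighter-rouge">#![cfg_attr(not(feature = "std"), no_std)]</code>, and embedded users simply depend on the crate with <code class="language-plaintext highlighter-rouge">default-features = false</code>.</p>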
<h2 id="mod_inverse-using-newtons-method">mod_inverse using Newton’s method</h2>
<p>As the saying goes, “make it work, make it right, make it fast”. After porting all crypto routines to Rust, I did some local optimizations. One of them involved replacing a messy reverse-engineered function with something much more mathematically sound. As luck would have it, I stumbled upon <a href="https://lemire.me/blog/2017/09/18/computing-the-inverse-of-odd-integers/">a blog post by Daniel Lemire</a> that describes an elegant method for computing the multiplicative inverse of odd integers. I proudly present <a href="https://github.com/mlafeldt/codebreaker-rs/commit/6431adcb051861336d8a49c421d74b56583c3729">the diff</a>:</p>
<p><img src="/assets/images/rust-mod-inverse.png" alt="" /></p>
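<p>The gist of the method: for an odd integer, a Newton iteration doubles the number of correct low bits per step, so three steps after a 5-bit initial guess cover all 32 bits. A sketch (identifier names are mine and may differ from the actual commit):</p>

```rust
// Multiplicative inverse of an odd u32 modulo 2^32 via Newton's method.
// The initial guess (3 * n) ^ 2 is correct in the low 5 bits; each
// iteration x = x * (2 - n * x) doubles that: 5 -> 10 -> 20 -> 40 >= 32.
fn mod_inverse(n: u32) -> u32 {
    debug_assert!(n % 2 == 1, "only odd numbers are invertible mod 2^32");
    let mut x = n.wrapping_mul(3) ^ 2;
    for _ in 0..3 {
        x = x.wrapping_mul(2u32.wrapping_sub(n.wrapping_mul(x)));
    }
    x
}

fn main() {
    for &n in &[1u32, 3, 0x1234_5679, u32::MAX] {
        assert_eq!(mod_inverse(n).wrapping_mul(n), 1);
    }
}
```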
<h2 id="automated-testing-with-actions-rs">Automated testing with actions-rs</h2>
<p>Being the diligent engineer that I am, I devoted some time to unit tests and doctests (executable examples in the documentation). I also became friends with <a href="https://github.com/rust-lang/rust-clippy">Clippy</a>, Rust’s amazing linter, which helped me write code that’s more correct, more readable, and more idiomatic. Finally, I brought everything together in a handy <a href="https://github.com/mlafeldt/codebreaker-rs/blob/0.2.1/.github/workflows/rust.yaml">CI pipeline</a> using <a href="https://actions-rs.github.io/">actions-rs</a>, the GitHub Actions toolkit for Rust.</p>
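<p>For reference, a minimal workflow along those lines might look like this (a sketch only; the actual pipeline in the repository is more elaborate):</p>

```
# .github/workflows/rust.yaml (sketch)
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: actions-rs/toolchain@v1
        with:
          toolchain: stable
          components: clippy
      - uses: actions-rs/cargo@v1
        with:
          command: clippy
          args: -- -D warnings
      - uses: actions-rs/cargo@v1
        with:
          command: test
```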
<hr />
<p>All in all, I can confidently say that the Rust port is better than the original in many ways. <a href="https://github.com/mlafeldt/codebreaker-rs">But see for yourself</a>.</p>
<p>If you have any additions or questions, feel free to ping me on Twitter (<a href="https://twitter.com/mlafeldt">@mlafeldt</a>). I’m always eager to learn something new! ✌️</p>
<p>Update: <a href="https://www.reddit.com/r/rust/comments/hj9pvu/7_things_i_learned_from_porting_a_c_crypto/">Really good Reddit thread</a></p>
Recreational Programming with Serverless (2020-06-08) – https://sharpend.io/recreational-programming-with-serverless

<p>This article is a slightly edited version of the presentation I gave at the <a href="https://www.aws-community-day.de/">AWS Community Day 2019</a> in Hamburg, Germany. You can find the original slides <a href="https://speakerdeck.com/mlafeldt/recreational-programming-with-serverless">here</a>. There’s no recording available, so you have to take my word that the following actually happened.</p>
<p>I know it’s been nine months since the event took place, but I still find myself thinking about the importance of recreational programming, especially in difficult times like this. That’s why I decided to share the full content here and now. Besides, following the “show, don’t tell” rule, I’ve recently <a href="https://github.com/mlafeldt/dilbert-feed">open sourced the serverless project</a> that is at the heart of the talk.</p>
<p>One more note before we get started: If I appear overly critical of Kubernetes at times, it’s because I tell the story from the perspective of someone who’s <em>built and operated</em> container systems in different cloud environments since 2015. Mere users of Kubernetes will undoubtedly appreciate it more than I do. <a href="https://ferd.ca/complexity-has-to-live-somewhere.html">Complexity has to live somewhere</a>, after all.</p>
<p>In any case, enjoy the presentation.</p>
<hr />
<p><img src="/assets/images/recreational-programming/01.png" alt="" /></p>
<p>Hello, and welcome to my lightning talk on recreational programming with serverless!</p>
<p><img src="/assets/images/recreational-programming/02.png" alt="" /></p>
<p>I want to start with a little introduction. My name is Mathias Lafeldt. I’m from a small town north of Hamburg, so I’m a local if you will.</p>
<p>I’ve been using AWS since 2013. These days, I help companies embrace the cloud as a freelancer – <a href="/freelancer-by-choice/">a very happy freelancer</a>, to be honest.</p>
<p><img src="/assets/images/recreational-programming/03.png" alt="" /></p>
<p>Most of my current consulting work focuses on Kubernetes and AWS. I help teams set up AWS accounts in a sustainable way, do network planning, pick the right cluster setup for them, create CI/CD pipelines, implement monitoring, etc. – basically, all those things that are required to make any use of Kubernetes on AWS.</p>
<p>As a freelancer, I see a high demand for migrating software to the cloud. A lot of organizations choose Kubernetes for modernizing their legacy applications. I understand the reasoning behind this decision:</p>
<ul>
<li>It’s relatively easy to package your code and move it (almost as-is) to the cloud in the form of containers.</li>
<li>Kubernetes streamlines the development process by automating the deployment and scaling of containerized applications.</li>
</ul>
<p>I think it’s true that Kubernetes <em>can</em> provide value. However…</p>
<p><img src="/assets/images/recreational-programming/04.png" alt="" /></p>
<p>It’s also true that Kubernetes is a complex beast with many moving parts, and therefore many different failure modes. To make matters worse, it’s growing by the minute as its ecosystem is constantly evolving.</p>
<p>You may know <a href="https://k8s.af/">this website</a>. It was created by Henning Jacobs from Zalando. He built this collection of Kubernetes failure stories for people to learn from each other and not repeat the same mistakes again and again.</p>
<p>For me, this site’s sheer existence is a testament to the fact that there’s <strong>a lot</strong> to deal with when you want to be successful with Kubernetes in production.</p>
<p><img src="/assets/images/recreational-programming/05.png" alt="" /></p>
<p>I like to joke that (self-hosted) Kubernetes clusters are the perfect job creation measure. They’re also well suited if you want to get good at <a href="/writing-your-first-postmortem/">writing postmortems</a>. I know it first-hand, as I had to write a couple of those myself…</p>
<p>Now you may ask, what about EKS, Amazon’s managed Kubernetes service? [The talk right before mine was about EKS.]</p>
<p>From my experience, EKS offers significant advantages over tools like <code class="language-plaintext highlighter-rouge">kops</code>. For one, it will give you a managed control plane that takes care of running the API servers and <code class="language-plaintext highlighter-rouge">etcd</code> for you. However, you still need to bring your own EC2 worker nodes, which means you’re still responsible – at least to a degree – for high availability, AMI updates, EBS volumes, backups, VPC design, load balancers, and much more.</p>
<p>[Note: AWS has released <a href="https://aws.amazon.com/blogs/containers/eks-managed-node-groups/">EKS managed node groups</a> in the meantime, which certainly reduces the total cost of ownership (TCO) of Kubernetes clusters, especially when it comes to updating worker nodes. The <em>overall TCO</em> nevertheless continues to represent a significant administrative burden on the shoulders of operators. <a href="https://github.com/kubernetes/community/blob/ddc11a75837b29498f120529ca57145f4a242cbc/wg-security-audit/findings/Kubernetes%20Final%20Report.pdf">Configuration of Kubernetes remains a non-trivial task</a>.]</p>
<p><a href="https://unsplash.com/photos/ZMcLVBi9xx4"><img src="/assets/images/recreational-programming/06.png" alt="" /></a></p>
<p>While Kubernetes might be perfect for you or your company (understandably so given its advantages), I personally find working with it for extended periods exhausting. As a cluster operator, it’s stressful to be on the hook for so many things that can go wrong – and will go wrong, as demonstrated by the failure stories website mentioned earlier.</p>
<p>Ultimately, it should be about the <em>applications</em>, right? We don’t have a Kubernetes cluster just for the sake of having a Kubernetes cluster. Its whole purpose is to make it easy for us to deploy, scale, and manage our containerized applications.</p>
<p>Sadly, despite considerable efforts, we still get sucked into a never-ending cycle of operational tasks, tasks we never signed up for. The result for me has been mental fatigue and, at times, lack of motivation to deal with such systems.</p>
<p><img src="/assets/images/recreational-programming/07.png" alt="" /></p>
<p>At some point, I discovered a <a href="https://corecursive.com/025-burn-out-and-recreational-programming/">podcast episode</a> with Jamis Buck. Jamis is a famous programmer. He created Capistrano and plenty more open source projects. On the show, he talks about how he used to be on top of the world about ten years ago. He worked for 37signals (now called Basecamp), earned a big paycheck, everything seemed perfect. There was only one problem: he was burnt out.</p>
<p>As a consequence, he had to let go of most of his side projects. Unfortunately, that didn’t help much. To overcome burnout, he left his fantastic job and decided to write a book on one of his passions: generating mazes. Mazes helped Jamis remember what got him excited about programming initially. Mazes are his form of <strong>recreational programming</strong>.</p>
<p>I was wondering: what would my kind of recreational programming be? Maybe I had already found it but didn’t know it yet.</p>
<p><img src="/assets/images/recreational-programming/08.png" alt="" /></p>
<p>Werner Vogels recently published a blog post titled <a href="https://www.allthingsdistributed.com/2019/08/modern-applications-at-aws.html">“Modern applications at AWS”</a>. In it, he describes how Amazon has been successful for 20 years by going through a series of radical transformations, always questioning how they build applications and how they organize the company.</p>
<p><img src="/assets/images/recreational-programming/09.png" alt="" /></p>
<p>According to Werner, organizations must adopt five elements to increase agility and innovation speed:</p>
<ol>
<li>Embrace microservices to decouple systems and enable autonomous, cross-functional teams.</li>
<li>Use purpose-built databases for each microservice rather than a single database for all microservices, which can’t meet specific needs and is a single point of failure.</li>
<li>Enable teams to release changes independently, for example, by providing best-practice infrastructure-as-code templates.</li>
<li>The same goes for security as “in modern applications, security features are built into every component of the application and automatically tested and deployed with each release.”</li>
<li><strong>Be as serverless as possible and offload undifferentiated tasks to AWS services such as Lambda.</strong></li>
</ol>
<p><img src="/assets/images/recreational-programming/10.png" alt="" /></p>
<p>Even Amazon is not completely serverless yet, but they’re getting there. Werner believes that thanks to serverless, there will soon be a whole generation of developers who have never touched a server and only write business logic.</p>
<p>That sounds like a bright future to me. ☀️</p>
<p><img src="/assets/images/recreational-programming/11.png" alt="" /></p>
<p>Of course, there’s more to serverless than programming with Lambda. For starters, it’s not just Lambda but the entire application stack, including services like DynamoDB, S3, SNS, API Gateway, etc. More generally speaking:</p>
<ul>
<li>Serverless services are managed services that run without the need for infrastructure provisioning and scaling.</li>
<li>They provide built-in availability and security. No need to care about availability zones or kernel patches.</li>
<li>You only pay for what you use. You don’t pay for idle resources.</li>
<li>Serverless allows you to focus on business logic – your “secret sauce”, the things that set you apart from your competition.</li>
<li>With serverless, you can create value for customers faster.</li>
</ul>
<p>Looking at this list made me pause for a second. Those are indeed <em>excellent reasons</em> for startups and enterprises to go serverless.</p>
<p><img src="/assets/images/recreational-programming/12.png" alt="" /></p>
<p>More to the point, I realized that a lot of properties that make serverless great for businesses – no servers to manage, easy deployments, pay-as-you-go with <a href="https://aws.amazon.com/free/">a generous free tier</a> – also make it a great fit for recreational programming!</p>
<p>One advantage stands out in particular to me: the ability to concentrate on my applications – the serious ones and the not-so-serious ones.</p>
<p><img src="/assets/images/recreational-programming/13.png" alt="" /></p>
<p>Next, I want to show you a few serverless projects I created for fun, projects I consider recreational in some sense.</p>
<p>I’m a huge fan of <a href="https://dilbert.com/">Dilbert</a> and a couple of other comic strips. I also love reading articles and comics using my favorite RSS feed reader: Feedly on iOS.</p>
<p><img src="/assets/images/recreational-programming/14.png" alt="" /></p>
<p>The problem with Dilbert is that although there’s an official RSS feed, it no longer includes the comic strips themselves but only links to dilbert.com. How mean and inconvenient!</p>
<p>In 2017, I started looking at this obstacle as an opportunity for an interesting serverless application – bluntly named <a href="https://github.com/mlafeldt/dilbert-feed">dilbert-feed</a> – to create my own feed that I can enjoy in Feedly again.</p>
<p>Here’s how I went about it.</p>
<p><img src="/assets/images/recreational-programming/15.png" alt="" /></p>
<p>First of all, I had to figure out how to get the images. Fortunately, Dilbert is a daily strip and the URLs are predictable, e.g., <code class="language-plaintext highlighter-rouge">dilbert.com/strip/2019-08-30</code>. That means I only had to download the web page for a given day and parse the HTML to get a link to the image.</p>
<p>I decided to use Go – my favorite programming language – for the job. I was lucky to find the superb <a href="https://github.com/PuerkitoBio/goquery">goquery</a> package, which does most of the work for me, as you can see on the slide above.</p>
<p><img src="/assets/images/recreational-programming/16.png" alt="" /></p>
<p>Now that I knew how to get the images, I used the <a href="https://www.serverless.com/">Serverless Framework</a> to turn that code into a Lambda function called <code class="language-plaintext highlighter-rouge">get-strip</code>, deployed somewhere it can be invoked once a day. After determining the image URL, the function copies the image it found to an S3 bucket via the always handy <a href="https://github.com/aws/aws-sdk-go">AWS SDK for Go</a>.</p>
<p>To separate concerns, I wrote another Lambda named <code class="language-plaintext highlighter-rouge">gen-feed</code> that generates the RSS feed for the last 30 days (for this, all it needs to know is the location of the images uploaded by <code class="language-plaintext highlighter-rouge">get-strip</code>).</p>
<p><img src="/assets/images/recreational-programming/17.png" alt="" /></p>
<p>For the feed to be consistent (and to make the task a bit more challenging), the two functions should run in sequence. But instead of staggering the Lambdas via two cron jobs, I chose to give <a href="https://aws.amazon.com/step-functions/">AWS Step Functions</a> a spin.</p>
<p>What you see on this slide must be one of the simplest state machines imaginable. While Step Functions is much more powerful, bare-bones orchestration was all I needed to get started.</p>
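<p>In Amazon States Language, a two-step machine like this might look roughly as follows – a sketch with placeholder ARNs, not the project's actual definition:</p>

```json
{
  "Comment": "Fetch today's strip, then regenerate the feed",
  "StartAt": "GetStrip",
  "States": {
    "GetStrip": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:eu-west-1:123456789012:function:get-strip",
      "Next": "GenFeed"
    },
    "GenFeed": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:eu-west-1:123456789012:function:gen-feed",
      "End": true
    }
  }
}
```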
<p><img src="/assets/images/recreational-programming/18.png" alt="" /></p>
<p>To complete the picture, I created a CloudWatch Events rule for triggering the state machine to update the feed with the latest Dilbert strip every morning. The architecture diagram shows all involved components running inside my personal AWS account (for which I haven’t had to pay a cent so far, by the way).</p>
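<p>A rule like that might be expressed in CloudFormation roughly as follows. This is a hypothetical fragment with invented names and an assumed schedule time; in practice the wiring is generated by the deployment tooling:</p>

```yaml
DailyTrigger:
  Type: AWS::Events::Rule
  Properties:
    Description: Update the Dilbert feed every morning
    ScheduleExpression: cron(0 6 * * ? *)  # 06:00 UTC daily (time is an assumption)
    Targets:
      - Id: dilbert-feed-state-machine
        Arn: !Ref StateMachine        # Ref on a state machine returns its ARN
        RoleArn: !GetAtt EventsRole.Arn  # role allowing events.amazonaws.com to start executions
```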
<p>You’re probably right to assume that dilbert-feed is over-engineered to some degree, and deliberately so. Remember that it’s just a fun side project, a playground where I can do whatever I want and try out new tools and practices whenever I feel like it.</p>
<p>Among other things, I used the project to explore different monitoring/observability solutions for serverless. In the end, I settled for a simple <code class="language-plaintext highlighter-rouge">heartbeat</code> Lambda that pings <a href="https://healthchecks.io/">Healthchecks.io</a> over HTTP as the last state machine step.</p>
<p><img src="/assets/images/recreational-programming/19.png" alt="" /></p>
<p>On Healthchecks.io, I’ve configured a check that sends me an email notification as soon as a ping doesn’t arrive on time. It can’t get much easier than that.</p>
<p>In true <em>deploy-once-and-leave-the-rest-to-serverless</em> manner, the setup has been humming along nicely for over two years without significant problems other than dilbert.com being down for maintenance. 💪</p>
<p><img src="/assets/images/recreational-programming/20.png" alt="" /></p>
<p>As you can see from this list, I learned a great deal from hacking on dilbert-feed on nights and weekends. In fact, it continues to benefit me as a playground and template for other side and freelance projects to this day.</p>
<p>[Shortly after giving the presentation in 2019, I started to embrace the wonderful <a href="https://aws.amazon.com/cdk/">AWS CDK</a>. Again, all these experiments <a href="https://github.com/mlafeldt/dilbert-feed">are open source</a>.]</p>
<p><img src="/assets/images/recreational-programming/21.png" alt="" /></p>
<p>Before we wrap up, I want to briefly mention two more serverless projects I made for fun.</p>
<p>The first is a <a href="https://github.com/mlafeldt/launchdarkly-dynamo-store">DynamoDB Store for LaunchDarkly</a>, which provides the building blocks that, taken together, allow you to create a serverless flag storage pipeline. For more information, check out my presentation on <a href="https://speakerdeck.com/mlafeldt/implementing-feature-flags-in-serverless-environments">Implementing Feature Flags in Serverless Environments</a>.</p>
<p>Needless to say, this was a good opportunity for diving into DynamoDB.</p>
<p><img src="/assets/images/recreational-programming/22.png" alt="" /></p>
<p>Last but not least, I’ve been tinkering with a serverless version of Chaos Monkey. It’s still a work in progress, but I hope to share more about it in the future. Suffice it to say, <a href="/chaos-engineering-101/">Chaos Engineering</a> is near and dear to my heart.</p>
<p><img src="/assets/images/recreational-programming/loopy.gif" alt="" class="align-center" /></p>
<p>What do all these ventures have in common? They’re part of a <em>positive feedback</em> loop.</p>
<p>Investing time in side projects – or recreational programming in general, the “project” bit is optional – will improve my freelance work, either directly (craftsmanship) or indirectly (<a href="/blog/what-motivates-me/">motivation</a>). Conversely, the things I learn from consulting can have a regenerative effect on my side projects. It’s a virtuous cycle.</p>
<p>(That said, it’s totally okay and often advantageous if your hobbies have nothing to do with your work. I can only speak for myself.)</p>
<p><img src="/assets/images/recreational-programming/24.png" alt="" /></p>
<p>Let’s wrap things up. Here are the key takeaways for you:</p>
<p>Cloud projects don’t have to be big or great. Sometimes, it’s enough for them to be fun. For me, serverless is the very definition of fun.</p>
<p>Building things is fulfilling; servers are a distraction from what really matters: our beloved applications.</p>
<p>Serverless has helped me, a consultant who wrestles with Kubernetes by day, rediscover the joy of programming by night.</p>
<p><strong>Serverless is an excellent choice for many endeavors, one of them being recreational programming.</strong></p>
<p><img src="/assets/images/recreational-programming/25.png" alt="" /></p>
<p>Thank you. 🙏</p>
Freelancer by Choice2020-02-24T00:00:00+01:00https://sharpend.io/freelancer-by-choice<p><img src="/assets/images/choice.jpg" alt="" /></p>
<p>I’m a <em>freelancer</em>. To be more precise, I’ve been a freelancer for three years now.</p>
<p>I had never planned to escape the traditional 9-to-5 office job to work for myself, though. For ten years, I genuinely enjoyed being a full-time employee at a handful of different tech companies. I took on various roles from software developer to systems engineer to site reliability engineer. Life was good, <a href="/blog/what-motivates-me/">sometimes great</a>, following an ever-increasing up-and-to-the-right trajectory – or at least it felt that way.</p>
<p>The truth is, becoming a freelancer was initially just a side effect. You might go so far as to call it an accident. Both are true.</p>
<p>In March 2017, after two months of back and forth (I loved the job I had at the time), I decided to jump in the deep end and join a Silicon Valley startup as employee #2. Well, technically, I was just a remote contractor based in Germany. Regular employment was not an option as the company didn’t have an office in the EU. So I ended up signing a consulting agreement, submitting monthly invoices for my work – mostly Rust & Java programming, AWS infrastructure automation, and some technical writing. For this whole arrangement to work with reasonable effort, I had to go freelance.</p>
<p>Since becoming and being a freelancer in Germany involves a fair amount of <a href="https://www.settle-in-berlin.com/how-to-become-a-freelancer-in-germany-self-employed/">bureaucracy</a>, I decided early on that getting a good accountant would be a make-or-break deal for me. While I have a basic understanding of taxes, I couldn’t care less about the intricacies of accounting. I’d rather spend my time doing things I enjoy. Luckily, a freelance friend of mine recommended a competent firm, which has taken good care of all the paperwork so far.</p>
<p>The startup gig ultimately turned out to be a rather forgettable experience. I quit after only eight months. But even though it didn’t work out as planned, I don’t have any regrets – at least not today, with the benefit of hindsight. On the one hand, I learned a lot about startup culture, remote work, <a href="/the-limitations-of-chaos-engineering/">Chaos Engineering</a>, and the business of SaaS. On the other hand, I also learned a lot – if not more – about <em>myself</em>:</p>
<ul>
<li>I can work for myself. I’m capable of accomplishing things on my own. I somehow knew this before the startup job, but managing myself with the rest of the team nine time zones away was quite a different kind of challenge.</li>
<li>I’m a self-driven person, always eager to learn more and comfortable wearing multiple hats. And yet I also know that <a href="/every-day-we-must-sweep/">my ego can get in the way</a> at times.</li>
<li>I’ve become an <a href="https://twitter.com/mlafeldt/status/1202048151802089472">AWS expert</a> who can program and write. That’s a precious combination of skills. As a result, I can charge top dollar for my hard-earned knowledge.</li>
<li>During my career, I’ve built a network of awesome people I can tap into for help on all sorts of issues ranging from technical questions to legal and business affairs.</li>
<li>I used to be too risk-averse. I blame it in part on growing up in former East Germany and not having entrepreneurial DNA in my family (it’s more complicated than that, of course). Fortunately, I’ve learned to be braver over time. Contracting for a startup was a bold move for me – and so was my early resignation.</li>
</ul>
<p>So what did I do with this new-found awareness? I’d be lying if I said I didn’t go looking for another <em>permanent</em> job at first. I did – likely out of habit and laziness. However, nothing came of it except for a dozen unexceptional interview experiences (yes, <a href="https://www.jarednelsen.dev/posts/The-horrifically-dystopian-world-of-software-engineering-interviews">hiring is broken</a>, but that’s not the point). It was frustrating and confusing, and it somehow didn’t feel right. I found myself struggling with my professional identity, facing questions I couldn’t answer easily: Who am I? How do I want to work? Employee or freelancer? And what about doing my own thing and becoming a co-founder, which was also an option at the time?</p>
<p>At some point, after weeks of contemplating, it struck me that I was actually given a rare opportunity. I realized that the startup gig was not, in fact, a failure, but rather the gentlest introduction to freelancing I could’ve asked for. Come to think of it, I learned most things a “normal” freelancer should know – without the unpleasant task of finding new clients (which, admittedly, is a critical skill to pick up sooner rather than later). I began to see the situation in a new light – as a stepping stone between full-time employment and being a freelancer, this time for real.</p>
<p>My first client came through a former coworker. So did the second and the third. To be honest, nearly all client relationships have started this way. I know someone (who knows someone) who trusts me to do great work. That’s where the network of wonderful people I mentioned earlier comes in.</p>
<p>What exactly do I do now? As a freelance cloud consultant, I specialize in designing and building robust serverless and container-based solutions for clients who want to take their businesses to the cloud. (<a href="/about/">Read more about my work here</a>.)</p>
<p>The one thing I like the most about freelancing is, unsurprisingly, the incredible amount of freedom it offers. Scott Berkun perhaps <a href="https://scottberkun.com/2015/my-creative-burnout/">put it best</a>: “My chosen career requires some significant sacrifices, but a major benefit is on most days I answer to no one. When I need time for myself it’s there for me to take it.” Many employees would be surprised just how much freedom and flexibility this way of working can provide. In fact, it still astonishes <em>me</em>.</p>
<p>All things considered, I’m a happy consultant, grateful for the opportunities that have come my way so far – including the bad ones as they make the good ones possible. Granted, I didn’t have to survive lots of ups, downs, and evolutions of my business. I know I’m still winging it in some areas, but that’s okay. Freelancing can be a stable, long-term career. To quote Paul Jarvis, “As long as you’re doing great work that’s in demand, working for yourself has <em>no limits</em>.”</p>
<p>The new normal is that I work for myself. I’m proud to be a freelancer by choice.</p>
The Limitations of Chaos Engineering2018-01-03T00:00:00+01:00https://sharpend.io/the-limitations-of-chaos-engineering<p><img src="/assets/images/fireman.jpg" alt="" /></p>
<p><a href="https://principlesofchaos.org/">Chaos Engineering</a>, and fault injection in particular, is all the rage. Breaking things on purpose, rather than “by accident”, is what the cool kids do these days. It’s what they like to speak about at meetups and conferences or proudly promote on Twitter and their blogs. We’re starting to see job titles with the word “Chaos” in them, not unlike “DevOps” a couple of years ago (both meaningless, of course). Sarcasm aside, it’s evident that Chaos Engineering has become a <a href="https://www.thoughtworks.com/radar/techniques/chaos-engineering">technology trend</a>, with more and more companies adopting it. While it might not have gone mainstream yet, we’re getting there for sure!</p>
<p>I’ve certainly contributed to the hype myself by <a href="/articles">publishing articles</a>, <a href="/talks">giving presentations</a>, and <a href="https://github.com/mlafeldt">developing tools</a> in the Chaos Engineering space. I even had a brief stint at a startup trying to sell “Failure as a service” – or “Resilience as a service”, one of the two – to enterprises. All in all, I think it’s fair to say that I’m an avid proponent of the practice. I believe that <em>every</em> Site Reliability Engineer should, at the very least, be familiar with the basics of proactive failure testing as a means to create better systems. (There’s an “R” in SRE for a reason.)</p>
<p>But here’s the rub: Being an advocate of something doesn’t mean you should close your eyes to its downsides and limitations. In fact, the most skilled engineers are well aware of the pros <em>and</em> cons of their favorite tool or method, and consider them carefully. I have the impression, however, that most discussions so far have focused almost exclusively on the advantages of Chaos Engineering – occasionally in a hyperbolic manner – without asking a lot of hard questions.</p>
<p>I, too, am guilty of this, as are many other Practitioners of Chaos (we definitely have the coolest names). This article is my attempt to take off those rose-colored glasses, at least for a moment, and put things into perspective. Believe it or not, Chaos Engineering does have its gotchas and limitations.</p>
<h2 id="a-means-to-an-end">A Means to an End</h2>
<p>First of all, it should be clear that Chaos Engineering is a means to an end, not an end in itself. Experimenting on a distributed system is of great worth, but what matters, in the end, is the <em>production service</em> you aim to improve in the first place. Breaking things is a ton of fun, I can attest to that, but as long as you don’t feed results back – by fixing flaws, tweaking runbooks, training people – your chaos experiments are rarely more than a time killer. And no, updating your mental models isn’t enough (that is unless you never forget anything and everyone has access to your brain). At the very least, write down any observations you make and follow up soon, if required.</p>
<p><strong>Fault injection on its own won’t make your infrastructure more robust; people will.</strong> It should be obvious, but it’s not. Last year I reviewed an early draft of a book on Chaos Engineering. I was surprised to learn that there was no mention of any steps beyond the experimentation phase, as if that was the ultimate goal.</p>
<p>(A quick note: I’m well aware that fault injection and Chaos Engineering are <em>not</em> the same. When the latter comes up in practice, however, people almost always talk about inducing faults into distributed systems as an opportunity to learn, so cut me some slack here.)</p>
<h2 id="one-step-forward-two-steps-back">One Step Forward, Two Steps Back</h2>
<p>Even if you do the work, who’s to say that being able to uncover weaknesses will automatically lead to positive outcomes, like improved customer experience? As software developers know, identifying a bug and fixing it are two different challenges. Indeed, your optimization efforts in one area might increase brittleness in other areas, as <a href="https://www.researchgate.net/publication/276139783">David Woods</a> points out:</p>
<blockquote>
<p>[Expanding] a system’s ability to handle some additional perturbations, increases the systems vulnerability in other ways to other kinds of events. This is a fundamental trade-off for complex adaptive systems where becoming more optimal with respect to some variations, constraints, and disturbances increases brittleness in the face of variations, constraints, and disturbances that fall outside this set.</p>
</blockquote>
<h2 id="one-among-many">One Among Many</h2>
<p>Chaos Engineering is not a remedy for all of your reliability concerns, and it never will be. It’s merely <a href="https://queue.acm.org/detail.cfm?id=2889274">one of many approaches</a> used to gain confidence in system correctness (typically in the face of perturbation). Consider it required but not sufficient. And by no means is it – or should it be – the only way to learn from failure. As John Allspaw notes in his article on <a href="https://queue.acm.org/detail.cfm?id=2353017">fault injection in production</a>:</p>
<blockquote>
<p>[GameDay] exercises aren’t meant to discover how engineering teams handle working under time pressure with escalating and sometimes disorienting scenarios. That needs to come from the postmortems of actual incidents, not from handling faults that have been planned and designed for.</p>
</blockquote>
<p>Proactive failure testing and post-incident reviews go hand in hand. As we will see next, it’s a mistake to assume that doing enough of the former makes up for neglecting the latter (and arguably vice versa). Besides, neither of the two methods is a substitute for, say, proper monitoring and unit testing. All these practices complement each other.</p>
<h2 id="too-brittle-or-too-reliable">Too Brittle or Too Reliable</h2>
<p>It goes without saying that no amount of Chaos Engineering will fix a <a href="http://www.laputan.org/mud/">Big Ball of Mud</a> (when duct tape is holding the architecture together). Try to design for failure at all levels of your system. Address known issues <em>before</em> inviting more chaos into brittle infrastructure.</p>
<p>At the same time, don’t overdo it. Solve the business problem at hand. You’re not Google. Well, actually, Google is a bad example because they do know that <a href="https://medium.com/@jerub/embracing-risk-74bd876a64da">systems can be too reliable</a>:</p>
<blockquote>
<p>If a system gets too reliable, then the team who runs it feels like they need to keep it that reliable, even though there are potential failure modes that are very expensive to mitigate.</p>
</blockquote>
<p>Chaos Engineering must make good economic sense (remember: it’s a means to an end).</p>
<h2 id="systems-will-continue-to-fail">Systems Will Continue to Fail</h2>
<p>I’ve argued before that <a href="/premortems-the-art-of-negative-visualization">negative visualization</a> – imagining what could go wrong to prepare for disruption – is an essential part of Chaos Engineering. However, it’s also one of its major limitations, as Allspaw points out in his article mentioned above:</p>
<blockquote>
<p>The faults and failure modes are contrived. They reflect the fault designer’s imagination and therefore can’t be comprehensive enough to guarantee the system’s safety. While any increase in confidence in the system’s resiliency is positive, it’s still just that: an increase, not a completion of perfect confidence. Any complex system can (and will) fail in surprising ways, no matter how many different types of faults you inject and recover from.</p>
</blockquote>
<p>It may sound overly pessimistic, but while Chaos Engineering surely is a net plus, <a href="/impermanence-the-single-root-cause">impermanence</a> makes sure that <a href="/the-myth-of-the-root-cause">all complex systems will fail</a> no matter how hard we try to avoid it (which is exactly why postmortems are so important). The Holy Grail of Automation – introducing faults automatically instead of manually – won’t change that fact a bit. Don’t fool yourself and set realistic expectations.</p>
<p>Now I can almost hear you scream: “But there’s <a href="https://people.ucsc.edu/~palvaro/molly.pdf">Lineage-driven fault injection</a>! We don’t need to dream up any failure modes!” I hear you. LDFI is, without a doubt, a great achievement and I’m looking forward to seeing more implementations of it in the wild. However, everyone who has read the paper knows that LDFI comes with its own list of requirements and limitations, which is probably the reason why we haven’t heard much about it outside of academia so far.</p>
<h2 id="the-paradox-of-automated-fault-injection">The Paradox of Automated Fault Injection</h2>
<p>Speaking of automation, there’s an interesting paradox worth highlighting. Again, quoting from Allspaw:</p>
<blockquote>
<p>If the faults that are injected (even at random) are handled in a transparent and graceful way, then they can go unnoticed. You would think this was the goal: for failures not to matter whatsoever when they occur. This masking of failures, however, can result in the very complacency that they are intended (at least should be intended) to decrease. In other words, when you’ve got randomly generated and/or continual fault injection and recovery happening successfully, care must be taken to raise the detailed awareness that this is happening – when, how, where, etc. Otherwise, the failures themselves become another component that increases complexity in the system while still having limitations to their functionality (because they are still contrived and therefore insufficient).</p>
</blockquote>
<p>We like to talk at length about how visibility is key to detecting issues during chaos experiments – and maybe aborting experiments to prevent further harm – but let’s not forget building awareness of what’s going on in the first place.</p>
<h2 id="infinite-variability">Infinite Variability</h2>
<p>I recently found <a href="https://arxiv.org/abs/1211.1949">a paper on resilience</a> that, among other things, tries to illustrate that antifragility does not exist, at least not in the sense of systems being universally antifragile. Here’s the relevant passage:</p>
<blockquote>
<p>In general, for systems subjected to variability, noise, shocks and other random perturbations, it is possible to develop strategies that, on average, benefit from variability, but not any variability. Such strategies are designed to profit from the variability of particular stressors. Simultaneously, they are vulnerable to other stressors.</p>
</blockquote>
<p>Whether you believe in antifragility or not (<a href="/antifragility-101">I have my doubts</a>), the variability argument is spot-on. No system can withstand all turbulent conditions, but only specific ones – some of them thanks to Chaos Engineering. The somewhat worn-out vaccination analogy drives the point home: We inject something harmful into a system to build an immunity to <em>it</em>, where “it” refers to the particular “disease” we want to fight. While we might be better off overall, nobody and nothing ends up being invincible this way.</p>
<p>We can also look at the problem of variability in terms of state space, as Caitie McCaffrey <a href="https://queue.acm.org/detail.cfm?id=2889274">has done for us</a>:</p>
<blockquote>
<p>Fault-injection testing forces [failures] to occur and allows engineers to observe and measure how the system under test behaves. If failures do not occur, this does not guarantee that a system is correct since the entire state space of failures has not been exercised.</p>
</blockquote>
<h2 id="you-cant-have-a-rollback-button">You Can’t Have a Rollback Button</h2>
<p>Last but not least, let’s look at a seemingly innocuous implementation detail.</p>
<p><a href="https://blog.skyliner.io/you-cant-have-a-rollback-button-83e914f420d9">The rollback button is a lie</a>. That’s not only true for application deployments but also for fault injection, as both face the same fundamental problem: state. Yes, you might be able to revert the direct impact of non-destructive faults, which can be as simple as no longer generating CPU/memory/disk load or deleting traffic rules created to simulate network conditions. But no, you <em>can’t</em> roll back what has been inflicted on everything else in the system – the targeted application and everything that interacts with it. A prime example is corrupt or incorrect data stored in a database/queue/cache due to a program crash.</p>
<p>One might argue that it is the very goal of chaos experiments to reveal such weaknesses, and that’s exactly right. However, having a rollback/revert button promising to quickly get back to safety is, strictly speaking, a scam. That isn’t to say we should do away with these safeguards entirely. It just means we should stop implying that all, or even most, actions can be quickly reversed, which may cause engineers to take more risk than necessary.</p>
<h2 id="wrapping-up">Wrapping Up</h2>
<blockquote>
<p>And we each forget, every day, how much care we need to take when using our seemingly benign tools; they are so useful and so sharp.<br />
– Michael Harris, The End of Absence</p>
</blockquote>
<p>Chaos Engineering has many benefits – I can’t imagine a tech world without it – but all that glitters is not gold. Please stop pretending that it is.</p>
Antifragility 1012017-09-21T00:00:00+02:00https://sharpend.io/antifragility-101<p><em>Antifragility.</em></p>
<p>If you’ve been wondering what this term means, you’re not alone. To be honest, I also had a hard time understanding the concept of antifragility, in particular how it compares to <em>resilience</em>. Fortunately, I know that writing – and the research that goes along with it – are perfect for both gaining and sharing knowledge, so I put this article together.</p>
<h2 id="resilience-by-example">Resilience by Example</h2>
<p>To understand antifragility, I think it’s helpful to understand resilience first. Here’s one definition of resilience (there are different definitions, but let’s stick with this one for antifragility to make any sense):</p>
<blockquote>
<p>A system is resilient if it can adjust its functioning prior to, during, or following events (changes, disturbances, and opportunities), and thereby sustain required operations under both expected and unexpected conditions.<br />
– <a href="http://erikhollnagel.com/ideas/resilience-engineering.html">Erik Hollnagel</a></p>
</blockquote>
<p>Resilience is something a system does, not something a system has.</p>
<p>There are many well-known patterns to build web systems that are resilient to certain kinds of failures. We use <a href="http://docs.aws.amazon.com/autoscaling/latest/userguide/">auto scaling groups</a>, for example, for clusters to maintain a minimum number of EC2 instances (server capacity) to absorb disturbances and continue serving user requests. If an auto scaling group considers an instance to be unhealthy, it will automatically terminate that instance and launch a replacement. The infrastructure can “heal itself” by recovering from (some but not all) failures. That’s one example of resilience as defined above.</p>
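<p>As a concrete sketch, the self-healing behavior described above boils down to a few lines of configuration. This hypothetical CloudFormation fragment (resource names, sizes, and subnet IDs are invented) keeps at least two healthy instances running at all times:</p>

```yaml
WebServerGroup:
  Type: AWS::AutoScaling::AutoScalingGroup
  Properties:
    MinSize: "2"              # never fall below this capacity
    MaxSize: "6"
    DesiredCapacity: "2"
    HealthCheckType: ELB      # replace instances the load balancer flags as unhealthy
    HealthCheckGracePeriod: 300
    LaunchTemplate:
      LaunchTemplateId: !Ref WebServerTemplate  # placeholder launch template
      Version: !GetAtt WebServerTemplate.LatestVersionNumber
    VPCZoneIdentifier:
      - subnet-12345678       # placeholder subnet IDs
      - subnet-87654321
```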
<p>With that out of the way, let’s take a first look at antifragility and how it’s different from resilience.</p>
<h2 id="the-antifragile-gets-better">The Antifragile Gets Better</h2>
<p>The Wikipedia page on <a href="https://en.wikipedia.org/wiki/Antifragility">Antifragility</a> is a good start if you want to learn the very basics. Here are some of the more interesting bits (emphasis mine):</p>
<ul>
<li>
<p>“Antifragility is a property of systems that increase in capability, resilience, or robustness as a result of stressors, shocks, volatility, noise, mistakes, faults, attacks, or failures. It is a concept developed by Professor Nassim Nicholas Taleb in his book, <a href="https://en.wikipedia.org/wiki/Antifragile">Antifragile</a> and in technical papers.”</p>
</li>
<li>
<p>“Antifragility is fundamentally different from the concepts of resiliency (i.e. the ability to recover from failure) and robustness (that is, the ability to resist failure).”</p>
</li>
<li>
<p>Taleb explains the differences this way: “Antifragility is beyond resilience or robustness. <strong>The resilient resists shocks and stays the same; the antifragile gets better.</strong>”</p>
</li>
</ul>
<p>In the auto scaling example above, the system is able to restore the desired capacity shortly after losing a server instance – it resists the shock and stays the same.</p>
<p>So far so good, but how, as Taleb claims, do antifragile systems get better? In other words, how do they <em>benefit</em> from disturbances?</p>
<p>One of the few practical examples of antifragility I understood intuitively is related to our body. The human body is an antifragile system because it gets better – faster and stronger – through physical training. It will adapt to the stress of exercise with increased fitness if the stress is above a certain threshold, but not too high either (a process called <a href="https://en.wikipedia.org/wiki/Adaptation">adaptation</a>).</p>
<p>While that example is relatable, I still had trouble applying the idea of antifragility to areas like software development and web operations.</p>
<h2 id="potential-downside--potential-upside">Potential Downside &lt; Potential Upside</h2>
<p>Eager to learn more about antifragility in the context of DevOps, I read <a href="http://www.oreilly.com/webops-perf/free/antifragile-systems-and-teams.csp">Antifragile Systems and Teams</a> by Dave Zwieback. This short report turned out to be the <em>best</em> summary of the topic I’ve seen so far. I highly recommend reading it (especially if you can’t stand Taleb’s writing style).</p>
<p><strong>“The main property of antifragile systems”, Dave writes, “is that the potential downside due to stress (and its retinue) is lower than the potential upside, up to a point.”</strong></p>
<p>Some examples from the report:</p>
<ul>
<li>
<p>“Vaccination makes a population antifragile because the downside (a small number of individuals having negative side effects) is small in comparison to the upside (an entire population gaining immunity to a disease).”</p>
</li>
<li>
<p>“[With BitTorrent] the more our file is requested, the more robust to failure and available it becomes because parts of it are stored on a progressively larger number of computers. […] our cost of distributing this file would remain constant – not so for the cost of making systems more robust to anticipate higher demand or improve resiliency.”</p>
</li>
<li>
<p>“[The potential downside of frequent deployments] is smaller than the potential upside. […] customers receive higher-quality products and services (i.e., value) faster and at a lower cost than is possible with traditional, risk- and volatility-averse approaches.” (Put another way: <a href="/if-it-hurts-do-it-more-often">If it hurts, do it more often.</a>)</p>
</li>
</ul>
<p>Dave goes on to show how the layers of DevOps (culture, automation, measurement, and sharing) can contribute to the antifragility of organizations, concluding that there’s “significant overlap in practices of DevOps organizations and those that seek the benefits of antifragility”. After all, “DevOps embraces and makes use of disorder, randomness, and, ultimately, impermanence”.</p>
<h2 id="chaos-engineering">Chaos Engineering</h2>
<p>Speaking of <a href="/impermanence-the-single-root-cause">embracing impermanence</a>: if it’s possible for systems to benefit from shocks – to become more robust as a result – the idea of injecting faults <em>on purpose</em> suddenly doesn’t sound so crazy anymore, right?</p>
<p>In fact, there’s a discipline called <a href="/the-discipline-of-chaos-engineering">Chaos Engineering</a> centered around this idea. Michael Nygard, the author of <em>Release It!</em>, put it well in <a href="http://blog.cognitect.com/blog/2016/2/18/the-new-normal-from-resilient-to-antifragile#comment-2523487867">this comment</a>:</p>
<blockquote>
<p>Chaos engineering is a technique to create antifragility. That is, if you evolve toward systems that survive that kind of chaos, then your systems will exhibit antifragility.</p>
</blockquote>
<blockquote>
<p>However, one caveat: antifragility is not a universal or omnidimensional characteristic. Chaos engineering causes your system to evolve toward antifragility toward <em>those kind of stresses</em>.</p>
</blockquote>
<p>Antifragile systems might benefit from variability, but not from just any variability. A system can’t be universally antifragile, just as it can’t withstand every possible failure.</p>
<blockquote>
<p>Example: Chaos monkey kills EC2 instances. In response, you build autoscaling, masterless clusters. That helps when machines die, but not when whole regions die. Or when DNS fails. Or when data gets corrupted. Or when the marketplace changes. Etc.</p>
</blockquote>
<p>The potential downside of Chaos Engineering (occasional service interruptions) is smaller than the potential upside (better overall customer experience) – up to the point where experiments cause severe damage that affects customers. While I don’t believe that web infrastructure itself can be antifragile (I might be wrong), it seems plausible to say that <strong>Chaos Engineering <em>creates</em> antifragility by enabling teams to improve their infrastructure through experimentation.</strong></p>
<hr />
<p>So, is antifragility a useful concept? I honestly still don’t know what to think of it. Writing this article led to some insightful conversations that made me question much of what I <em>believe</em> I know about resilience. Among other things, I learned that antifragility might be superfluous depending on which definition of resilience you use. I therefore almost decided against publishing. However, I also realized that I’m still learning, that this piece is part of my journey. I promise it won’t be my last take on the topic.</p>
Impermanence: The Single Root Cause2017-08-25T00:00:00+02:00https://sharpend.io/impermanence-the-single-root-cause<p><img src="/assets/images/rust-wave.jpg" alt="" /></p>
<p>If you google for “impermanence”, <a href="https://en.wiktionary.org/wiki/impermanence">you will learn</a> that the word means “lack of permanence or continued duration”. This definition is admittedly very vague and very boring. If you look a bit further though, you will discover that impermanence is also the name of an essential <a href="https://en.wikipedia.org/wiki/Impermanence">doctrine of Buddhism</a>. The doctrine says that “all temporal things, whether material or mental, are compounded objects in a continuous change of condition, subject to decline and destruction”. Well, I don’t believe in Buddha, or Jesus for that matter. But I do strive to understand complex systems (<a href="/how-complex-web-systems-fail-part-1">and how they fail</a>). If religion might help me just a little bit in that regard, I’m more than willing to listen.</p>
<p>Let’s try to interpret the doctrine, starting with the core statement:</p>
<p><em>All things are compounded objects in a continuous change of condition.</em></p>
<p>Compounded objects are objects formed by combining two or more parts. In complex systems like web systems, it’s safe to assume that <em>all</em> things are compounded. Every hardware component and every piece of software consists of multiple parts.</p>
<p>Given this insight, allow me to take some mental leaps:</p>
<p><em>All systems are in a continuous change of condition.</em></p>
<p><em>All systems are changeable by nature.</em></p>
<p>And finally, to close the loop:</p>
<p><em>All systems are impermanent.</em></p>
<p><strong>In fact, constant change is a prerequisite for systems to function – to fulfill their purpose.</strong></p>
<p>Despite their catchy name, even <a href="https://martinfowler.com/bliki/ImmutableServer.html">immutable servers</a> aren’t truly immutable (sorry). For servers to do anything useful – processing HTTP requests, streaming log messages, rendering cat pictures – countless changes have to take place in both soft- and hardware. Change is indispensable.</p>
<p>On a related note, we practice <a href="/the-discipline-of-chaos-engineering">Chaos Engineering</a> to learn something new about our systems by <em>deliberately imposing change</em> on them. In a sense, we embrace the fact that all systems are impermanent.</p>
<p>But why is this understanding useful? Why should you care?</p>
<p>At this point, it’s time to admit that I first read about impermanence in Dave Zwieback’s superb book, <a href="http://shop.oreilly.com/product/0636920033981.do">Beyond Blame</a>, which led me to write the article, <a href="/learning-from-failure-and-success-through-postmortems">Learning From Failure and Success Through Postmortems</a>. He not only taught me about the meaning of the word but also, and more importantly, introduced me to this powerful idea:</p>
<p><strong>Impermanence is the single root cause for all failures and successes.</strong></p>
<p>In <em>Beyond Blame</em>, Dave explains it as follows: “The root cause for both the functioning and malfunctions in all complex systems is impermanence (i.e., the fact that all systems are changeable by nature). Knowing the root cause, we no longer seek it, and instead look for the many conditions that allowed a particular situation to manifest. We accept that not all conditions are knowable or fixable.”</p>
<p>Let that sink in for a minute. (We’ll get to the details in a bit.)</p>
<p>Eager to learn more, I also read Dave’s free report, <a href="http://www.oreilly.com/webops-perf/free/antifragile-systems-and-teams.csp">Antifragile Systems and Teams</a>, which devotes the first chapter to a brief summary of impermanence, this time going into more detail.</p>
<p>The report starts by repeating the core idea – “systems start, stop, or continue working” due to their “changeable, impermanent nature”, which is the single root cause – and goes on to explain why this theoretical understanding is indeed useful: “it reminds us that all functioning systems will eventually break down”. That knowledge, in turn, “frees us from looking for the ‘single root cause’ of outages, and from the mistaken belief that there is none”. (As Sidney Dekker famously put it: “What you call root cause is simply the place where you stop looking any further.”)</p>
<p>Having accepted impermanence, we might be tempted to blame it for each and every incident. Doing so, however, would be a mistake, depriving us of the opportunity to learn from failure. Besides, we are engineers! As Dave rightly observes, we “cannot accept that things break or function entirely randomly”. I certainly can’t. And rather than giving up our profession and going shopping, we should try hard to “identify [at least] some of the conditions” contributing to the success and failure of our systems, i.e., “conditions that we can actually impact” such as infrastructure design or collaboration in the workplace.</p>
<p>By finding and fixing those conditions – potentially <a href="/writing-your-first-postmortem">through postmortems</a> – we’re able to improve our organizations and systems in a meaningful way.</p>
<p><strong>In conclusion, we need to stop wasting our time looking for the single root cause. Impermanence is the one cause of all functioning systems and all outages. Period. We should rather <a href="/on-finding-root-causes">focus on the conditions</a> leading to both good and bad situations.</strong></p>
A Primer on Automating Chaos (Gremlin)2017-08-09T00:00:00+02:00https://sharpend.io/a-primer-on-automating-chaosLearning From Failure and Success Through Postmortems2017-07-28T00:00:00+02:00https://sharpend.io/learning-from-failure-and-success-through-postmortems<p><img src="/assets/images/yin-yang.jpg" alt="" /></p>
<p><em>Yin and yang – failure and success.</em></p>
<p>The other day, when I was listening to the <em>Beats, Rye & Types</em> podcast, I noticed this sharp statement, which I had to jot down immediately:</p>
<blockquote>
<p>With the traditional methods of dealing with failure… you can get to a certain threshold of safety – and you hit sort of a plateau beyond which you cannot go. In order for you to go beyond that, you have to start approaching safety in this new way, basically focusing on learning, removing blame, building healthy organizations.</p>
</blockquote>
<p>It was none other than Dave Zwieback, the author of <a href="http://shop.oreilly.com/product/0636920033981.do">Beyond Blame: Learning From Failure and Success</a>, who said this on <a href="http://beatsryetypes.com/episodes/2016/01/25/episode-44-blame-and-bias-with-dave-zwieback.html">the show</a>. He’s one of the best-known proponents of using blameless postmortems – or learning reviews, as he prefers to call them – to address fragility within complex systems and organizations. The interview with Dave made me want to read his book, which turned out to be an excellent decision. Despite being a short read (69 pages of content), it’s a masterpiece packed with valuable insights – some of them entirely new to me.</p>
<h2 id="learning-through-postmortems">Learning through postmortems</h2>
<p>Learning from the past is very difficult, yet it’s precisely what we need to do for our organizations to succeed. If we look a bit more closely, there’s no shortage of opportunities to learn from the past – server outages are a prime example – provided that we stop the finger-pointing and “go beyond blame and punishment”, as Dave puts it.</p>
<blockquote>
<p>With the perfect vision afforded by hindsight, we can spend a lot of time ruminating on what we could or should have done. That is counterproductive; the past is past. We need to acknowledge and learn from our mistakes, and move forward, focusing instead on what we will do now and in the future.</p>
</blockquote>
<p>Fixing the technical issues that manifest during outages is important, but not sufficient. We also need a structured process to truly learn from these events. That’s what postmortems are for. They help us understand why incidents happen and how to prepare our company – “outages are symptoms of trouble somewhere deeper in our organization” – and the systems we run for the future.</p>
<p>Indeed, I can say without exaggeration that <a href="/writing-your-first-postmortem">conducting postmortems</a> has been one of the most rewarding – though initially uncomfortable – experiences of my engineering career. Postmortem conversations can be tense when stakes are high. It’s also hard work to “mentally transport ourselves to the past” while we’re under the influence of cognitive biases (we always are). Hindsight bias, in particular, remains an obstacle to incident investigation, making it impossible to assess human performance accurately after the fact.</p>
<blockquote>
<p>“If only Mike didn’t troubleshoot the router,” for example, is not describing what actually happened, and instead of learning from the past, we’re engaging in a kind of lazy (but very comforting) wishful thinking. “Mike could have asked for help,” or “Mike should have done more testing in the lab,” or “Mike didn’t do the right thing,” are all counterfactuals, and are all evidence of hindsight bias.</p>
</blockquote>
<h2 id="all-actions-are-gambles">All actions are gambles</h2>
<p>A closely related cognitive error is outcome bias, which refers to the tendency to judge a decision by its eventual outcome (which isn’t known at decision time, of course). The same routine command we’ve used hundreds of times in the past – one that might have saved the day more than once – suddenly crashes the server because the system has drifted into failure, unnoticed by anyone. Now we’re no longer the hero; we’re just some careless cowboy administrator. That’s outcome bias at work.</p>
<p>It’s important to understand that <em>every</em> outcome, successful or not, is the result of a <em>gamble</em>. In his influential paper <a href="http://web.mit.edu/2.75/resources/random/How%20Complex%20Systems%20Fail.pdf">How Complex Systems Fail</a>, Richard Cook observes the following:</p>
<blockquote>
<p>[In complex systems] all practitioner actions are actually gambles, that is, acts that take place in the face of uncertain outcomes. The degree of uncertainty may change from moment to moment. That practitioner actions are gambles appears clear after accidents; in general, post hoc analysis regards these gambles as poor ones. But the converse: that successful outcomes are also the result of gambles; is not widely appreciated.</p>
</blockquote>
<p>Why is that so hard to comprehend? <em>Beyond Blame</em> has one answer:</p>
<blockquote>
<p>It’s a reflection of our misunderstanding of how complex systems function… Failure is a normal part of complex systems, yet it’s always so surprising when they fail. Why aren’t we more surprised when they function? … In complex systems failure is absolutely normal and expected. Malfunction is as ‘normal’ as ‘regular’ functioning.</p>
</blockquote>
<h2 id="we-get-most-thingsright">We get most things right</h2>
<p>Accepting that malfunction and “normal operation” are part of complex systems is one thing; actually learning from both failures <strong>and</strong> successes is another. I agree with Dave that postmortems tend to place too much focus on the former while neglecting the latter:</p>
<blockquote>
<p>We also want to learn not just what went wrong… but what went right – what usually goes right. We’re typically overly focused on failures, forgetting that the same systems – including the people working in them – produce both positive and negative outcomes. Mostly positive, in fact – we certainly don’t have outages every hour or even every day!</p>
</blockquote>
<p>Given the tech industry’s obsession with failure, I find this encouraging. We get <em>most</em> things right. Something to remember.</p>
<p>To wrap this up, here’s what we should do to prevent future incidents, which are inevitable, as best as we can:</p>
<blockquote>
<p>Learning from both failures and successes. Feeding these learnings as signals back into the system, which will change and adapt to this new information. That’s why air travel has become as safe as it is over time – every time there is an accident or near-accident, it’s investigated, and the results are fed back into the system. This system includes aircraft, traffic control, weather, engineers, and so on.</p>
</blockquote>
<p>I couldn’t have said it better myself.</p>
<h2 id="further-reading">Further reading</h2>
<p>I’ve been studying complex systems, postmortems, root cause analysis, etc. for some time now. Feel free to look into the following articles if any of this is of interest to you:</p>
<ol>
<li><a href="/how-complex-web-systems-fail-part-1">How Complex Web Systems Fail - Part 1</a> and <a href="/how-complex-web-systems-fail-part-2">Part 2</a></li>
<li><a href="/the-myth-of-the-root-cause">The Myth of the Root Cause: How Complex Web Systems Fail</a></li>
<li><a href="/writing-your-first-postmortem">Writing Your First Postmortem</a></li>
<li><a href="/on-finding-root-causes">On Finding Root Causes</a></li>
</ol>
What I Learned From Hacking Video Games2017-07-12T00:00:00+02:00https://sharpend.io/what-i-learned-from-hacking-video-games<p>In high school and later in university I had what most people would consider a rather odd hobby: <strong>video game hacking.</strong> More precisely, I was into creating <a href="https://en.wikipedia.org/wiki/Cheating_in_video_games">cheat codes</a> for Playstation games that would, for example, make you invincible against enemy attacks or allow you to drive through walls in your favorite racing game. However, I’m not talking about the kinds of cheat codes that can be activated by entering secret passwords or pressing controller buttons in a particular sequence. I’m talking about modifying the game’s data and program code at runtime by <em>reverse engineering</em> its mechanics.</p>
<p>In the late 90s and early 2000s, you could buy off-the-shelf cheat devices for Playstation 1 (PSX) and Playstation 2 (PS2), like the <a href="https://en.wikipedia.org/wiki/Xploder">Xploder</a> and <a href="https://en.wikipedia.org/wiki/Code_Breaker">CodeBreaker</a>. Those devices worked by injecting a small program into memory between booting the console and starting a game. That program would then be running in the game’s background with full access to its memory. It was relatively easy for cheat devices to “freeze” things like energy points, ammo, money, etc. by constantly writing the same values to certain memory addresses (protections such as <a href="https://en.wikipedia.org/wiki/Address_space_layout_randomization">ASLR</a> were unheard of at the time).</p>
<figure class="">
<img src="/assets/images/xplorer.jpg" alt="" /><figcaption>
<a href="https://github.com/simias/rustation/wiki/Xplorer-FX-setup">Cheat devices like the Xploder/Xplorer could be connected to the parallel I/O port of the PSX</a>
</figcaption></figure>
<p>Best of all, the first cheat devices for PSX let you find <em>your own</em> cheat codes. This process boiled down to dumping the game’s RAM to a PC (via printer cable) where it could be searched and compared with other dumps by special “trainer” software. We did all of this on – hold your breath – Windows 98. There was a hex editor with a live view of the game’s memory; you could manipulate bytes and see the effect instantly. You could also set breakpoints, which was by far my favorite feature as it made for very sophisticated cheat codes – so-called assembly hacks that would overwrite game logic. Over the years, I’ve created thousands of codes this way.</p>
<figure class="">
<img src="/assets/images/x-link.png" alt="" /><figcaption>
<a href="http://imgur.com/8dTzclH">The powerful X-Link trainer software for PSX</a>
</figcaption></figure>
<p>When the golden age of PSX hacking was over, something bad happened: all of a sudden, companies stopped sharing their tools and started encrypting everything. It was unfortunate, but at the same time, it gave me an opportunity to improve my limited programming and reverse engineering skills. That’s how we came to write <a href="https://github.com/mlafeldt/ps2rd">our own software</a> for hacking PS2 games. Along the way, we cracked just about anything we could get our hands on, from <a href="https://github.com/mlafeldt/cb2util">proprietary encryption schemes</a> to <a href="https://www.consolecopyworld.com/psx/psx_libcrypt_tutorial.shtml">anti-cheat protections</a>.</p>
<p>All of this was a ton of fun. As a matter of fact, it’s how I got into computers and programming in the first place. But it’s more than that, much more. Here are 10 things I learned from hacking video games, most of them only obvious in hindsight:</p>
<ol>
<li>
<p><strong>Programming.</strong> I taught myself the C programming language to write my first “proper” hacking tools. All the cool kids were using it for low-level system stuff. This was also when I became a fan of command-line tools. (These days, I dig <a href="/go-mental-models-and-side-effects">Go and Rust</a>, probably for similar reasons.)</p>
</li>
<li>
<p><strong>Bits and bytes.</strong> Video game hacking helped me understand: hexadecimal numerals, Boolean algebra, binary formats such as ELF, memory layout of executables, interrupts, breakpoints, assemblers, disassemblers, hex editors, and a ton more.</p>
</li>
<li>
<p><strong>Cryptography.</strong> Poking around the <a href="https://github.com/mlafeldt/xpcrypt">Xploder PSX code encryption</a> was my first experience in cryptography. I figured out where the device stored the unencrypted codes in RAM and did a dumb brute-force attack to break the different schemes. Since then, I’ve been fascinated by crypto.</p>
</li>
<li>
<p><strong>Networking basics.</strong> The PS2 comes with an Ethernet port/adapter, which is what we used to transfer memory dumps to a PC using a simple client-server implementation. I had to learn the basics of TCP/IP, DNS, network programming, etc. to achieve that.</p>
</li>
<li>
<p><strong>GNU/Linux.</strong> Writing code for game consoles usually involves cross-compiling under a Linux-y environment. That’s how I was introduced to gcc, bash, make, grep, and other GNU utilities – first on Windows via Cygwin and MinGW, later on Ubuntu.</p>
</li>
<li>
<p><strong>Open source.</strong> I remember how a hacker called <em>Parasyte</em> made the source code of his tools available to the public. I learned so much from his C code that I started open sourcing my programs in the same way. <a href="https://en.wikipedia.org/wiki/Homebrew_(video_games)">Homebrew</a> made sure I always had a plethora of side projects going on. I also enjoyed the social side of open source and bulletin boards.</p>
</li>
<li>
<p><strong>Version control.</strong> Subversion, Mercurial, Git. I tried them all. If I remember correctly, I was among the first who <a href="https://github.com/ps2dev">moved homebrew over to GitHub</a>. Using GitHub and other platforms also sparked my interest in writing good documentation.</p>
</li>
<li>
<p><strong>Curiosity.</strong> I believe that my strong desire to know or learn something partially stems from tinkering with video games back in the day.</p>
</li>
<li>
<p><strong>Freedom.</strong> Another <a href="/blog/what-motivates-me">important motivator</a>. I deliver my best work when I have the freedom to accomplish tasks my way. I learned this early on. But there’s also freedom in the sense that I was convinced that game hacking had to be free. No code encryption. No proprietary formats. An end to control.</p>
</li>
<li>
<p><strong>Grit.</strong> I used to spend weeks reverse engineering a video game or cheat device. Today, I still enjoy solving tricky technical problems over a long period – <a href="/perfectionism-and-programming">without losing myself in details</a>.</p>
</li>
</ol>
<p>I didn’t put this list together as a means to brag – it’s not that impressive anyway. No, I mainly wrote this for nostalgia (remembering the good old days) and introspection (trying to understand myself better).</p>
<p>While times have changed – I barely pick up a game controller anymore – I probably wouldn’t be the software engineer I am today if it wasn’t for game hacking. In fact, I may not even be a programmer at all. It still amazes me how such a geek hobby ultimately turned into a lifelong passion and career.</p>
<hr />
<p>In 2009, <a href="https://web.archive.org/web/20170918002756/http://gamehacking.org/qna/22">I did a Q&A</a> with the founder of GameHacking.org. While some of the discussed topics might not make sense to someone who has never hacked Playstation games, I think the interview still provides a bit more context. Back in the day, I used the nickname <em>misfire</em>. Naming is hard.</p>
Embracing Failure in a Container World2017-06-28T00:00:00+02:00https://sharpend.io/embracing-failure-in-a-container-world<p>What follows is the text of my presentation, <em>Embracing Failure in a Container World</em>, which I gave at <a href="https://containerdays.io/">ContainerDays</a> in Hamburg this year. There’s no recording available, so I figured it would be fun to turn the presentation into an article. I edited the text slightly for readability and added some links for more context. You can find the original slides <a href="https://speakerdeck.com/mlafeldt/embracing-failure-in-a-container-world">here</a>.</p>
<hr />
<p><img src="/assets/images/2017-06-28/01.png" alt="" /></p>
<p>Hi, and welcome to my talk, “Embracing Failure in a Container World”.</p>
<p>Today I want to show you some practices and tools you can use to make <em>your</em> container systems more resilient to failures.</p>
<p><img src="/assets/images/2017-06-28/02.png" alt="" /></p>
<p>My name is Mathias.</p>
<p>I’m <code class="language-plaintext highlighter-rouge">@mlafeldt</code> on Twitter, GitHub, and pretty much anywhere else on the internet.</p>
<p><img src="/assets/images/2017-06-28/03.png" alt="" /></p>
<p>I live here in Hamburg and <a href="/blog/im-joining-gremlin-inc/">I work remotely</a> for a US startup called Gremlin Inc.</p>
<p>We obviously like to <em>break things on purpose</em>, but I will tell you more about Gremlin at the end of the talk.</p>
<p>Right now, let’s talk about a fun topic…</p>
<p><img src="/assets/images/2017-06-28/04.png" alt="" /></p>
<p>Outages.</p>
<p>Here are three of the better-known ones from this year:</p>
<ol>
<li>GitLab’s infamous “team-member-1” accidentally removed a folder on the wrong server, resulting in a <a href="https://about.gitlab.com/2017/02/10/postmortem-of-database-outage-of-january-31/">long database outage</a>.</li>
<li>We probably all noticed the <a href="https://aws.amazon.com/message/41926/">major S3 incident</a> earlier this year, which also affected other AWS services like EC2.</li>
<li>Last but not least, a <a href="https://patch.com/washington/seattle/starbucks-down-stores-across-country-affected-reported-outage">recent server outage</a> halted sales at many Starbucks stores in the US, but at least they gave out free coffee.</li>
</ol>
<p>I bet you also suffered from other outages not listed here.</p>
<p>But what’s the lesson?</p>
<p><img src="/assets/images/2017-06-28/05.png" alt="" /></p>
<p>The lesson is that sooner or later, <a href="/the-myth-of-the-root-cause">all complex systems will fail</a>.</p>
<p>There will always be something that can – and will – go wrong. No matter how hard we try, we can’t build perfect software, nor can the companies we depend on.</p>
<p>Even S3, which has had a long track record of availability, will fail in surprising ways.</p>
<p><img src="/assets/images/2017-06-28/06.png" alt="" /></p>
<p>I found this quote from <a href="https://www.slideshare.net/try_except_/large-scale-kubernetes-on-aws-at-europes-leading-online-fashion-platform-container-days-hamburg">Henning’s talk</a> to fit in nicely here. That’s why I turned it into a slide.</p>
<p>Speaking of complex systems that fail, “There’s always something with Docker in production.”</p>
<p>It’s funny because it’s true.</p>
<p><img src="/assets/images/2017-06-28/07.png" alt="" /></p>
<p>So we live in this imperfect world – things break all the time, that’s just how it is. All we can do is accept it and focus on the things we can control: creating a quality product or service that is resilient to failures.</p>
<p>Add redundancies, use auto scaling, gracefully degrade whenever possible, decrease coupling between system components – those are well-known design patterns to make systems more resilient to failures.</p>
<p><img src="/assets/images/2017-06-28/08.png" alt="" /></p>
<p>Well, at least that’s the theory. Building robust systems in practice is a lot harder, of course. How do you actually know you’re prepared for the worst in production?</p>
<p>Sure, you can learn from outages after the fact. That’s what postmortems are for. Postmortems are awesome, <a href="/writing-your-first-postmortem">I’m a big fan</a>. However, learning it the hard way shouldn’t be the only way to acquire operational knowledge.</p>
<p>Waiting for things to break in production is <em>not</em> an option.</p>
<p>But what’s the alternative?</p>
<p><img src="/assets/images/2017-06-28/09.png" alt="" /></p>
<p>The alternative is to break things on purpose. And <a href="/the-discipline-of-chaos-engineering">Chaos Engineering</a> is one particular approach to doing just that. The idea of Chaos Engineering is to be proactive – to inject failures <em>before</em> they happen in production.</p>
<p>Intentionally terminate cluster machines, delete database tables, inject network latency. Be creative. These actions help us verify that our infrastructure can cope with such failures – and show us what to fix when it can’t.</p>
<p>However, Chaos Engineering is not only about testing assumptions; it’s also about learning new things about our systems, like discovering hidden dependencies between components.</p>
<p>Chaos Engineering was originally formalized by Netflix. Check out their website – <a href="https://principlesofchaos.org/">principlesofchaos.org</a> – for more details.</p>
<p><img src="/assets/images/2017-06-28/10.png" alt="" /></p>
<p>Before we move on, let me give you a bit more context.</p>
<p>It’s fair to say that I learned most of the things I know about web infrastructure in my four years at <a href="https://www.jimdo.com/">Jimdo</a>, especially when I was part of the team responsible for Wonderland, which is Jimdo’s internal PaaS.</p>
<p>In fact, we gave a <a href="https://speakerdeck.com/mlafeldt/a-journey-through-wonderland">presentation about Wonderland</a> at ContainerDays last year. I want to spare you the details today. Suffice it to say that Wonderland uses ECS under the hood – Amazon’s cluster scheduler for Docker containers.</p>
<p>Also, Wonderland powers 100% of Jimdo’s production infrastructure. It’s not just some toy Docker project; it’s the real thing.</p>
<p><img src="/assets/images/2017-06-28/11.png" alt="" /></p>
<p>Now back to Chaos Engineering.</p>
<p>One thing we did to implement Chaos Engineering at Jimdo was to run so-called GameDay exercises on a regular basis.</p>
<p>On a typical GameDay, we would…</p>
<ol>
<li>Gather the whole team in front of a big screen.</li>
<li>Think up failure modes and estimate the impact of those failures. For example, we asked ourselves, “What would happen if we terminated 5 cluster instances at once or if this critical microservice went down?”</li>
<li>Go through all chaos experiments together, breaking things on purpose just as we planned.</li>
<li>Write down the measured impact.</li>
<li>Create follow-up tickets for all flaws uncovered this way.</li>
</ol>
<p>I can assure you that we found a number of issues every single time, many caused by missing timeouts and unexpected dependencies, and also some bugs in open source software we were using.</p>
<p>To put it in a nutshell, GameDays are great. They helped us improve Wonderland to the point where we had to test PagerDuty during GameDays to find out if our monitoring was still working – because <a href="/complacency-the-enemy-of-resilience">alerts were so rare</a>.</p>
<p><img src="/assets/images/2017-06-28/12.png" alt="" /></p>
<p>One particular tool we used during GameDays was <a href="/chaos-monkey-for-fun-and-profit">Chaos Monkey</a>.</p>
<p><em>Who of you has used Chaos Monkey before? And in production? (Asking the audience. For the record, only Jimdo employees raised their hands twice.)</em></p>
<p>Chaos Monkey was built by Netflix to terminate EC2 instances randomly during business hours. The goal is for the infrastructure to survive terminations without any customer impact.</p>
<p>I think you all agree that it’s better to test this proactively in the office than at 4 a.m. when shit really hits the fan.</p>
<p><img src="/assets/images/2017-06-28/13.png" alt="" /></p>
<p>Unfortunately, the configuration of Chaos Monkey is a bit complex. It’s a Java program with dozens and dozens of settings. That was one problem; the other was deployment.</p>
<p>Since we already had this amazing container platform at Jimdo, I decided to dockerize Chaos Monkey and solve the configuration issue at the same time.</p>
<p><a href="https://github.com/mlafeldt/docker-simianarmy">This Docker image</a> is the result. We ended up using it to run one monkey per environment (one in staging, one in production), which gave us a solid foundation for running chaos experiments on GameDays.</p>
<p><img src="/assets/images/2017-06-28/14.png" alt="" /></p>
<p>To give you an example, this shows the basic usage of the Docker image.</p>
<p>We pass the required AWS credentials to the image and instruct it to consider all EC2 auto scaling groups of the corresponding AWS account for termination.</p>
<p>When running, Chaos Monkey will randomly pick one auto scaling group and terminate one instance of that group. That’s how it works.</p>
<p><img src="/assets/images/2017-06-28/15.png" alt="" /></p>
<p>Chaos Monkey usually runs continuously, killing an instance every hour or so. However, we only needed it during GameDays or whenever we had to do some resilience testing.</p>
<p>Luckily, Chaos Monkey also comes with a REST API that allows us to terminate instances on demand. And I wrote a <a href="https://github.com/mlafeldt/chaosmonkey">command-line tool</a> in Go to make use of that API.</p>
<p><img src="/assets/images/2017-06-28/16.png" alt="" /></p>
<p>Again, here’s an example.</p>
<p>We first install the <code class="language-plaintext highlighter-rouge">chaosmonkey</code> command-line tool using <code class="language-plaintext highlighter-rouge">go get</code>.</p>
<p>We then tell it where Chaos Monkey is running, what auto scaling group we want to target, and how the instance should be terminated (Chaos Monkey supports different chaos strategies, with “shutdown” being the default).</p>
<p>We can also be fancy and terminate multiple instances in a row, which is helpful for finding out when a cluster loses quorum, for example.</p>
<p>The tool can do a couple more things not shown here. I recommend checking out the <a href="https://github.com/mlafeldt/chaosmonkey#readme">README on GitHub</a> for more information.</p>
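<p>For reference, a session with the tool might look like this sketch. The endpoint and group name are placeholders, and the flag names follow the README at the time; verify them against the current version before relying on them.</p>

```shell
# Install the chaosmonkey CLI (Go 1.x era; today you'd use "go install").
go get -u github.com/mlafeldt/chaosmonkey

# Terminate one instance of an auto scaling group via Chaos Monkey's
# REST API, using the default "ShutdownInstance" chaos strategy.
chaosmonkey -endpoint http://chaosmonkey.example.com:8080 \
    -group my-asg -strategy ShutdownInstance
```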
<p><img src="/assets/images/2017-06-28/17.png" alt="" /></p>
<p>One last tip on Chaos Monkey:</p>
<p>It’s possible to <a href="https://github.com/mlafeldt/docker-simianarmy/blob/master/docs/notifications.md">hook it up to Slack</a> so that your team is aware of ongoing chaos experiments. This slide shows an example message from one of our Slack channels.</p>
<p>For obvious reasons, visibility is important when doing Chaos Engineering.</p>
<p><img src="/assets/images/2017-06-28/18.png" alt="" /></p>
<p>So far, we talked about Chaos Monkey and how it can be used to terminate instances. Of course, terminating hosts is only one way to inject failures into your system. There’s certainly a lot you can learn from this exercise, but there are also many more places where your infrastructure can fail.</p>
<p>We are at ContainerDays, so I guess we’re interested in impacting not only the hosts our containers are running on but also the containers themselves, right?</p>
<p><img src="/assets/images/2017-06-28/19.png" alt="" /></p>
<p>This is where <a href="https://github.com/gaia-adm/pumba">Pumba</a> comes in.</p>
<p>Pumba is an open source tool designed for injecting failures into running Docker containers. Just like Chaos Monkey, you can use it to simulate real-world events. For example, you can kill specific containers or inject network errors into them.</p>
<p><img src="/assets/images/2017-06-28/20.png" alt="" /></p>
<p>Here’s an example showing Pumba in action.</p>
<p>We first start a test Ubuntu container. We then instruct Pumba to delay all outgoing network traffic from that container for 60 seconds. That’s basically how you can simulate latency to external services used by an application running in that container.</p>
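<p>In shell terms, the demo boils down to something like the following sketch (the container name and timings are just examples):</p>

```shell
# Start a long-running test container.
docker run -d --name test-me ubuntu sleep 600

# Delay all outgoing traffic from that container by 3 seconds (3000 ms),
# keeping the rule in place for 60 seconds.
pumba netem --duration 60s delay --time 3000 test-me
```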
<p><img src="/assets/images/2017-06-28/21.png" alt="" /></p>
<p>Now you might wonder how this works under the hood.</p>
<p>Internally, Pumba uses a tool called <code class="language-plaintext highlighter-rouge">tc</code> that talks to the traffic shaping API of the Linux kernel. However, it would be impractical to bake <code class="language-plaintext highlighter-rouge">tc</code> into every application container you want to attack.</p>
<p>So what Pumba does is spawn a sidecar container, which has <code class="language-plaintext highlighter-rouge">tc</code> preinstalled. And it will spawn that sidecar container in the network namespace of the target Ubuntu container. That’s what the <code class="language-plaintext highlighter-rouge">--net</code> option does here.</p>
<p>When the attack is over, Pumba uses the same mechanism to remove the traffic rule from the kernel again.</p>
<p>Pretty magical, but works well in practice.</p>
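<p>You can reproduce what Pumba does by hand, which is a good way to demystify it. This sketch assumes the <code class="language-plaintext highlighter-rouge">gaiadocker/iproute2</code> sidecar image that Pumba used as its default <code class="language-plaintext highlighter-rouge">tc</code> image; check Pumba’s documentation for the current image name.</p>

```shell
# Add a 3-second egress delay to the target container by running tc in a
# sidecar that joins the target's network namespace (--net=container:...).
docker run -it --rm --net=container:test-me gaiadocker/iproute2 \
    tc qdisc add dev eth0 root netem delay 3000ms

# Remove the rule again when done.
docker run -it --rm --net=container:test-me gaiadocker/iproute2 \
    tc qdisc del dev eth0 root netem
```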
<p><img src="/assets/images/2017-06-28/22.png" alt="" /></p>
<p>At the beginning of the talk, I promised to tell you more about Gremlin Inc.</p>
<p>So what do we do?</p>
<p><img src="/assets/images/2017-06-28/23.png" alt="" /></p>
<p><a href="https://www.gremlin.com/">Gremlin</a>, our product, provides failure as a service. It’s another powerful tool for resilience testing.</p>
<p>You can use Gremlin to attack hosts and containers. In the future, it will also be possible to attack applications directly through application-level fault injection.</p>
<p>We support a variety of attacks that you can run from our Web UI or the command line. One unique feature of Gremlin is that it can safely revert all impact. We also provide security features like auditing and access control out of the box.</p>
<p>Gremlin is currently in closed beta. If you want to give it a try, talk to me, and I will send you an invite.</p>
<p><img src="/assets/images/2017-06-28/24.png" alt="" /></p>
<p>Let’s wrap this up. What are the takeaways from this talk?</p>
<p>The main takeaway is that building resilient systems requires experience with failure. Don’t wait for things to break in production, but rather inject failure proactively in a controlled way.</p>
<p>Use one or more of the many chaos tools available today. Use Chaos Monkey, use Pumba, use Gremlin – whatever works for you.</p>
<p>Please keep in mind that it’s important to start small! Don’t wreak havoc on production from day one and tell your boss it was my idea. Instead, start by experimenting with a virtual machine or a staging environment. Then slowly ramp up your testing efforts.</p>
<p><img src="/assets/images/2017-06-28/25.png" alt="" /></p>
<p>For those of you who have enjoyed this presentation and want to learn more about Chaos Engineering or SRE in general, I wrote a lot of articles on these topics in the last year. Check out my <a href="/production-ready">Production Ready mailing list</a> and feel free to talk to me about anything afterward.</p>
<p><img src="/assets/images/2017-06-28/26.png" alt="" /></p>
<p>Thank you.</p>
Go, Mental Models, and Side Effects2017-06-14T00:00:00+02:00https://sharpend.io/go-mental-models-and-side-effects<p><a href="/every-day-we-must-sweep">I recently wrote about my struggle</a> of learning Rust and Java at the same time. Since then, a lot of people have asked me if I finally came to grips with both programming languages. The short answer: yes and no. Yes, I do enjoy building things in Rust, although its compile times still demand <a href="https://xkcd.com/303/">considerable patience</a>. And no, if I were given a choice, I wouldn’t use Java for my daily work, mainly because it feels heavy and tends to encourage programmers to write overly verbose code with too much abstraction, which goes against <a href="/simplicity-a-prerequisite-for-reliability">one of my core values</a>.</p>
<p>One thing I’ve learned over the years – sometimes the hard way – is that learning a new language usually doesn’t happen in a vacuum. More often than not, there already exists a body of code, infrastructure, conventions, team values, etc. you need to grasp if you want to contribute in a meaningful way. This has led me to believe that the degree to which one enjoys a programming language invariably depends on the given environment. In other words, if you don’t like language X, it would be too easy – and certainly unfair – to put the sole blame on language X. Look around; X is only part of the equation.</p>
<p>That caveat aside, however, I still believe that some programming languages make it easier to write <em>maintainable code</em> than others. One excellent example is <a href="https://golang.org/">Go</a>.</p>
<h2 id="the-virtues-of-go">The virtues of Go</h2>
<p>Go is a straightforward, no-frills programming language that is almost <em>non-magical</em>. You might even call Go boring, but <a href="/sometimes-boring-is-better">being boring is actually a good thing</a> when it comes to technology. After all, software should behave predictably and accomplish its goals without too many surprises. <strong>Code readability – and maintainability – first, language features second.</strong> What good is the latter without the former?</p>
<p>Go’s tooling is mature (okay, maybe except for dependency management). There’s a vast amount of <em>working</em> libraries, making it a perfect fit for developing delightful command-line tools, lightning-fast web servers, and robust distributed systems. Betting on Go, which is a breeze to learn, makes it easier for businesses to hire and onboard new developers (in particular if onboarding only involves telling the new employee where the source code can be found, but I digress).</p>
<p>Having used Go successfully in production for years, I could go on and on and on, but I’ll stop here.</p>
<p>Of course, Go is not a panacea. You can still write terrible code in it. More than once I ended up in interface hell when trying to navigate some Go code. <a href="http://peter.bourgon.org/blog/2017/06/09/theory-of-modern-go.html">Peter Bourgon also rightly notes</a> that despite being a non-magical language, there are still a few ways magic can creep in through the use of global state – beware of unpredictable side effects!</p>
<p>Yet I’d argue that, by and large, Go’s explicitness and lack of fancy language features make it harder for programmers to create a <a href="http://www.laputan.org/mud/">Big Ball of Mud</a>. Go is relatively easy to read, understand, and reason about. In other words, <strong>Go minimizes the effort required to build a mental model of a program</strong>.</p>
<p><img src="/assets/images/mental-models.jpg" alt="" /></p>
<p>What’s a mental model? I’m glad you’re asking.</p>
<h2 id="mental-models-101">Mental models 101</h2>
<blockquote>
<p>[A mental model] is a representation of the surrounding world, the relationships between its various parts and a person’s intuitive perception about his or her own acts and their consequences. Mental models can help shape behavior and set an approach to solving problems and doing tasks.</p>
</blockquote>
<p>That’s what <a href="https://en.wikipedia.org/wiki/Mental_model">Wikipedia</a> says. Here’s another explanation, this one <a href="http://queue.acm.org/detail.cfm?id=3068754">specific to programming</a>:</p>
<blockquote>
<p>At a fundamental level, all software describes changes in the state of a system over time. Because the number of states and state transitions in software can have combinatorial complexity, programmers necessarily rely on approximations of system behavior (called mental models) during development. The intent of a mental model is to allow programmers to reason accurately about the behavior of a system.</p>
</blockquote>
<p>We all carry different, more or less accurate images of how something works in our heads, be it the weather or cars or code.</p>
<p>This goes for all engineering teams. Alice might be the one who knows the most about the new distributed cron solution (written in Go, of course). Hence she possesses the most comprehensive model of this particular component and how it interacts with other components. However, when it comes to service deployments and secrets management, Bob and Ted might have a clearer picture of what’s going on.</p>
<h2 id="systems-blindness">Systems blindness</h2>
<p><a href="/systems-blindness-and-how-we-deal-with-it">Systems are invisible to our eyes</a>. The best thing we can do is understand them indirectly through mental models, and then perform actions based on these models. Unfortunately, these models tend to be incomplete – or just plain wrong. Blame complexity: the more complex a system, the more difficult it becomes for our brain to build a correct mental model of it.</p>
<p>Remember that I warned you about global state in Go? Well, in actuality, <strong>there are no side effects, just effects that result from our flawed understanding of the system.</strong></p>
<p><a href="http://queue.acm.org/detail.cfm?id=3068754">Bugs occur</a> because of an incomplete mental model:</p>
<blockquote>
<p>Since mental models are approximations, they are sometimes incorrect, leading to aberrant behavior in software when it is developed on top of faulty assumptions…. The most sinister bugs occur when programmers falsely believe their mental models to be complete…. Throwing away the mental model is crucial to forming a sound hypothesis [when debugging].</p>
</blockquote>
<p>Good programming languages can <em>reduce the scope</em> of the mental model developers must maintain, affecting both programming and debugging in a positive way. I believe Go meets the criteria. I miss using it on a daily basis.</p>
Getting Things Right With Checklists2017-06-01T00:00:00+02:00https://sharpend.io/getting-things-right-with-checklists<p><img src="/assets/images/checklist.jpg" alt="" /></p>
<p>I’ve always been a very organized person, sometimes to an obsessive degree (<em>don’t touch my stuff!</em>). I like being on top of things. I don’t want to miss any steps when carrying out a task. Naturally, checklists have long been one of my favorite tools to get things done.</p>
<p>When it comes to software development, for example, it’s hard to beat <a href="https://help.github.com/articles/about-task-lists/">GitHub’s task lists</a>. To plan and prioritize personal tasks, I also like to write checklists on paper. And of course, there are plenty of apps out there to choose from, all based on the simple but powerful idea of the checklist.</p>
<p>But although I’ve used different checklists for many years and firmly believe in their effectiveness, I realized that my knowledge of them was superficial at best. That’s why I decided to read <a href="http://atulgawande.com/book/the-checklist-manifesto/">The Checklist Manifesto</a> by Atul Gawande, which was recommended to me multiple times.</p>
<p>The book is about Gawande’s journey in designing a checklist for surgery. He, quite surprisingly, draws most lessons from aviation and construction, where checklists are standard practice, and applies them to medicine. He ultimately succeeds (spoiler alert). The final 19-item <a href="http://www.who.int/patientsafety/safesurgery/checklist/">Surgical Safety Checklist</a> has gone on to show a significant reduction in the number of complications and deaths by decreasing errors and increasing teamwork in surgery.</p>
<p>The following is a summary of what I learned from reading <em>The Checklist Manifesto</em>.</p>
<h2 id="man-is-fallible">Man is fallible</h2>
<p>The first question that comes to mind: why do we need checklists in the first place?</p>
<p>The short answer: we humans need checklists because we are <em>fallible</em>. More precisely, we fail at what we set out to do for three reasons:</p>
<ul>
<li><strong>Necessary fallibility.</strong> Even enhanced by technology, some things are simply beyond our capacity and will remain outside our understanding and control.</li>
<li><strong>Ignorance.</strong> We make mistakes because we don’t know enough about the world and how it works. We still can’t predict the weather accurately. There are still diseases we cannot cure.</li>
<li><strong>Ineptitude.</strong> We fail to apply existing knowledge correctly. Despite our best efforts, we write buggy code. We construct buildings that collapse.</li>
</ul>
<p>In today’s complex world, our main problem is ineptitude. Gawande writes:</p>
<blockquote>
<p>science has filled in enough knowledge to make ineptitude as much our struggle as ignorance. […] Know-how and sophistication have increased remarkably across almost all our realms of endeavor, and as a result so has our struggle to deliver on them.</p>
</blockquote>
<p>He continues to depict in great detail how medicine in particular “has become the art of managing extreme complexity”:</p>
<blockquote>
<p>To save this one child, scores of people had to carry out thousands of steps correctly: placing the heart-pump tubing into her without letting in air bubbles; maintaining the sterility of her lines, her open chest, the exposed fluid in her brain; keeping a temperamental battery of machines up and running. The degree of difficulty in any one of these steps is substantial. Then you must add the difficulties of orchestrating them in the right sequence, with nothing dropped, leaving some room for improvisation, but not too much.</p>
</blockquote>
<p>Getting things right is becoming harder and harder every day – even for specialists who have received intense training. We need a different strategy.</p>
<p><strong>Checklists.</strong></p>
<p>Checklists do not only compensate for the limits of our memory and attention, but they also lead to higher performance by forcing us to be disciplined:</p>
<blockquote>
<p>people can lull themselves into skipping steps even when they remember them. In complex processes, after all, certain steps don’t always matter. […] Checklists seem to provide protection against such failures. They remind us of the minimum necessary steps and make them explicit. They not only offer the possibility of verification but also instill a kind of discipline of higher performance.</p>
</blockquote>
<h2 id="teamwork">Teamwork</h2>
<p>So checklists help us apply the knowledge we have consistently and correctly. However, that by itself is not enough. Gawande continues:</p>
<blockquote>
<p>the volume and complexity of what we know has exceeded our individual ability to deliver its benefits correctly, safely, or reliably. Knowledge has both saved us and burdened us.</p>
</blockquote>
<p>No one can do everything anymore, not in aviation, construction, or any other complex environment. The Genius Master Builder is dead. Besides, not everything can be reduced to a simple recipe.</p>
<p>But who says that a checklist only contains, say, tasks for constructing a building? In addition to decreasing errors, checklists can also be used to increase teamwork and communication in order to overcome individual weaknesses and deal with unexpected problems as a team. These <em>communication checklists</em> are used in construction and elsewhere to force specialists to talk to each other:</p>
<blockquote>
<p>While no one could anticipate all the problems, [experts] could foresee where and when they might occur. The checklist therefore detailed who had to talk to whom, by which date, and about what aspect of construction – who had to share (or “submit”) particular kinds of information before the next steps could proceed.</p>
</blockquote>
<p>Above all, the goal of checklists is to <strong>embrace a culture of teamwork and discipline</strong>. <a href="https://www.ted.com/talks/atul_gawande_how_do_we_heal_medicine">Complexity requires group success</a>.</p>
<h2 id="what-makes-a-good-checklist">What makes a good checklist?</h2>
<p>Now that we know the “why” of checklists, let’s take a look at the “how”. What exactly makes a good checklist?</p>
<p>According to the book, good checklists are:</p>
<ul>
<li>Precise, efficient, to the point</li>
<li>Reminders of the <em>most important steps</em>, not comprehensive how-to guides</li>
<li>Between five and nine items long (the limit of working memory)</li>
<li>Quick and simple tools to support the skills of experts</li>
<li>Practical – tested in the real world</li>
<li>Easy to use even in difficult situations</li>
<li>Written in simple and exact language, using familiar terms of the profession</li>
<li>Frequently revisited and refined to help rather than hinder</li>
</ul>
<p>Unless the moment is obvious, we also need to define <em>pause points</em> at which a checklist is supposed to be used in a process. We can choose between two options: a DO-CONFIRM checklist (perform tasks from memory, then stop to verify) or a READ-DO checklist (carry out tasks as you check them off). Both have pros and cons.</p>
<h2 id="adoption">Adoption</h2>
<p>To reap the benefits of checklists, we need to be willing to adopt them as part of our daily work and, ultimately, our company culture. Checklists alone cannot make anyone follow them.</p>
<p>As Gawande knows all too well, we will meet resistance when introducing a checklist at a larger scale. Besides <a href="/the-pros-and-cons-of-eating-your-own-dog-food">eating your own dog food</a>, he came to the conclusion that the first people using a new checklist should “have the seniority and patience to make the necessary modifications and not dismiss the whole enterprise”. Sounds reasonable.</p>
<p>Some people may still object that checklists are merely about ticking boxes, as this passage from the book points out well:</p>
<blockquote>
<p>It somehow feels beneath us to use a checklist, an embarrassment. It runs counter to deeply held beliefs about how the truly great among us – those we aspire to be – handle situations of high stakes and complexity. The truly great are daring. They improvise. They do not have protocols and checklists. Maybe our idea of heroism needs updating.</p>
</blockquote>
<p>It’s important to understand that a checklist isn’t just a protocol you’re supposed to follow mindlessly. It is rather <em>supporting</em> us in our work:</p>
<blockquote>
<p>The checklist gets the dumb stuff out of the way, the routines your brain shouldn’t have to occupy itself with (Are the elevator controls set? Did the patient get her antibiotics on time? Did the managers sell all their shares? Is everyone on the same page here?), and lets it rise above to focus on the hard stuff (Where should we land?).</p>
</blockquote>
<p>Even with checklists, there’s still plenty of room for individual judgment and performance.</p>
<h2 id="a-production-readiness-checklist">A production-readiness checklist</h2>
<p>Of course, I wouldn’t write this article if there wasn’t a direct relationship to software development and running systems in production. <strong>Web systems are also among the complex environments where we can – and should – use checklists for greater efficiency, consistency, and safety.</strong></p>
<p>I want to wrap this up with a practical example. While <a href="https://gitlab.com/gitlab-com/runbooks">runbooks</a> make for great checklists in web operations, I found a different example by reading yet another book.</p>
<p>In <a href="http://shop.oreilly.com/product/0636920053675.do">Production-Ready Microservices</a>, Susan J. Fowler provides a useful checklist to decide whether a microservice is ready for production or not. According to Fowler, a production-ready service must be:</p>
<ol>
<li>Stable and reliable</li>
<li>Scalable and performant</li>
<li>Fault tolerant and prepared for any catastrophe</li>
<li>Properly monitored</li>
<li>Documented and understood</li>
</ol>
<p>For each item, the production-readiness checklist defines specific criteria that must be met. For example, this is what it takes for a microservice to be fault tolerant (3):</p>
<ul>
<li>It has no single point of failure.</li>
<li>All failure scenarios and possible catastrophes have been identified.</li>
<li>It is tested for resiliency through code testing, load testing, and chaos testing.</li>
<li>Failure detection and remediation has been automated.</li>
<li>There are standardized incident and outage procedures in place within the microservice development team and across the organization.</li>
</ul>
<p>I think it’s a fantastic idea to create production-readiness checklists for system components if your goal is to build standardized systems across an engineering organization. For my part, I’m planning to go deeper into the topic and come up with some checklists of my own.</p>
<p><strong>Our jobs aren’t too complicated to reduce to a checklist. In fact, we are more likely to fail if we don’t try.</strong></p>
Premortems: The Art of Negative Visualization2017-05-17T00:00:00+02:00https://sharpend.io/premortems-the-art-of-negative-visualization<p><img src="/assets/images/rome.jpg" alt="" /></p>
<p>I’ve become <a href="https://www.instagram.com/p/BTwWzXGD2_E/">a big fan</a> of Ryan Holiday’s work. It comes as no surprise that one of his books, <em>Ego Is The Enemy</em>, has inspired me to write <a href="/every-day-we-must-sweep">Every Day We Must Sweep</a> a couple of weeks ago. I couldn’t help it and also read the critically acclaimed predecessor, <em><a href="http://theobstacleistheway.com/">The Obstacle Is The Way</a></em>, shortly after. It too turned out to be a remarkable book, drawing on the vast experience of Emperor Marcus Aurelius and other famous historical figures. Again, I highlighted dozens and dozens of passages I considered to be noteworthy. One particular chapter stood out to me, though. It’s called “Anticipation (Thinking Negatively)” and it has a direct, almost disturbing relevance to my daily work. If it were possible, I would give you a copy of it right now, but I’m afraid my interpretation and lots of quotes will have to do for the time being.</p>
<p>The chapter begins with a discussion of <em>premortems</em>, which are said to be popular in the business world. The basic idea of a premortem is that managers encourage employees to <em>think negatively</em> – in terms of worst-case scenarios – when preparing for major events like a product launch. Holiday writes:</p>
<blockquote>
<p>In [a premortem], we look to envision what could go wrong, what will go wrong, in advance, before we start. Far too many ambitious undertakings fail for preventable reasons. Far too many people don’t have a backup plan because they refuse to consider that something might not go exactly as they wish.</p>
</blockquote>
<p>Astute readers will notice the play on words here. Premortem is, of course, the opposite of <em>postmortem</em>. With a postmortem, we’re examining something after it happened so that we can learn and improve for the next time a similar situation occurs. In the tech world, <a href="/writing-your-first-postmortem">postmortems are the ultimate tool</a> to learn from server outages and other failures.</p>
<p>Being a software engineer myself, I had to smile when reading about “premortems” for the first time. At that moment, however, I also realized that this seemingly odd term resembles what I do for a living like few other words – but read on.</p>
<p>According to Holiday, this practice – this form of <em>negative visualization</em> – can be attributed to ancient Stoic philosophers:</p>
<blockquote>
<p>A writer like Seneca would begin by reviewing or rehearsing his plans, say, to take a trip. And then he would go over, in his head (or in writing), the things that could go wrong or prevent it from happening: a storm could arise, the captain could fall ill, the ship could be attacked by pirates.</p>
</blockquote>
<blockquote>
<p>Always prepared for disruption, always working that disruption into our plans. Fitted, as they say, for defeat or victory. And let’s be honest, a pleasant surprise is a lot better than an unpleasant one.</p>
</blockquote>
<p>Rehearsing plans.</p>
<p>Imagining what could go wrong.</p>
<p>Preparing for disruption.</p>
<p>Given my background in web operations and <a href="/blog/im-joining-gremlin-inc">my current job</a>, guess what immediately came to my mind? Call me crazy, but it was the idea of <a href="/breaking-things-on-purpose">breaking things on purpose</a> – imagining and simulating potential errors in advance – in order to build systems that are resilient to failures.</p>
<p>Holiday continues:</p>
<blockquote>
<p>Your world is ruled by external factors. Promises aren’t kept. […] We are dependent on other people. […] The only guarantee, ever, is that things will go wrong. The only thing we can use to mitigate this is anticipation.</p>
</blockquote>
<p><a href="/the-myth-of-the-root-cause">All complex systems will fail</a>. There will always be something that can – and <em>will</em> – go wrong. From self-inflicted outages caused by bad configuration or buggy images to events outside our control like denial-of-service attacks or network failures.</p>
<p><em>Anticipation</em> is key: hoping for the best, preparing for the worst. (I don’t want to sound overly pessimistic, but this maxim must exist for a reason.)</p>
<blockquote>
<p>As a result of our anticipation, we understand the range of potential outcomes and know that they are not all good (they rarely are). We can accommodate ourselves to any of them. We understand that it could possibly all go wrong. And now we can get back to the task at hand. […] <strong>We are prepared for failure and ready for success.</strong></p>
</blockquote>
<p>By being well prepared, we will not be caught by surprise and won’t be disappointed. And even if there’s nothing we can do about an outage or similar events, we could use that as a practice to <em>manage our expectations</em>. Because sometimes the only way out is through.</p>
<p>Turns out the Stoics knew more about web technology than you’d think.</p>
The Discipline of Chaos Engineering (Gremlin)2017-05-03T00:00:00+02:00https://sharpend.io/the-discipline-of-chaos-engineeringThere's nothing like a good spike2017-04-19T00:00:00+02:00https://sharpend.io/theres-nothing-like-a-good-spike<p><img src="/assets/images/spikes.jpg" alt="" /></p>
<p>I’m a big proponent of <a href="http://wiki.c2.com/?SpikeSolution">spike solutions</a>. A spike is a simple end-to-end solution to a technical problem. Spikes are <em>cheap</em> – and often dirty – implementations that are meant to be thrown away after exploration. It’s fine to create multiple spikes to explore different directions when dealing with a tough engineering challenge. Spikes should only be concerned with the problem at hand, independent of existing code, best practices, and similar ceremony. That means, for the first time in your career, you’re encouraged to write some lousy shell scripts to get the job done. I do it all the time – no one will blame you.</p>
<p>Spikes are all about <em>reducing risk</em> by reducing the number of unknowns. Spikes help you verify that you’re on the right track, that what you’ve imagined is indeed possible given the constraints you face. Spikes also make for more accurate estimates of development costs. Having trouble predicting how long a feature is going to take? Or if it’s feasible at all? Create a spike, and your decisions will be all the wiser.</p>
<p>The great thing about spikes is that they can prove you wrong <em>before</em> you’ve wasted a lot of resources trying to build The Real Thing™. They help you <a href="/perfectionism-and-programming">overcome perfectionism</a> by making sure you don’t lose yourself in details right from the get-go. No wonder spikes are listed among the <a href="http://www.extremeprogramming.org/rules.html">Rules of Extreme Programming</a>.</p>
<p>That’s enough praising for now. Let me give you a real-world example. Last year, when I was still working at Jimdo, we were looking for a more reliable service for running periodic batch jobs inside our PaaS. One of the candidates that caught my attention was <a href="https://www.nomadproject.io/">Nomad</a>, the cluster scheduler developed by HashiCorp.</p>
<p>As the <a href="http://shop.oreilly.com/product/0636920030355.do">Lean Enterprise</a> book has taught us, we should only spend time on automation for products or features once they have been validated. Anything else is <em>wasteful</em>. Rather than rushing to automate Nomad’s setup in AWS on day one (which later turned out to take a couple of weeks), I decided to create a spike first. Before adding yet another tool to our stack and taking on <a href="/the-burden-of-running-systems">the burden of operating it</a>, I wanted to learn more about Nomad’s capabilities and figure out if it was the right choice for us.</p>
<p>As I mentioned, we were looking for a <em>reliable</em> cron solution. The spike’s purpose was to convince us that Nomad was at least worth a closer look. To that end, I asked myself: what would be the simplest setup to achieve that goal?</p>
<p>I ended up doing some <a href="https://gist.github.com/mlafeldt/d74f2d5c3594216ca5683629b8af855c">local testing with Vagrant</a> based on a demo that ships with Nomad. For the spike, I started a Nomad server and two Nomad clients inside the virtual machine managed by Vagrant. I then created a minimal periodic job that would send a message to a Slack channel every minute. Based on this quick experiment, I was able to learn more about Nomad’s mechanics and test its resilience to different kinds of injected failures, like killing one or both clients. (During testing, I actually encountered a serious bug in Nomad causing successful batch jobs to be run again after restarting a stopped client. Luckily, that bug had been fixed in master the day before…)</p>
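<p>For the curious, a periodic Nomad job of that kind can be sketched roughly as follows. This is a reconstruction from memory rather than the original spike: the job and task names and the Slack webhook URL are made up, and the stanza names should be checked against the Nomad documentation for your version.</p>

```shell
# Write a minimal periodic batch job that pings Slack every minute,
# then submit it to the local Nomad server.
cat > slack-cron.nomad <<'EOF'
job "slack-cron" {
  datacenters = ["dc1"]
  type        = "batch"

  periodic {
    cron             = "* * * * *"  # every minute
    prohibit_overlap = true
  }

  group "notify" {
    task "ping-slack" {
      driver = "raw_exec"
      config {
        command = "curl"
        args    = ["-s", "-X", "POST", "--data",
                   "{\"text\": \"Hello from Nomad!\"}",
                   "https://hooks.slack.com/services/XXX/YYY/ZZZ"]
      }
    }
  }
}
EOF

nomad run slack-cron.nomad
```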
<p>All in all, the spike was a success. We continued doing more experiments in AWS, gradually going from a one-node setup to operating Nomad in a highly available fashion.</p>
<p>In a sense, spikes are low-risk experiments – not unlike <a href="/breaking-things-on-purpose">Chaos Engineering experiments</a> – to validate assumptions early on. I consider them an essential part of my software development toolbox.</p>
<p>Next time you hit a roadblock, a spike or two might be all you need to move forward.</p>
Every Day We Must Sweep2017-04-05T00:00:00+02:00https://sharpend.io/every-day-we-must-sweep<p>I’m in the midst of learning two new programming languages: Java and Rust.</p>
<p>I’m struggling, and it’s my fault.</p>
<p>At Gremlin, <a href="/blog/im-joining-gremlin-inc">the company I joined a month ago</a>, we use Java for our backend API service and Rust for both our client, which injects infrastructure failures, and our daemon, which communicates with the API.</p>
<p>It’s been a while since I felt this stupid when looking into a new programming language, let alone two that are so radically different.</p>
<p>To be honest, I’m not completely new to either Java or Rust. I wrote some Java Card programs in my first job after university (Java Card is a small Java runtime for running crypto code on SIM cards). That was almost ten years ago. Since then, I’ve made my fair share of jokes about Java, covering topics like software bloat, enterprise readiness, and – of course – desktop updates. Recently, it all came back to bite me when it literally took me days to get started with Eclipse, Gradle, dependency management, and whatnot.</p>
<p>I also have an on-off relationship with Rust, the systems programming language praised for being super fast and super safe. Before Gremlin, I successfully rewrote two of my C tools in Rust. However, I never <em>really</em> learned the language and its myriad of programming concepts. Yes, Rust lets you control almost everything, but as a beginner, it’s also hard to compile <em>anything</em>. As a result, Rust continues to make me look bad. (By the way, compile times are ridiculously slow compared to Go, which I’ve used for the last four years.)</p>
<p><img src="/assets/images/rust-makes-me-look-bad.jpg" alt="" /></p>
<p>So apparently, I don’t particularly like Java and Rust, and this is just another rant about programming languages? Actually, no, not at all.</p>
<h2 id="ego-is-the-enemy">Ego is the enemy</h2>
<p>Some days ago, I finished reading <a href="http://egoistheenemy.com/">Ego Is The Enemy</a> by Ryan Holiday. As the title says, the book is devoted to the treacherous nature of ego. It draws on a vast array of stories of people who didn’t let ego control their actions and decisions on their road to success. On the other hand, the book contains just as many tales of individuals who lost the inner battle against ego.</p>
<p>Here are a few of the quotes I highlighted on my Kindle:</p>
<blockquote>
<p>If ego is the voice that tells us we’re better than we really are, we can say ego inhibits true success by preventing a direct and honest connection to the world around us.</p>
</blockquote>
<blockquote>
<p>As success arrives, like it does for a team that has just won a championship, ego begins to toy with our minds and weaken the will that made us win in the first place.</p>
</blockquote>
<blockquote>
<p>We must […] continue working on what got us here. Because that’s the only thing that will keep us here.</p>
</blockquote>
<blockquote>
<p>Just because you did something once, doesn’t mean you’ll be able to do it successfully forever. Reversals and regressions are as much a part of the cycle of life as anything else.</p>
</blockquote>
<blockquote>
<p>The problem is that when we get our identity tied up in our work, we worry that any kind of failure will then say something bad about us as a person.</p>
</blockquote>
<blockquote>
<p>Daniele Bolelli once gave me a helpful metaphor. He explained that training was like sweeping the floor. Just because we’ve done it once, doesn’t mean the floor is clean forever. Every day the dust comes back. Every day we must sweep.</p>
</blockquote>
<p>What in the world does this have to do with me learning Java and Rust? Everything.</p>
<h2 id="being-a-beginner-again">Being a beginner again</h2>
<p>Starting a new gig is a huge decision, in part because you have to learn so many new things and relearn stuff you had long forgotten. Sure, no one can take away your hard-earned experience. However, there are other areas, like programming languages, where you have to start over, where you aren’t an expert (yet).</p>
<p>Being a beginner again is hard. I know it all too well. It can be frustrating. I’m used to producing quality work in a short time span, but projects that took hours now take days. I keep asking my (patient and helpful) coworkers about the most basic things, which often makes me feel outright dumb. Am I too hard on myself? Probably.</p>
<p>Reading Holiday’s brilliant book helped me find the true cause of my struggle: ego. It is my ego that keeps telling me that I’m better than this, that my past performance is a guarantee of future success, that I’m too professional to be playing in the amateur league again.</p>
<p>The truth is I’m all for learning new things. As an engineer, I want to be able to contribute to Gremlin’s codebase. I want to write Java and Rust code that is used in production. And yes, I do seek the <em>uncomfortable</em>. In fact, that’s one of the reasons why I quit my last job.</p>
<p>To that end, I don’t want ego to get in my way. I know I need to discard any preconceived notions about programming languages I’ve never actually used for an extended period (and which are only a means to an end anyway).</p>
<p>Despite the little experience I have at this point, I must admit that Java 8 streams are pretty neat. I like the refactoring and debugging capabilities of modern-day Java IDEs. Rust reminded me of <a href="/what-i-learned-from-hacking-video-games">my game hacking days</a> more than any other language. Cargo, Rust’s package manager, is terrific. I’m sure there are many more amazing things to learn – but only if I manage to leave my ego at the door and continue working on what got me here in the first place.</p>
<p><strong>Every day the dust comes back. Every day we must sweep.</strong></p>
Breaking Things on Purpose (Gremlin)2017-03-16T00:00:00+01:00https://sharpend.io/breaking-things-on-purposeSometimes Boring Is Better2017-03-08T00:00:00+01:00https://sharpend.io/sometimes-boring-is-better<p><img src="/assets/images/boring.jpg" alt="" /></p>
<p>The recent announcement of <em>Docker Enterprise Edition</em> brought back some bad memories. After having used Docker in production for two years, I honestly don’t have a lot of faith in its overall stability and direction – enterprise or not. CoreOS failed to ship a stable Docker version on multiple occasions despite their efforts to provide a production-grade <em>Container Linux</em>. Throw <em>ecs-agent</em> into the mix, and you’re guaranteed to have a lot of headaches due to broken/blocked cluster updates. Once bitten, twice shy.</p>
<p>To be fair, using a supposedly stable tool or OS does not relieve you from doing your own sanity testing. As engineers know, many problems only manifest themselves at scale in production (which is why <a href="/from-zero-to-staging-and-back">verifying things in staging</a> only works to a certain degree). Also, remember that we’re dealing with complex web systems, which are <a href="/how-complex-web-systems-fail-part-1">prone to failure</a>.</p>
<p>So yes, we should cut Docker, CoreOS, and Amazon ECS some slack. All of them have their place. All of them are by themselves groundbreaking technologies that, taken together, enable us to do amazing things, like running a company-wide PaaS for production services. No, I don’t blame them. In fact, I’m glad they exist. On the other hand, they’re still good examples for making the following point.</p>
<p>What do Docker, CoreOS, and ECS have in common? All three are relatively <em>new technologies</em>. Some might even call them “bleeding edge” (I won’t). In any case, all three are the opposite of boring – they’re rather hip and shiny. The point of this article is that, <strong>when it comes to technology, sometimes boring is actually better</strong>.</p>
<p>Over the last couple of months, I’ve read a number of articles on the merits of choosing boring technology. <a href="http://mcfunley.com/choose-boring-technology">Dan McKinley’s article</a> is without a doubt one of the best pieces on the topic. It’s worth reading from beginning to end, but here are some of my favorite quotes:</p>
<blockquote>
<p>The nice thing about boringness (so constrained) is that the capabilities of these things are well understood. But more importantly, their failure modes are well understood. […] But for shiny new technology the magnitude of unknown unknowns is significantly larger, and this is important.</p>
</blockquote>
<p>In other words, <strong>software that has been around for a decade is well understood and has fewer unknowns.</strong> Fewer unknowns mean less operational overhead, which is a good thing.</p>
<blockquote>
<p>One of the most worthwhile exercises I recommend here is to consider how you would solve your immediate problem without adding anything new. […] It’s helpful to write down exactly what it is about the current stack that makes solving the problem prohibitively expensive and difficult.</p>
</blockquote>
<p><a href="/the-burden-of-running-systems">New systems mean new problems</a>, so we should think twice before adding anything new to an otherwise boring and well-understood stack.</p>
<blockquote>
<p>set clear expectations about migrating old functionality to the new system. The policy should typically be “we’re committed to migrating,” with a proposed timeline. The intention of this step is to keep wreckage at manageable levels, and to avoid proliferating locally-optimal solutions.</p>
</blockquote>
<p>Timeboxing migrations is an excellent idea I probably should have applied a couple of times in the past. As for locally-optimal solutions, like using new technology X in a single place without good reason, John Allspaw had the following to say <a href="https://medium.com/s-c-a-l-e/microservices-monoliths-and-laser-nail-guns-how-etsy-finds-the-right-focus-in-a-sea-of-cf718a92dc90">in an interview about Etsy</a>:</p>
<blockquote>
<p><strong>we want to exploit the advantages of having a relatively finite number of well-known tools.</strong> […] the advantages of being more optimal do not outweigh the advantages of using the same language [PHP] a lot. […] In the same way, there’s a massive advantage in using a default data store, MySQL.</p>
</blockquote>
<p>He points out that if something breaks for some reason, each new tool is another thing an engineer has to understand, making it more difficult for the company to be resilient:</p>
<blockquote>
<p>The thing is, when you pull something shiny and new off the shelf, there can be operational overhead. If it breaks and you’re the only one who knows how it works, then it probably wasn’t a great technical choice. […] <strong>We want to plan for a world where stuff breaks all the time. And we want to make it so that when things break they matter a lot less, that they’re not critical.</strong> That they break and we can fix them and we can adapt and be resilient.</p>
</blockquote>
<p>I, too, believe that being a bit more conservative and slowing down the pace would benefit our industry.</p>
<p>Skyliner, the AWS launch platform, is certainly a prime example of <a href="https://blog.skyliner.io/the-happy-genius-of-my-household-2f76efba535a">this philosophy</a>:</p>
<blockquote>
<p>Skyliner doesn’t use registries, scheduling, service discovery, virtualized networking, or any other advanced features. Instead, we use AWS services with proven reliability like S3, Autoscaling, and Elastic Load Balancing – services which have seen almost a decade of continuous use and improvement.</p>
</blockquote>
<blockquote>
<p>As software developers, we understand the allure of shiny new technologies, but ultimately we decided that <strong>we prefer the quiet satisfaction of sleeping through a night while on call</strong>. After all, we’re not just building a platform for our customers – we run our applications on Skyliner, too.</p>
</blockquote>
<p>So, should we give up and stop using advanced container technologies altogether? Absolutely not. What we need, first and foremost, is a <a href="https://medium.com/@bob_48171/an-ode-to-boring-creating-open-and-stable-container-world-4a7a39971443">simple, boring container implementation</a> that just works. More generally speaking, what we need are <em>stable building blocks</em>. Everything on top – our production systems – will flourish from there.</p>
<p>Docker and friends aren’t boring yet, but eventually they will be. I’m looking forward to that day.</p>
The Pros and Cons of Eating Your Own Dog Food2017-02-22T00:00:00+01:00https://sharpend.io/the-pros-and-cons-of-eating-your-own-dog-food<p>In the software industry, eating your own dog food – or <em>dogfooding</em> – is a common approach for companies to test their product or service by letting employees use it in real-life scenarios. The idea is that by using your own software just as a customer would, you can proactively validate and incrementally improve it before releasing it to the world. Besides quality control, relying on your own product also makes for good marketing.</p>
<p>One of my favorite books, <a href="https://scottberkun.com/yearwithoutpants/">The Year Without Pants</a> by Scott Berkun, contains this memorable account of dogfooding, which I found to be worth quoting in length (I added line breaks for readability):</p>
<blockquote>
<p>It turned out the adjustment was easy. The Internet Explorer team at Microsoft had an equivalent, called the daily build, where we released a version of the software every day, but it was available exclusively inside the company. Each day all the changes from the previous day were compiled and released, and everyone was expected to install and use them. <strong>This gave us regular feedback on the quality of what we were making, including nuggets of joy, or moments of misery, when new features were added.</strong></p>
</blockquote>
<blockquote>
<p>On good days, the builds were high quality, and we called those releases self-host, as in “safe to host on your computer.” Builds that were mediocre were called self-test, suggesting you install it only on a test computer (or a coworker’s when the person wasn’t looking). The worst builds were called self-toast, meaning you’d destroy whatever machine you had dared install it on. Whenever we had three days in a row with self-toast builds, all new work stopped until we got the build quality up to a good level (a measure to prevent the project from digging a dangerously deep quality hole for itself).</p>
</blockquote>
<blockquote>
<p>Shipping on WordPress.com was the same philosophy, just accelerated and made public to customers. I didn’t find the lack of bigger plans or schedules a problem. In fact, it was mostly liberating.</p>
</blockquote>
<p>It’s true that dogfooding lends itself more to certain products than others. It doesn’t take a lot of imagination to see, for example, how it can be used effectively in these cases:</p>
<ul>
<li>Using Internet Explorer, a browser, to build Internet Explorer</li>
<li>Using Basecamp, a project management tool, to build Basecamp</li>
<li>Using GitHub, a code hosting/collaboration platform, to build GitHub</li>
</ul>
<p>In each example, the product in question is a tool engineers would need for everyday software development anyway – a perfect fit for eating your own dog food. For organizations/teams who don’t build user-facing products similar to the ones listed above, dogfooding might be limited to more constrained test scenarios, or it might be impossible to implement for the system as a whole.</p>
<p>I’ve spent the last 18 months of my career working on a project that’s suitable for dogfooding like no other before it. Building and operating <a href="/blog/a-journey-through-wonderland/">Wonderland</a>, Jimdo’s in-house PaaS, has provided us with the unique opportunity to run most of the platform’s services on the platform itself – basically everything that can be deployed via Docker.</p>
<p>We don’t throw our work over to QA and wait for bug reports (there’s no QA department anyway). More often than not, we are the first to feel the joy and, more importantly, <a href="/if-it-hurts-do-it-more-often">feel the pain</a> of using our PaaS in production. Dogfooding gives us invaluable feedback on both released features and work in progress; it helps us to detect many issues before our users do.</p>
<p>To some degree, dogfooding has also been useful for anticipating wishes and evaluating ideas expressed by other teams at Jimdo (our customers). However, it’s always a challenge when we don’t have a real need for a specific feature. For example, back when we added the ability to run periodic jobs in Wonderland, our team didn’t have a use case for them at first, so we created a somewhat contrived cron that posted some stats in a Slack channel. That’s <em>not</em> ideal.</p>
<p>While we’re the first users of our PaaS, we’re by no means the largest. That means we might not be the first team running into scaling issues. That’s unfortunate, but it’s also a matter of costs. I’m confident that, for instance, load tests in production would come in useful here. Speaking of testing, dogfooding is <em>no</em> substitute for traditional automated and usability testing. Nor is it an excuse for <a href="/the-burden-of-running-systems">depending on internal systems only</a>.</p>
<p>There’s also the danger that developers have knowledge about making the software work that a normal user will lack. In fact, it’s common for SaaS products to expose tons of system internals when being used in “developer mode” by their creators. That’s certainly useful but also something to be aware of.</p>
<p>Despite these disadvantages, I’m convinced that <strong>those who develop software should ideally be the first ones to use it on a day-to-day basis.</strong></p>
<p>Try hard to make your software a major part of your workflow. Put yourself in the shoes of existing or potential users as often as you can. Feed results back into the product’s design and code. Rinse, repeat. This way, overall quality and usability are very likely to increase over time.</p>
Perfectionism and Programming2017-02-08T00:00:00+01:00https://sharpend.io/perfectionism-and-programming<p><img src="/assets/images/perfectionism.png" alt="" /></p>
<p>I consider myself a perfectionist.</p>
<p>When I’m writing a program, I want my code to be correct. It has to get the job done. At the same time, I strive to produce code that is perfectly formatted, fast, and (to me) beautiful. That in itself wouldn’t be a problem if all I did was to follow the <a href="http://wiki.c2.com/?MakeItWorkMakeItRightMakeItFast">wisdom of Kent Beck</a> to “first make it work, then make it right, and, finally, make it fast.” Kent is right, of course. His advice is entirely reasonable. However, and here is the catch, it only works if you <em>know when to stop.</em></p>
<p>There’s nothing wrong with <a href="/always-leave-the-campground-cleaner-than-you-found-it">refactoring</a> and optimizing code per se – in fact, it’s essential to programming. But trying to Make It Right™ at any cost can be dangerous. I know this first hand. After spending countless nights in front of my computer, hacking on <a href="https://github.com/mlafeldt">open source projects</a>, I know what it’s like to get lost in implementing <em>yet another tweak</em> – a minor change that is supposed to make a difference. I know what it feels like to rewrite the same piece of code over and over again, with no end in sight. It’s frustrating, to say the least.</p>
<h2 id="writing">Writing</h2>
<p>I don’t strive for the best in everything I do, but writing is another creative endeavor where perfectionism used to get in my way.</p>
<blockquote>
<p>The biggest problem I face [with publishing content] is that I’m a perfectionist. I have a hard time writing shitty first drafts and postponing editing until after getting something down on paper first. Instead, I often try to get it “right” the first time, thereby making the writing process unnecessarily painful. The number of blog posts I’ve published this year is evidence enough of my struggle.</p>
</blockquote>
<p><a href="/blog/write-every-day">I wrote this in September 2014</a>. These days I still suck at drafting. Writing continues to be a painful process from time to time. While I’ve figured out how to publish articles on a regular basis – the not-so-secret secret is writing consistently – I’m fully aware of the fact that perfectionism is a formidable obstacle to getting things done.</p>
<p>I like the way Anne Lamott put it in her great book, <em>Bird by Bird</em>:</p>
<blockquote>
<p>Perfectionism is the voice of the oppressor, the enemy of the people. It will keep you cramped and insane your whole life, and it is the main obstacle between you and a shitty first draft. […] Perfectionism is a mean, frozen form of idealism</p>
</blockquote>
<h2 id="good-enough">Good enough</h2>
<p>Back to perfectionism and programming. Before you open your favorite editor and attempt to achieve an <a href="https://en.wikipedia.org/wiki/Perfectionism_(psychology)">unattainable ideal</a>, consider the following:</p>
<p>Unless you’re building low-latency trading systems or similar software where every microsecond counts, you probably don’t need to write the most efficient code ever conceived. Don’t fall prey to <a href="http://wiki.c2.com/?PrematureOptimization">premature optimization</a>. And unless you have a strong reason to believe that your current design is inadequate, chances are that another layer of perfectly crafted abstraction won’t save the day. Moreover, there is no need to thoroughly document every method of your code unless you’re providing an API that is actually used by somebody.</p>
<p>When I work on some piece of code and I feel the urge for perfection creeping in my head, I like to ask myself these questions:</p>
<ul>
<li>Does this change <em>really</em> make a difference? Is it worth my time? Consider engineering costs vs. value created. Think long- and short-term.</li>
<li>Does it provide value to the users of my software? Users typically don’t care about internals like program code.</li>
<li>Does it matter to my coworkers? My boss? My future self?</li>
</ul>
<p>By all means, I <em>don’t</em> want you to ship crappy code. But I do want you to remember that you have a choice – and more often than not, settling for <em>good enough</em> is a valid choice to make in a tech world that values shipping above all else.</p>
<p>At this point, I feel obliged to remind you that <a href="/the-power-of-less-code">the best code is no code at all</a>. Perfectionist or not, always start by looking for solutions that don’t involve writing code.</p>
<h2 id="tools-and-practices">Tools and practices</h2>
<p>If you’re reading this, you’re probably a programmer yourself. I guess you’re looking for some advice that’s a bit more concrete. I hear you. Here are six things that help me overcome perfectionism in programming:</p>
<ul>
<li>
<p><strong>Tooling.</strong> Remember that I want my code to be perfectly formatted? For me, tools like <a href="https://blog.golang.org/go-fmt-your-code">gofmt</a> are a godsend. Never do I have to worry about whitespace or other formatting issues again. Instead, I can turn my attention to more interesting tasks.</p>
</li>
<li>
<p><strong>Pair programming.</strong> I’ve experienced that the desire to finish work increases when working in pairs, particularly if the work seems too hard or too boring or both. (Pair programming <a href="http://wiki.c2.com/?PairProgrammingBenefits">offers many more benefits</a>.)</p>
</li>
<li>
<p><strong>Code reviews.</strong> Besides pair programming, requesting a code review and getting an “LGTM” from my colleagues is another helpful indicator that my changes are good enough.</p>
</li>
<li>
<p><strong>Testing.</strong> For me, perfectionism is closely related to confidence, the confidence in the code I write. <a href="/implementing-semantic-monitoring">Passing tests</a> certainly increase the level of confidence in my work, provided I know the tests themselves are valuable.</p>
</li>
<li>
<p><strong>Spikes.</strong> Rather than losing myself in details by trying to Make It Right™ from the get-go, I prefer creating a spike first. A spike is a simple (and dirty) end-to-end solution to a given problem, which is meant to be thrown away after exploration.</p>
</li>
<li>
<p><strong>Deadlines.</strong> Having a due date, ideally imposed by someone else, helps me overcome perfectionism and procrastination. Going back to writing for a second: to keep on schedule, <em>I had to</em> press the button for publishing the very post you’re reading.</p>
</li>
</ul>
<p>Reading over this list again, it becomes apparent that it’s not only about fighting perfectionism, it’s also a set of <strong>established practices for producing quality software</strong>. I call this a win-win.</p>
<p><em>This piece was inspired by the spot-on post, <a href="https://web.archive.org/web/20171123161254/http://www.terramilitia.com/post/3/terrors-of-perfectionism/">Terrors of perfectionism</a>. Among other things, it helped me realize how often we consider a solution to a problem to be flawed and therefore temporary. Then we end up using it in production until doomsday.</em></p>
Implementing Semantic Monitoring2017-01-25T00:00:00+01:00https://sharpend.io/implementing-semantic-monitoring<p><img src="/assets/images/red-roof.jpg" alt="" /></p>
<p>2016 was an exciting year. I don’t think there was another period in the almost ten years I’ve been in the software industry where I learned so much about web infrastructure. Building and running <a href="https://speakerdeck.com/mlafeldt/a-journey-through-wonderland">Jimdo’s PaaS</a> has taught me a thing or two about reliability, scalability, usability, and other software “-ilities”. Today I want to write about one particular monitoring/testing strategy that has been invaluable to the success of our PaaS.</p>
<h2 id="the-difficulty-of-monitoring-microservices">The difficulty of monitoring microservices</h2>
<p>Our platform is composed of two dozen microservices sitting on top of Amazon Web Services (rumor has it we’re one of the largest users of ECS). When deploying a service via our CLI tool, the API request first hits AWS API Gateway, which will forward it to the corresponding microservice – our Deployer API in this case. The Deployer then enqueues a new job that will be picked up by a worker, which in turn deploys the service with the support of other microservices.</p>
<p>Breaking a system up into smaller services has a lot of benefits, but it also makes it harder to verify that the system as a whole is working correctly. For example, just because the services themselves report as healthy doesn’t necessarily mean the integration points between them are fine too.</p>
<p>To verify that the Deployer works as expected, we execute a number of unit and integration tests as part of the CI pipeline. Besides, all of our services are fronted by a load balancer and backend replicas are replaced automatically at runtime when they become unhealthy. If that doesn’t help, PagerDuty will send us a friendly alert.</p>
<p>So we run some tests before deploying a new version of a microservice into production, and we use health checks to ping the service while it’s doing its job, e.g. processing API requests from impatient users. That’s a common way to validate production services. Unfortunately, it’s also a <em>missed opportunity</em>.</p>
<p>There are several problems with this approach:</p>
<ul>
<li>We run the service-specific test suite only once and stop using it altogether when the service goes into production.</li>
<li>Apart from one-off integration tests, we don’t test the interaction between microservices continually.</li>
<li>Compared to CI tests, health check endpoints are usually dumb, often merely indicating if a service is running at all.</li>
<li>Additional low-level metrics like CPU utilization or response time are useful to pinpoint the cause of trouble, but they won’t give us a holistic view either.</li>
</ul>
<p>There’s obviously a lot of room for improvement here. This is where <a href="https://www.thoughtworks.com/de/radar/techniques/semantic-monitoring">semantic monitoring</a> comes in.</p>
<p><strong>Semantic monitoring combines test execution and realtime monitoring to <em>continuously</em> verify the behavior of applications. It lends itself in particular to validating microservices and how they interact at runtime.</strong></p>
<h2 id="implementing-semantic-monitoring">Implementing semantic monitoring</h2>
<p>How to implement semantic monitoring? The short answer: by feeding the results of end-to-end tests (consumer-driven contracts) into your existing monitoring solution. Those tests typically mimic user actions via fake events, e.g. a synthetic deployment, to ensure that the system behaves semantically (hence the name, semantic monitoring). What follows is a more or less detailed summary of the implementation we’re using at Jimdo.</p>
<p>At the heart of our setup is a set of black box tests. Those tests, which are written in Go, communicate directly with our API, just as users would do. Among other things, we have tests to ensure services and <a href="/the-burden-of-running-systems">periodic jobs</a> can be deployed and deleted successfully, both in staging and in production.</p>
<p>Due to the distributed nature of such systems, most tests boil down to “X should do A, B, C within T minutes”. As one API request can create a chain of downstream calls and events that are handled asynchronously, it’s a good idea to pass along a correlation ID. We output the unique resource IDs created during testing to be able to trace events through our systems should something go wrong.</p>
<p>To execute those tests continuously, we’ve configured a Jenkins job called “System-Tests”, which will run every hour and notify us about any failures in Slack. This, in fact, used to be the whole story for a long time: a couple of black box tests executed by Jenkins and Slack notifications that were easy to miss. That is, until <a href="https://prometheus.io/">Prometheus</a> entered the picture.</p>
<p>At some point last year, we committed to using Prometheus for monitoring all the things from cluster instances to microservices running on our PaaS. And with Prometheus came the <a href="https://github.com/prometheus/pushgateway">Pushgateway</a>, which allows ephemeral and batch jobs to expose metrics in an easy way.</p>
<p><a href="https://gist.github.com/mlafeldt/7fab53f47dd73a8c34f8b95ec444ab79">We created a generic Docker image</a> to push a “freshness” metric to the Pushgateway. This metric – a timestamp plus some labels – can be used to determine whether a job was executed within a certain period or not.</p>
<p>We then added it to the System-Tests Jenkins job to write a freshness metric before and after running the tests:</p>
<figure class="highlight"><pre><code class="language-sh" data-lang="sh">push_build_metrics<span class="o">()</span> <span class="o">{</span>
docker run <span class="nt">-t</span> <span class="nt">--rm</span> <span class="nt">-e</span> <span class="nv">PUSHGATEWAY</span><span class="o">=</span><span class="s2">"</span><span class="nv">$PUSHGATEWAY_ADDR</span><span class="s2">"</span> <span class="se">\</span>
quay.io/jimdo/freshness <span class="nv">job</span><span class="o">=</span>Jenkins <span class="se">\</span>
<span class="nv">name</span><span class="o">=</span>System-Tests <span class="nv">branch</span><span class="o">=</span><span class="nv">$BRANCH</span> <span class="nv">state</span><span class="o">=</span><span class="nv">$1</span>
<span class="o">}</span>
push_build_metrics started
make <span class="nb">test
</span>push_build_metrics success</code></pre></figure>
<p>Afterward, we configured Prometheus to alert us if the job fails three times in a row (at this point, we don’t trust Jenkins enough for this check to trigger PagerDuty at night):</p>
<figure class="highlight"><pre><code class="language-sh" data-lang="sh">ALERT SystemTestsFailing
IF <span class="nb">time</span><span class="o">()</span> - freshness<span class="o">{</span><span class="nv">name</span><span class="o">=</span><span class="s2">"System-Tests"</span>,branch<span class="o">=</span><span class="s2">"master"</span>,state<span class="o">=</span><span class="s2">"success"</span><span class="o">}</span> <span class="o">></span> 3.5<span class="k">*</span>60<span class="k">*</span>60
LABELS <span class="o">{</span>
wonderland_env <span class="o">=</span> <span class="s2">"prod"</span>
<span class="o">}</span>
ANNOTATIONS <span class="o">{</span>
summary <span class="o">=</span> <span class="s2">"System-Tests Jenkins job wasn't successful for 3 hourly runs in a row."</span>
<span class="o">}</span></code></pre></figure>
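<p>Note that this is the pre-2.0 rule syntax. Since Prometheus 2.0, alerting rules live in YAML rule files; a sketch of the same alert in today’s format (assuming the same metric and labels) would look like this:</p>

```yaml
groups:
  - name: system-tests
    rules:
      - alert: SystemTestsFailing
        expr: time() - freshness{name="System-Tests",branch="master",state="success"} > 3.5 * 60 * 60
        labels:
          wonderland_env: prod
        annotations:
          summary: "System-Tests Jenkins job wasn't successful for 3 hourly runs in a row."
```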
<p>And that’s the story of how we ended up implementing semantic monitoring on the cheap, based on building blocks already in place. All of the mentioned components for testing and monitoring can be used on their own. But by combining them, we can merge two separate but important verification techniques to monitor not only our microservices but also the integration points between them.</p>
<p>(If you want to learn more about monitoring microservices, I highly recommend reading <a href="http://samnewman.io/books/building_microservices/">Building Microservices</a> by Sam Newman.)</p>
Using Chaos Monkey whenever you feel like it2017-01-11T00:00:00+01:00https://sharpend.io/using-chaos-monkey-whenever-you-feel-like-it<p>The main idea of <a href="/chaos-engineering-101">Chaos Engineering</a>, we recall, is to trigger failures proactively in a controlled way to gain confidence that our production systems can withstand those failures. Chaos Engineering enables us to verify that our systems behave as we expect – and to fix them if they don’t.</p>
<p><a href="/chaos-monkey-for-fun-and-profit">In a previous article</a>, I showed you how to use Chaos Monkey for automating your first chaos experiment. For this purpose, I’ve created a customizable <a href="https://github.com/mlafeldt/docker-simianarmy">Docker image</a> of the Simian Army (which Chaos Monkey is part of) as a solid foundation for running experiments.</p>
<p>Netflix originally designed Chaos Monkey to terminate EC2 instances randomly during business hours. To that end, the tool comes with a good deal of configuration settings to control frequency, probability, type of terminations, and a lot more.</p>
<p>However, you <em>don’t</em> need to automate experiments to run continuously in order to benefit from Chaos Engineering (but doing so <em>may</em> further increase confidence in your systems). At Jimdo, we don’t use Chaos Monkey in a conventional way. In fact, the service is idle most of the time, waiting for instructions from us.</p>
<h2 id="on-demand-termination">On-demand termination</h2>
<p>Many people don’t know that, in addition to scheduled instance terminations, Chaos Monkey also supports killing instances <em>on demand</em> via its built-in <a href="https://github.com/Netflix/SimianArmy/wiki/REST">REST API</a>. So instead of inflicting chaos on your servers at random, e.g. once an hour between 10am and 5pm, it’s on you to decide if and when the monkey will perform a destructive action, without having to follow any imposed schedule.</p>
<p>Want to give it a try? With the mentioned Docker image, setup is a breeze. This single command will start a Chaos Monkey that is ready to take API requests on port 8080 (don’t worry about scheduled terminations – they’re deactivated in this example):</p>
<figure class="highlight"><pre><code class="language-sh" data-lang="sh">docker run <span class="nt">-it</span> <span class="nt">--rm</span> <span class="nt">-p</span> 8080:8080 <span class="se">\</span>
<span class="nt">-e</span> <span class="nv">SIMIANARMY_CLIENT_AWS_ACCOUNTKEY</span><span class="o">=</span><span class="nv">$AWS_ACCESS_KEY_ID</span> <span class="se">\</span>
<span class="nt">-e</span> <span class="nv">SIMIANARMY_CLIENT_AWS_SECRETKEY</span><span class="o">=</span><span class="nv">$AWS_SECRET_ACCESS_KEY</span> <span class="se">\</span>
<span class="nt">-e</span> <span class="nv">SIMIANARMY_CLIENT_AWS_REGION</span><span class="o">=</span><span class="nv">$AWS_REGION</span> <span class="se">\</span>
<span class="nt">-e</span> <span class="nv">SIMIANARMY_CHAOS_LEASHED</span><span class="o">=</span><span class="nb">false</span> <span class="se">\</span>
<span class="nt">-e</span> <span class="nv">SIMIANARMY_CHAOS_ASG_ENABLED</span><span class="o">=</span><span class="nb">false</span> <span class="se">\</span>
<span class="nt">-e</span> <span class="nv">SIMIANARMY_CHAOS_TERMINATEONDEMAND_ENABLED</span><span class="o">=</span><span class="nb">true</span> <span class="se">\</span>
mlafeldt/simianarmy</code></pre></figure>
<p>Afterward, you can talk to the API to trigger and retrieve instance terminations, or “chaos events”, as they are called here. For example, this will give you a list of past chaos events:</p>
<figure class="highlight"><pre><code class="language-sh" data-lang="sh">curl http://<span class="nv">$DOCKER_HOST_IP</span>:8080/simianarmy/api/v1/chaos</code></pre></figure>
<p>Triggering failures via the API is a bit more involved, and I won’t go into the details here. Instead, I’d like to promote a handy command-line tool I’ve written for that purpose.</p>
<h2 id="cli-goodness">CLI goodness</h2>
<p>Say hello to the <strong><a href="https://github.com/mlafeldt/chaosmonkey">chaosmonkey CLI tool</a></strong>. (Not to be confused with the same-named binary that comes with <a href="https://github.com/Netflix/chaosmonkey">Chaos Monkey v2</a>. I had the idea first, I swear.)</p>
<p>Originally developed for controlled failure injection during <a href="/chaos-engineering-101">GameDays at Jimdo</a>, the tool could also be described as “Chaos Monkey whenever you feel like it”.</p>
<p>Let’s use it to send some commands to the Chaos Monkey we just started with <code class="language-plaintext highlighter-rouge">docker run</code>. First, and most important, here’s how to trigger a new chaos event. This will block all network access to a random instance of the given EC2 auto scaling group:</p>
<figure class="highlight"><pre><code class="language-sh" data-lang="sh">chaosmonkey <span class="nt">-endpoint</span> http://<span class="nv">$DOCKER_HOST_IP</span>:8080 <span class="se">\</span>
<span class="nt">-group</span> ExampleAutoScalingGroup <span class="se">\</span>
<span class="nt">-strategy</span> BlockAllNetworkTraffic</code></pre></figure>
<p>Sometimes it’s also convenient to terminate multiple instances of an auto scaling group, e.g. to test under what conditions a cluster loses quorum. In this example, we’re going to shut down three cluster instances at intervals of 30 seconds:</p>
<figure class="highlight"><pre><code class="language-sh" data-lang="sh">chaosmonkey <span class="nt">-endpoint</span> http://<span class="nv">$DOCKER_HOST_IP</span>:8080 <span class="se">\</span>
<span class="nt">-group</span> ExampleAutoScalingGroup <span class="se">\</span>
<span class="nt">-strategy</span> ShutdownInstance <span class="se">\</span>
<span class="nt">-count</span> 3 <span class="nt">-interval</span> 30s</code></pre></figure>
<p>It’s also straightforward to list past chaos events as Chaos Monkey keeps track of everything:</p>
<figure class="highlight"><pre><code class="language-sh" data-lang="sh">chaosmonkey <span class="nt">-endpoint</span> http://<span class="nv">$DOCKER_HOST_IP</span>:8080</code></pre></figure>
<p>There are a couple more features not shown here for brevity. The AWS integration, for instance, allows you to list the auto scaling groups for a given AWS account and to wipe Chaos Monkey’s state if you want to start over. I encourage you to <a href="https://github.com/mlafeldt/chaosmonkey#readme">read the documentation</a> for further details, including installation instructions.</p>
<h2 id="chaos-monkey-at-jimdo">Chaos Monkey at Jimdo</h2>
<p>Now that I told you about the REST API and the CLI tool, I want to share how we deploy and run Chaos Monkey at Jimdo. Here are the facts:</p>
<ul>
<li>Chaos Monkey is just another service running on <a href="/blog/a-journey-through-wonderland/">Jimdo’s PaaS</a></li>
<li>For deployment, we use the <a href="https://github.com/mlafeldt/docker-simianarmy">Docker image</a> mentioned above</li>
<li>We run one monkey in production and one in <a href="/from-zero-to-staging-and-back">our staging environment</a></li>
<li>We use <code class="language-plaintext highlighter-rouge">:8080/simianarmy/</code> for the HTTP health check</li>
<li>An Nginx auth proxy protects the API endpoint (which is public on our platform)</li>
<li>For high availability, we deploy two service replicas behind an ELB (this works because we only use the REST API, no scheduled terminations)</li>
<li>We get <a href="https://github.com/mlafeldt/docker-simianarmy/blob/master/docs/notifications.md">Slack notifications</a> for all terminations</li>
<li>The <code class="language-plaintext highlighter-rouge">chaosmonkey</code> tool is installed on our bastion hosts, preconfigured and ready to use for chaos experiments</li>
</ul>
<p>That’s about it.</p>
<p>At this point, you may wonder whether we really need all this complexity to kill some EC2 instances once in a while. Maybe not. It depends on your infrastructure and the type of chaos testing you’re doing. For us, this setup makes sense given the PaaS we have in place and the kind of GameDay exercises we’ve been performing on a regular basis. At the same time, I wouldn’t recommend replacing Chaos Monkey and its many off-the-shelf features with a shell script. That’s just not sustainable in the long run.</p>
<p><strong>Resilience testing should be second nature to engineers. It’s something we should be doing more often – without fear – and open source tools like Chaos Monkey facilitate this goal. I like the idea of <a href="/bring-your-tools-with-you">having it at my disposal</a> whenever I need it.</strong></p>
The Burden of Running Systems2016-11-30T00:00:00+01:00https://sharpend.io/the-burden-of-running-systems<p><img src="/assets/images/burden.jpg" alt="" /></p>
<p>In <a href="/the-power-of-less-code">The Power of Less Code</a> I wrote:</p>
<blockquote>
<p>A couple of weeks ago, we finished the migration from Dkron to Nomad for running all periodic batch jobs of our PaaS. […] And yet there’s a much better solution we would have preferred: not [operating Nomad] in the first place, but rather outsource the task of running periodic jobs to a hosted service provider with a paid support plan.</p>
</blockquote>
<p><a href="https://medium.com/@hakibenita/very-important-idea-and-well-written-thanks-2b56e9093b2b">One reader replied</a> that the solution to not writing code wasn’t to load it off to someone else, e.g. to a SaaS vendor, as dependencies would be just as bad.</p>
<p>That’s a good point. While I still believe in every single word I wrote, this comment made me realize that my original statements lacked depth and deserve more explanation from my side.</p>
<h2 id="youre-not-paid-to-write-code">You’re not paid to write code</h2>
<p>In his excellent post, <a href="http://bravenewgeek.com/you-are-not-paid-to-write-code/">You’re not paid to write code</a>, Tyler Treat takes the same line as I do: writing code should always be the last resort and never the first option to add value to the business. However, Tyler looks at the topic from a different angle, drawing from systems theory. Some of his points are worth quoting here (emphasis mine):</p>
<blockquote>
<p>[John] Gall’s Fundamental Theorem of Systems is that <strong>new systems mean new problems</strong>. I think the same can safely be said of code – more code, more problems. Do it without a new system if you can.</p>
</blockquote>
<blockquote>
<p>Every time you write code or <strong>introduce third-party services</strong>, you are introducing the possibility of failure into your system.</p>
</blockquote>
<blockquote>
<p><strong>Systems are seductive</strong> and engineers in particular seem to have a predisposition for them. They promise to do a job faster, better, and more easily than you could do it by yourself or with a less specialized system.</p>
</blockquote>
<blockquote>
<p><strong>Almost anything is easier to get into than out of</strong>. When we introduce new systems, new tools, new lines of code, we’re with them for the long haul. It’s like a baby that doesn’t grow up.</p>
</blockquote>
<p>Sharp observations (and a nudge for me to learn more about <a href="https://en.wikipedia.org/wiki/Systemantics">Gall’s work</a>).</p>
<p>In a nutshell, new systems mean new problems. <a href="/how-complex-web-systems-fail-part-2">Change introduces new forms of failure</a>. We should therefore think twice before writing code or adding third-party services (which are systems too and require developing integration code).</p>
<h2 id="questions-questions-questions">Questions, questions, questions</h2>
<p>This brings me back to Nomad and the general question: would using a hosted service – assuming there is one – indeed be better than operating such a system ourselves?</p>
<p>The answer is, of course, that <em>it depends</em> on the circumstances. It’s not always better to use SaaS products just as it’s not always better to operate systems in-house. There are no absolutes – you must weigh the tradeoffs.</p>
<p>At this point, you should ask yourself a lot of questions:</p>
<ul>
<li>
<p>Does it make sense to pay another company for providing service X? After all, it’s <em>their</em> core business, not yours. It’s safe to assume they’re much more skilled in technology X.</p>
</li>
<li>
<p>Alternatively, would it be a good idea to operate the system in-house and take on the burden of automation, monitoring, backups, bug fixing, updates, etc.? What other potential gains do you lose when choosing this alternative (opportunity cost)?</p>
</li>
<li>
<p>Do you avoid using already existing software and tend towards reinventing the wheel for the wrong reasons? Beware of the <a href="https://en.wikipedia.org/wiki/Not_invented_here">NIH syndrome</a>!</p>
</li>
<li>
<p>Are you afraid to give up control and become dependent on a vendor? Do you trust the other company to do the right thing? Trust is paramount.</p>
</li>
<li>
<p>What services (if any) are there on the market? Is there a good fit in terms of features, licensing, security, SLA, support, etc.? Spend some time evaluating the available options. Do a quick spike if a product sounds promising. <a href="/chaos-engineering-101">Run chaos experiments</a> to verify your assumptions.</p>
</li>
<li>
<p>What are the total costs of using an existing service compared to hosting it all yourself?</p>
</li>
<li>
<p>Could it be that you <a href="http://danluu.com/sounds-easy/">underestimate the time</a> it takes to build and run a similar system? Features are often much more complicated than we realize.</p>
</li>
<li>
<p>Will it be difficult to integrate with the external service? (Applying reliability design patterns such as timeouts and exponential backoff is an article of its own.) Conversely, what about getting rid of the dependency again? (The software is only done when it’s <em>deleted</em>.)</p>
</li>
</ul>
<p>As you can tell, I’m biased. I prefer using a hosted service provider because that’s how we do things at Jimdo. Operating a system is always our plan B, never plan A. This strategy has been working very well so far.</p>
<h2 id="solving-the-right-problem">Solving the right problem</h2>
<p>Let’s wrap this up with a little reminder. Before deciding whether to run a system or not, I want you to step back for a moment and think again <a href="http://wiki.c2.com/?YouArentGonnaNeedIt">if you actually need more software</a> to solve a problem or if you can somehow do without it.</p>
<p>In <em>Rework</em>, the business book by Jason Fried and David Heinemeier Hansson that is unlike any other book I know, they write:</p>
<blockquote>
<p>Small is not just a stepping-stone. Small is a great destination in itself. […] expenses, rent, <strong>IT infrastructure</strong>, furniture, etc. These things don’t just happen to you. You decide whether or not to take them on.</p>
</blockquote>
<p>We only started looking into running periodic batch jobs on our platform when other development teams kept asking for it. We’ve had the <em>need</em> for such a system, so we took it on.</p>
<p><strong>At the end of the day, it’s not only about solving the problem right, but it’s also about solving the right problem.</strong></p>
Always leave the campground cleaner than you found it2016-11-16T00:00:00+01:00https://sharpend.io/always-leave-the-campground-cleaner-than-you-found-it<p><img src="/assets/images/campground.jpg" alt="" /></p>
<p>We’re currently joining another infrastructure team at Jimdo (let’s call it a restructuring measure). Among other things, this process involves merging the digital Kanban boards of both teams into one – what a great opportunity to go through our backlog and recklessly close tickets that are duplicates, done, or obsolete for one reason or another.</p>
<p>While studying our backlog, I noticed a couple of things:</p>
<ol>
<li>We have too many tickets in our backlog (120+ before the cleanup).</li>
<li>Ideas are worth nothing unless executed.</li>
<li>Things we thought were important turned out <a href="https://m.signalvnoise.com/constraints-only-work-if-they-hurt/">not to matter at all</a>.</li>
</ol>
<p>More specifically, we never got around to completing some code refactoring tasks, even though several of these tickets are labeled “easy pick”.</p>
<p>Why is that? Is it a good idea to postpone refactorings? And if not, what’s the alternative?</p>
<p>Before trying to answer these questions, let’s look at the main reason we need to refactor our code in the first place.</p>
<h2 id="broken-windows">Broken windows</h2>
<p>In “The Pragmatic Programmer”, there’s a chapter called <a href="https://pragprog.com/the-pragmatic-programmer/extracts/software-entropy">Software Entropy</a>. In a nutshell, the chapter makes the point that entropy – the amount of disorder in a system, which tends to a maximum – is the reason that <em>all</em> projects run the risk of decaying during their lifetime. Yet, there are teams that “successfully fight nature’s tendency toward disorder” and manage to get software rot – technical debt – under control.</p>
<p>But how do they achieve this?</p>
<p>The key realization here is that “<strong>neglect</strong> accelerates the rot faster than any other factor”. Put another way: living with bad code and poor design decisions is likely to lead to even more bad code and poor design decisions.</p>
<p>This understanding is at the heart of the <a href="https://en.wikipedia.org/wiki/Broken_windows_theory">Broken Windows Theory</a>:</p>
<blockquote>
<p>The broken windows theory is a criminological theory of the norm-setting and signaling effect of urban disorder and vandalism on additional crime and anti-social behavior. The theory states that maintaining and monitoring urban environments to prevent small crimes such as vandalism, public drinking, and toll-jumping helps to create an atmosphere of order and lawfulness, thereby preventing more serious crimes from happening.</p>
</blockquote>
<p>The nice thing about this theory is that it is not only true for crimes such as vandalism, it’s true for software development as well.</p>
<p>Once windows start breaking and nobody cares, more serious crimes will follow. The moment you accept substandard code and inadequate designs, your systems begin to deteriorate (and to slow development down as a result).</p>
<p>It’s a matter of mindset: the more broken windows in your codebase, the more likely people are to think “this code is crap anyway”, and the more “crimes” they’re going to commit. On the other hand, if the code is of high quality, people will probably take extra care not to mess things up.</p>
<p>Therefore, <strong>don’t live with broken windows.</strong></p>
<h2 id="refactor-early-refactor-often">Refactor early, refactor often</h2>
<p>Addressing the issue of broken windows – or in our case, software rot – is the part where I disagree with “The Pragmatic Programmer” to some extent.</p>
<p>I do agree that you should <strong>refactor early and refactor often</strong>. Refactor too late and your productivity will decline until you decide to do something about it, which might be a massive undertaking at that point. So it’s generally a good idea to fix flaws as soon as you discover them – when the cost of change is lowest.</p>
<p>The other side of the coin: refactor too early and you risk making rash design decisions based on, well, guessing. In particular, there’s the danger of <a href="http://wiki.c2.com/?PrematureOptimization">“optimizing before we know that we need to”</a> aka premature optimization.</p>
<p>I take issue with the following paragraph from the book though:</p>
<blockquote>
<p>If there is insufficient time to fix [a broken window] properly, then board it up. Perhaps you can comment out the offending code, or display a “Not Implemented” message, or substitute dummy data instead. Take some action to prevent further damage and to show that you’re on top of the situation.</p>
</blockquote>
<p>Recommending to comment out code is questionable advice at best. Code that has no purpose <a href="/the-power-of-less-code">should be killed</a> to not become a source of distraction, confusion, and communication overhead – another broken window, if you will.</p>
<p>The book also suggests putting refactoring tasks on the schedule if you can’t do them immediately. However, I’ve only seen this work in practice when the tasks are tackled as soon as possible, say, within a week. Our backlog is one, albeit small, example of this going wrong.</p>
<h2 id="incremental-refactoring">Incremental refactoring</h2>
<p>I recently read an <a href="http://ronjeffries.com/xprog/articles/refactoring-not-on-the-backlog/">excellent blog post</a> by Ron Jeffries. In it, he adds to the uncomfortable feeling I had about refactoring tickets. He argues that putting them on the backlog is indeed a <em>bad idea</em>, especially when you have a lot of refactoring to do to get back to a “clean field”.</p>
<p>Ron writes:</p>
<blockquote>
<p>We took many weeks to get the code this bad, and we’ll surely not get that many weeks to fix it. […] A big refactoring session is hard to sell, and if sold, it returns less than we hoped, after a long delay.</p>
</blockquote>
<p>What he suggests instead is to <strong>improve the code where we work</strong>:</p>
<blockquote>
<p>We take the next feature that we are asked to build, and instead of detouring around all the weeds and bushes, we take the time to clear a path through some of them. Maybe we detour around others. We improve the code where we work, and ignore the code where we don’t have to work. We get a nice clean path for some of our work. Odds are, we’ll visit this place again: that’s how software development works.</p>
</blockquote>
<p>This type of incremental refactoring reminds me of yet another programming wisdom…</p>
<h2 id="the-boy-scout-rule">The Boy Scout Rule</h2>
<p><a href="https://www.oreilly.com/library/view/97-things-every/9780596809515/ch08.html">The Boy Scout Rule</a> says:</p>
<blockquote>
<p>Always leave the campground cleaner than you found it.</p>
</blockquote>
<p>No matter who’s responsible for the mess, try to improve the environment for the next group.</p>
<p>I hope that now, at the end of this article, you can see how following that simple rule can make a huge difference to the evolution of our production systems – and to the teams building and maintaining them <em>together</em>.</p>
<p><strong>“Boy Scouting” and an aversion to broken windows have been crucial in keeping the technical debt of our PaaS under control – even more so now that I know that refactoring tickets have no place in our backlog.</strong></p>
From Zero to Staging and Back2016-11-02T00:00:00+01:00https://sharpend.io/from-zero-to-staging-and-back<p>When I joined Jimdo’s Werkzeugschmiede team in August 2015, the first major task I took on was building a <em>staging environment</em> for <a href="https://speakerdeck.com/mlafeldt/a-journey-through-wonderland">Wonderland</a>, our in-house PaaS for microservices. We’re an internal service provider offering other teams a platform for running production services. As such, we take a great interest in the uptime of our infrastructure.</p>
<p>Since the very beginning of Wonderland, we knew that an isolated test environment matching production as closely as possible would give us more confidence to experiment, fix bugs, and implement new features. And indeed, having a pre-production environment for testing before deploying to production turned out to be invaluable – a safety net making the whole deployment process a lot less scary.</p>
<p>What follows is a detailed account of Wonderland’s staging environment: what it looks like, how we built it, and what we’ve learned since then.</p>
<h2 id="pair-programming">Pair programming</h2>
<p>Early on we decided to create the much-needed staging environment by pair programming. For me, working on a setup that is supposed to mirror production, and doing this with a coworker who knows Wonderland inside out, was an excellent way to learn about the platform and its different components.</p>
<p>I was able to ask questions when something was unclear and, at the same time, contribute my own ideas whenever I felt like it. This way, we created <a href="/fast-feedback-is-everything">a fast feedback loop</a> that not only helped me find my way through Wonderland, but also learn more about its creators – my new colleagues – and their modus operandi.</p>
<p>Taken all together, I highly recommend pairing for onboarding new team members, even if you don’t have the luxury of building a production-like environment from scratch.</p>
<h2 id="one-account-per-environment">One account per environment</h2>
<p>Wonderland’s infrastructure runs on AWS. Rather than using a single VPC for both production and staging, we agreed to operate a dedicated AWS account per environment. This setup effectively isolates environments from one another. Most importantly, it prevents changes done in staging – whether intentionally or by mistake – from affecting production.</p>
<p>Other advantages of having one AWS account per environment include:</p>
<ul>
<li>An easier understanding of the (cloned) infrastructure</li>
<li>Simpler automation code with fewer environment-specific exceptions</li>
<li>Finer access control on VPC level (no more fiddling around with subnets)</li>
<li>No naming collisions of non-VPC resources</li>
<li>Effortless tracking of costs per environment</li>
<li>Ability to opt-in to AWS features outside of prod first</li>
</ul>
<p>On the downside, working with multiple AWS accounts makes credential management a bit more involved. To make up for this, we’ve been using <a href="https://github.com/Luzifer/awsenv">awsenv</a> and LastPass to quickly switch between accounts. (Of course, it’s all for naught <a href="/writing-your-first-postmortem">if one forgets to use those tools</a>…)</p>
<p>We actually took this separation one step further and also created additional “stage” accounts for <em>all</em> hosted services we rely on every day, such as Papertrail and Quay. The overhead has been worth it.</p>
<h2 id="automate-all-the-things">Automate all the things</h2>
<p>We spent a lot of time automating the setup of our staging environment. We managed to get to the point where we could run <code class="language-plaintext highlighter-rouge">make stage</code> in our <code class="language-plaintext highlighter-rouge">github.com/Jimdo/wonderland</code> repository and Ansible would take care of everything, from bootstrapping our ECS cluster to provisioning Jenkins – our central state enforcer – to starting essential microservices of our PaaS.</p>
<p>To achieve that, we took the existing Ansible playbooks and CloudFormation templates for production and adapted them for use in staging. This meant we had to:</p>
<ul>
<li>Replace hardcoded parameters like URLs and secrets</li>
<li>Implement missing automation steps (some prod resources had been clicked)</li>
<li>Address any issues that came up along the way (two words: eventual consistency)</li>
</ul>
<p>It was also at this point that we decided to leverage standard make targets like “stage” or “prod” across projects. The following paragraph from <a href="https://medium.com/@jlouis666/how-to-build-stable-systems-6fe9dcf32fc4">How to build stable systems</a> sums it up very well:</p>
<blockquote>
<p>All projects, language notwithstanding, use the same tool for configuring and building themselves: make(1). Make can call into the given languages choice of build tool, but the common language for continuous integration and deployment is make(1). Use the same make targets for all projects in the organization. This makes it easy to onboard new people and they can just replay the work the CI tool is doing. It also “documents” how to build the software.</p>
</blockquote>
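<p>As a toy illustration of that convention, a thin entry point that looks identical across projects might be sketched like this (the function, paths, and playbook names are made up, not our actual setup):</p>

```shell
# Hypothetical sketch of one uniform entry point per environment, in the
# spirit of the shared "make stage" / "make prod" targets. All names are made up.
deploy() {
  env="${1:?usage: deploy stage|prod}"
  case "$env" in
    stage|prod) echo "would run: ansible-playbook -i inventory/$env site.yml" ;;
    *) echo "unknown environment: $env" >&2; return 1 ;;
  esac
}

deploy stage
```

The point is less the implementation than the interface: CI, new team members, and old hands all invoke the same target names everywhere.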
<p><strong>Even after automating all the things, there’s only one way to find out if our code works – and continues to work – as expected: creating staging from scratch, again and again.</strong></p>
<h2 id="destroy-all-the-things">Destroy all the things</h2>
<p>In addition to <code class="language-plaintext highlighter-rouge">make stage</code>, we also implemented the inverse operation, <code class="language-plaintext highlighter-rouge">make destroy-stage</code>, to deprovision staging completely. This process boils down to deleting all CloudFormation stacks and other resources created by Ansible – in reverse order of creation.</p>
<p>Tearing down CloudFormation stacks is usually straightforward. However, we sometimes have to shell out to the AWS CLI because CloudFormation is often slow to gain support for new AWS resources, which can leave us with dependencies that are hard to remove otherwise. And even when Ansible does provide a particular AWS module, there’s no guarantee that the “absent” state is implemented correctly, making the dreaded CLI our only option.</p>
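<p>A deletion loop in that spirit – reverse order of creation – might look like this sketch (stack names are made up, and the real AWS calls are left commented out):</p>

```shell
# Hypothetical sketch: tear down stacks in reverse order of creation.
reverse_lines() { sed '1!G;h;$!d'; }  # portable stand-in for tac(1)

# Stacks in creation order, oldest first; names are illustrative.
creation_order="network
cluster
services"

printf '%s\n' "$creation_order" | reverse_lines | while read -r stack; do
  echo "would run: aws cloudformation delete-stack --stack-name $stack"
  # aws cloudformation delete-stack --stack-name "$stack"
  # aws cloudformation wait stack-delete-complete --stack-name "$stack"
done
```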
<p>Once <code class="language-plaintext highlighter-rouge">make destroy-stage</code> did the trick, we were able to bootstrap staging from scratch, which in turn allowed us to verify that our infrastructure code does the right thing when starting from a blank slate.</p>
<p>To further automate things, we created a Jenkins job in prod to destroy staging every Friday night and another one to rebuild it on Monday morning.</p>
<h2 id="drawbacks-and-improvements">Drawbacks and improvements</h2>
<p>While I’m happy with what we’ve achieved so far, there’s still room for improvement. Here are some of the challenges we’ve seen:</p>
<ul>
<li>
<p>Having a single staging environment for all four members of our team means that we need to coordinate testing in a few cases. Nobody likes to wait, especially not me. One solution would be to split up those Jenkins jobs and/or the backend systems they target.</p>
</li>
<li>
<p>It currently takes at least 5 hours to rebuild staging. We’ve already outsourced the building and storage of Docker images to Travis and Quay. We could further accelerate the process, for example by baking AMIs of our cluster instances. Again, nobody likes to wait.</p>
</li>
<li>
<p>Unsurprisingly, we experienced a couple of problems caused by broken/missing external dependencies. One that comes to mind is <a href="https://github.com/docker/docker/issues/23203">Docker’s APT repository</a>. Yes, mirrors would certainly help. We started using GitHub releases for hosting artifacts whenever possible.</p>
</li>
<li>
<p><a href="https://queue.acm.org/detail.cfm?id=2353017">Allspaw is right when he says</a> that “[testing outside production] is incomplete because some behaviors can be seen only in production, no matter how identical a staging environment can be made”. We’ve learned this the hard way. Staging is a safety measure – no more, no less.</p>
</li>
<li>
<p>To be honest, the automated weekly rebuild of staging has caused us a lot of trouble and extra work lately. Sometimes the Jenkins job fails after running into API limits. Other times the job orchestration goes wrong due to network issues (or because Jenkins happens to have a bad day?). In any case, we need to make the process more reliable again.</p>
</li>
</ul>
<p>For more on this topic, I recommend reading my other post: <a href="/if-it-hurts-do-it-more-often">If it hurts, do it more often</a>.</p>
The Myth of the Root Cause: How Complex Web Systems Fail (Scalyr) · 2016-10-20 · https://sharpend.io/the-myth-of-the-root-cause

The Power of Less Code · 2016-10-19 · https://sharpend.io/the-power-of-less-code

<p><img src="/assets/images/garbage-truck.jpg" alt="" /></p>
<p>These days, I spend a lot of time automating, debugging, and fixing the various components that make up Jimdo’s internal PaaS, which ultimately serves the 15+ million websites of our customers.</p>
<p>My team knows that <em>reliable</em> processes are essential. To that end, we continuously make our infrastructure code more robust, tune our monitoring and alerting setup, and <a href="/chaos-engineering-101">run GameDay exercises</a> on a regular basis.</p>
<p>We’re the team that has been building and running the PaaS – <em>we own it</em> – and this is unlikely to change soon. For that reason, we strive to keep the platform as simple as possible, while being aware that a certain amount of complexity is necessary for our systems to do anything useful.</p>
<p>Of course, we still do enjoy creating new things from time to time; we’re programmers, after all. For example, we love to build new automation tools and add service features to improve observability and scalability.</p>
<p>On the other hand – and this is the crucial point – we also care a lot about doing the opposite:</p>
<ul>
<li>Deleting superfluous code</li>
<li>Removing features</li>
<li>Deleting documentation</li>
<li>Closing stale pull requests</li>
<li>Throwing away Git branches</li>
<li>Destroying cloud resources</li>
<li>Decommissioning projects</li>
</ul>
<p><strong>In other words, we try hard to get rid of anything that isn’t needed.</strong></p>
<p>I know that more often than not, reality looks somewhat different:</p>
<ul>
<li>Projects are on a tight schedule and cleaning up is not considered a priority.</li>
<li>The “DevOps department” has lost sight of server costs and no one has an eye on unused cloud resources.</li>
<li>Developers, in general, are more keen to work on new features, whereas deleting unfinished experiments isn’t half as much fun.</li>
</ul>
<p>Sounds familiar, doesn’t it?</p>
<p>While the benefits of terminating unused cloud servers should be obvious to anyone who’s paying the bills, it might not be so clear otherwise. What exactly is so bad about developing more features, more tools, and therefore more code? Why bother deleting anything?</p>
<p><strong>The fact of the matter is that code is a liability. You and your teammates are responsible for each and every line of code you produce.</strong></p>
<p>As Jeff Atwood put it so eloquently in his popular blog post, <a href="http://blog.codinghorror.com/the-best-code-is-no-code-at-all/">The Best Code is No Code At All</a>:</p>
<blockquote>
<p>Every new line of code you willingly bring into the world is code that has to be debugged, code that has to be read and understood, code that has to be supported. Every time you write new code, you should do so reluctantly, under duress, because you completely exhausted all your other options.</p>
</blockquote>
<p>On top of that, the <a href="http://shop.oreilly.com/product/0636920030355.do">Lean Enterprise</a> book offers these valuable insights:</p>
<blockquote>
<p>At first, we should propose solutions that don’t involve writing code […] Software development should always be a last resort, because of the cost and complexity of building and maintaining software.</p>
</blockquote>
<blockquote>
<p>Only spend time and effort on test automation for products or features once they have been validated. Test automation for experiments is wasteful.</p>
</blockquote>
<blockquote>
<p>Our most productive people are those that find ingenious ways to avoid writing any code at all.</p>
</blockquote>
<p>Less code means less complexity, which means fewer bugs, which means fewer unexpected outcomes in production. Remember: <a href="/simplicity-a-prerequisite-for-reliability">simplicity is a prerequisite for reliability</a>. The more complex a system, the more difficult it is to build a mental model of the system, and the harder it becomes to operate and debug it.</p>
<p>Code that has no purpose is a major source of distraction, confusion, and communication overhead (<em>“Hey Mathias, what’s the point of this function parameter we don’t use anywhere?”</em>). It’s poor practice to comment out unused code, or worse, to gate it with a feature flag. Today’s version control systems make it easy to revert any changes; there’s no reason not to remove dead code and other bloat, such as outdated documentation and pull requests that haven’t been updated in months. In the wise words of Kent Beck: <a href="https://twitter.com/KentBeck/status/525719404643094528">complete it or delete it</a>.</p>
<p>(Just to be clear: it’s totally possible to produce fewer lines of code by writing clever code. <a href="http://www.codethinked.com/dont-be-clever">Don’t be clever either</a>. It will make the code even harder to understand.)</p>
<p>Deleting many lines of code – sometimes hundreds or thousands at once – is indeed extremely satisfying. It’s a worthwhile investment of time, one that comes very close to the joy of building, at least for me.</p>
<p>A couple of weeks ago, we finished the migration from Dkron to <a href="https://www.nomadproject.io/">Nomad</a> for running all periodic batch jobs of our PaaS. It was a lengthy migration to say the least (operating Nomad in a highly-available fashion in AWS isn’t trivial). So I did one particular thing to keep me motivated: I created a pull request in which I prepared all changes to our infrastructure code required to decommission Dkron and its dependencies – weeks before <a href="https://twitter.com/mlafeldt/status/778971852366635008">actually pulling the plug</a> on the stack and throwing away the 1000 lines of code for provisioning it.</p>
<p>Believe it or not, but this pull request – the prospect of reducing the overall complexity of our infrastructure – kept me excited about the project until the very end.</p>
<p>And yet there’s a much better solution we would have preferred: <em>not</em> doing all this work in the first place, but rather outsourcing the task of running periodic jobs to a hosted service provider with a paid support plan. Unfortunately, there’s no Nomad Enterprise (or something comparable that fits our needs) yet, so for now we’re left with the operational burden.</p>
<p>I hope this will change because <strong>the best code is no code at all</strong>.</p>
<p>Let’s acknowledge this fact by saying “no” to code more often than “yes”.</p>
On Finding Root Causes · 2016-10-05 · https://sharpend.io/on-finding-root-causes

<p><img src="/assets/images/hay.jpg" alt="" /></p>
<p><a href="/writing-your-first-postmortem">In my previous article</a> I introduced you to postmortems – what they are, why you should conduct them, and how to get started writing your own postmortem documents.</p>
<p>As a quick recap, a postmortem is a written record of an incident documenting its impact, what caused it, the actions taken to mitigate or fix it, and how to prevent it from happening again. In a broader sense, postmortems are a great tool for a company or organization to learn from failure.</p>
<p>One of the big questions a postmortem has to address is: What has caused the incident? – What’s the reason the system failed the way it did?</p>
<p>At first glance, finding the <em>root cause</em> – the initiating cause that led to an outage or degradation in performance – seems to be the rational thing to do. For system owners, knowing <em>who</em> or <em>what</em> is responsible for an incident appears to be a desirable goal. Otherwise, how else should they implement appropriate countermeasures?</p>
<p><strong>In reality, however, trying to attribute an incident to a root cause in hindsight is not only impossible – it is fundamentally wrong.</strong></p>
<p>This and most of the lessons that follow have their origin in Richard Cook’s paper “How Complex Systems Fail”. I already devoted <a href="/how-complex-web-systems-fail-part-1">a two-part series</a> to his seminal work, but there’s so much more to learn from it, especially when it comes to postmortems. I think you’ll agree.</p>
<h2 id="there-is-no-single-root-cause">There is no single root cause</h2>
<p>In complex systems, such as web systems, there is no root cause. Single point failures alone are not enough to trigger an incident. Instead, incidents require <em>multiple contributors</em>, <a href="http://www.kitchensoap.com/2012/02/10/each-necessary-but-only-jointly-sufficient/">each necessary but only jointly sufficient</a>. It is the combination of these causes – often small and innocuous failures – that is the prerequisite for an incident.</p>
<p><strong>As a consequence, we can’t isolate a single root cause.</strong></p>
<p>One reason we tend to look for a single, simple cause of an outcome is that the failure is too complex to hold in our heads. Thus we oversimplify without really understanding the failure’s nature, and then blame particular, local forces or events for outcomes.</p>
<p>One of the things I like about the <a href="https://gist.github.com/mlafeldt/6e02ea0caeebef1205b47f31c2647966">postmortem template</a> I mentioned last time is that it says “Root Cause<em>s</em>”, not “Root Cause”. For me, that’s a testament to the fact that you need to look deeper if you only have a single root cause.</p>
<p>But even “Root Causes” might not be the best term, as Andy Fleener has pointed out to me <a href="https://twitter.com/andyfleener/status/778586201313845250">on Twitter</a>:</p>
<blockquote>
<p>I definitely prefer “Contributing Conditions” over Root Causes though. Cause can only be constructed with the benefit of hindsight</p>
</blockquote>
<p>That’s a good point that made me consider modifying our own postmortem template as well.</p>
<h2 id="hidden-biases-in-our-thinking">Hidden biases in our thinking</h2>
<p><em>Hindsight bias</em> continues to be the main obstacle to incident investigation. This cognitive bias, also known as the knew-it-all-along effect, describes the tendency of people to overestimate their ability to have predicted an event, despite the lack of objective evidence.</p>
<p><strong>Indeed, hindsight bias makes it impossible to accurately assess human performance after an incident.</strong></p>
<p>A similar but different cognitive error is <em>outcome bias</em>, which refers to the tendency to judge a decision by its eventual outcome. It’s important to understand that <em>every</em> outcome – successful or not – is the result of a gamble. The overall complexity of our web systems always poses unknowns. We can’t eliminate uncertainty.</p>
<p>After an incident has occurred, a postmortem might find that the system has a history of “almost incidents” and that operators should have recognized the degradation in system performance before it was too late. That’s an oversimplified view though. System operations are <em>dynamic</em>. Failing components and human beings are being replaced all the time. Attribution is not that simple.</p>
<p>We therefore need to be cautious of <a href="https://betterhumans.coach.me/cognitive-bias-cheat-sheet-55a472476b18">hindsight bias and its friends</a>, and never ignore other driving forces, especially production pressure, when looking for root causes after an incident has occurred.</p>
<h2 id="human-error-is-never-a-root-cause">Human error is never a root cause</h2>
<p>It’s the easiest thing in the world to point the finger at others when things go wrong. And unfortunately, many companies still blame people for mistakes when they should really blame – and fix – their broken processes.</p>
<p>Blameless postmortems only work if we assume that everyone involved in an incident had good intentions. This ties in with the <a href="http://retrospectivewiki.org/index.php?title=The_Prime_Directive">Retrospective Prime Directive</a> (a postmortem is a special form of a retrospective), which says:</p>
<blockquote>
<p>Regardless of what we discover, we understand and truly believe that everyone did the best job they could, given what they knew at the time, their skills and abilities, the resources available, and the situation at hand.</p>
</blockquote>
<p><strong>Human error is NOT a root cause.</strong></p>
<p>We should rather look for flaws in systems and processes – the causes contributing to failure – and implement measures so that the same issues don’t happen again. It requires <em>systems thinking</em>, which focuses on cyclical rather than linear cause and effect, to view the system as a whole in order to find out how it drifted into failure – both at a technical and organizational level.</p>
<p>Here’s one of my favorite passages on the topic, taken from <a href="https://landing.google.com/sre/book.html">the SRE book</a>:</p>
<blockquote>
<p>When postmortems shift from allocating blame to investigating the systematic reasons why an individual or team had incomplete or incorrect information, effective prevention plans can be put in place. You can’t “fix” people, but you can fix systems and processes to better support people making the right choices when designing and maintaining complex systems.</p>
</blockquote>
<p><a href="https://www.unwiredcouch.com/2014/08/04/human-error-getting-off-the-hook.html">Does this mean that operators are off the hook? No, not at all.</a> They’re the ones with the most knowledge surrounding the incident. For example, they know first-hand how the system failed in surprising ways. Hence, they’re responsible for finding ways to make the system more resilient – including writing a postmortem.</p>
<h2 id="practical-example">Practical example</h2>
<p>Let’s wrap this up with an example from an actual postmortem.</p>
<p>Last time I told you a story about a recent outage at Jimdo. In a nutshell: To fix a broken deployment of our API service, I wanted to delete the corresponding ECS service in our AWS staging account. Unfortunately, I actually removed the service in our <em>production</em> account, causing our API to be down for half an hour. <a href="https://twitter.com/mlafeldt/status/772769070555037696">Oops!</a></p>
<p><a href="https://gist.github.com/mlafeldt/01953b48b0be5fea34c11a8a47d1e7f4">In the postmortem that followed</a>, we identified two root causes:</p>
<ol>
<li>The tool we’re using to log into AWS accounts made it difficult to figure out in which account one operates. This contributed to deleting the ECS service in the wrong account. (<a href="https://github.com/Luzifer/awsenv/issues/11">We subsequently improved the tool</a> to show more helpful environment information.)</li>
<li>The component that deploys services to our PaaS didn’t notice that the underlying ECS service was gone and still tried to update the (non-existing) service – which failed. This made it impossible to re-deploy the API service without manually removing all remaining pieces of the original service. (We’ve since fixed the operation to be idempotent.)</li>
</ol>
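<p>The idempotency fix in item 2 boils down to “update the service if it still exists, otherwise recreate it instead of failing”. Here’s a sketch in shell, with <code class="language-plaintext highlighter-rouge">service_exists</code> stubbed out in place of a real <code class="language-plaintext highlighter-rouge">aws ecs describe-services</code> call – all names are hypothetical, not our actual deployer code:</p>

```shell
#!/bin/sh
# Stub: pretend only "existing-service" is present. A real check would
# query ECS, e.g. `aws ecs describe-services --cluster "$1" --services "$2"`
# and inspect the returned service status.
service_exists() {
  [ "$2" = "existing-service" ]
}

# Idempotent deploy: update when present, (re)create when gone.
deploy() {
  cluster=$1
  service=$2
  if service_exists "$cluster" "$service"; then
    echo "update-service $service"
  else
    # Before the fix, this path simply failed with "service not found".
    echo "create-service $service"
  fi
}

deploy production existing-service   # prints "update-service existing-service"
deploy production missing-service    # prints "create-service missing-service"
```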
<p>I’m sure that if we had looked closer, we would have found more root causes contributing to the failure, but we stopped here, did our homework, and moved on.</p>
<p>One more time: <strong>You can’t fix people, but you can fix systems and processes to better support them.</strong></p>
<p>Keep this in mind when you write your next postmortem.</p>
Writing Your First Postmortem · 2016-09-21 · https://sharpend.io/writing-your-first-postmortem

<p>I’m one of the operators of <a href="https://speakerdeck.com/mlafeldt/a-journey-through-wonderland">Wonderland</a>, Jimdo’s in-house PaaS for microservices.</p>
<p>Two weeks ago, on September 5, I did something embarrassing at work.</p>
<p>We were debugging a broken deployment of our central API service. This API is nothing less than the entry point for managing all container-based services running on our platform, including most of our own system services (by virtue of dogfooding).</p>
<p>In an attempt to fix the problem we were experiencing – our API service failed to scale to a certain number of replicas – I deleted what I believed to be a duplicate instance of the corresponding ECS service in the AWS Management Console…</p>
<p>That turned out to be a mistake.</p>
<p>Instead of performing this action in our AWS staging account, as I intended to do, I accidentally deleted the ECS service in our <em>production</em> account. Worse, I did not delete some duplicate; it was the real thing.</p>
<p>To make a long story short, this blunder caused our API to be down for 31 minutes, mainly because it took us a long time to figure out how to redeploy the broken API service.</p>
<p>Guess what I did immediately after resolving the incident and telling our users the good news?</p>
<p><strong><a href="https://gist.github.com/mlafeldt/01953b48b0be5fea34c11a8a47d1e7f4">I started writing a postmortem</a>. Not because I had to, but because I know that postmortems are the ultimate tool to learn from incidents.</strong></p>
<h2 id="postmortems-101">Postmortems 101</h2>
<p>So, what is a postmortem?</p>
<p>A postmortem is a written record of an incident. Among other things, it documents the incident’s impact, what caused it, the actions taken to mitigate or fix it, and how to prevent it from happening again.</p>
<p>I will tell you more about the ingredients of a good postmortem later in this article. For now, I want you to understand that fixing the underlying issue(s) of an incident is important – but not enough. We also need a formalized process to learn from these incidents.</p>
<p>That’s what postmortems are for. Postmortems help us understand why incidents happen and how we might better prepare our systems for the future. We share this knowledge with other teams or, better yet, <a href="https://github.com/danluu/post-mortems">make postmortems public</a> so that more people can benefit from them – and see that we actually <em>care</em>.</p>
<p><strong>Failure isn’t a disaster, it’s a learning opportunity.</strong></p>
<p>Similar to Chaos Engineering, conducting postmortems requires a fundamental <a href="/chaos-engineering-a-shift-in-mindset">shift in the mindset</a> of managers and employees, if not whole companies. And like Chaos Engineering, postmortems have the potential to make a company more resilient as a whole.</p>
<p>We need to embrace failure for postmortems to work. This includes having <a href="https://codeascraft.com/2012/05/22/blameless-postmortems/">blameless postmortems</a> and no finger-pointing when looking for root causes. (By the way, human error is <em>never</em> a root cause, but more on that in the second part of this article.)</p>
<p>When to write a postmortem? That’s up to you. A good rule of thumb is to write one if an incident has an immediate impact on users (downtime or data loss) or if a stakeholder asks for it. Putting together a decent postmortem often takes a couple of hours, but the effort is usually worth it! Just remember to start the work as soon as possible, with events still fresh in mind.</p>
<h2 id="postmortem-template">Postmortem template</h2>
<p>Inspired by another infrastructure team at Jimdo, we started using the <strong><a href="https://gist.github.com/mlafeldt/6e02ea0caeebef1205b47f31c2647966">Example Postmortem from the SRE book</a></strong> as a template for the postmortems we do for Wonderland.</p>
<p>If you follow the link, you’ll find a Markdown version of said template. I created it from the PDF book for two reasons: First, I wanted to share our postmortems as part of our standard Wonderland documentation, so that our users (other Jimdo teams) can easily find them. Second, all of our development is based on GitHub and we’re used to writing and reviewing Markdown files. In other words, my goal was to reduce barriers to reading, writing, and publishing our postmortems.</p>
<p>As far as I can tell, this initiative was a success. I’ve been astonished how quickly my teammates have adopted the new template. Ever since I published the first postmortem in this manner, they’ve been eager to do the same after a new incident has occurred. I’m aware that this is, for the most part, a matter of having the right mindset. However, it certainly doesn’t hurt to make the process more pleasant for everyone involved.</p>
<p>Publishing more postmortems ultimately means being more transparent about failures, which in turn builds trust in our team and makes our platform more reliable. We’re an internal service provider after all.</p>
<h2 id="your-first-postmortem">Your first postmortem</h2>
<p>Now I encourage you to <a href="https://gist.github.com/mlafeldt/6e02ea0caeebef1205b47f31c2647966">use the template</a> as a foundation for your next – or perhaps first – postmortem. Give it a read. If there wasn’t an incident in the last few days (I hope so!), think of the last time you had to deal with an outage. Then go through the different sections and try to fill in the blanks. Be consistent. Use the active voice throughout the document. Settle on a time zone and format. As with most templates, feel free to customize it to your own needs.</p>
<p>Here’s a summary of what each section in the template postmortem is about, including short examples:</p>
<ul>
<li><strong>Title</strong> – Name of the postmortem, e.g. “Deployer API Outage Postmortem”</li>
<li><strong>Date</strong> – When the incident happened, e.g. “2016-09-05”</li>
<li><strong>Authors</strong> – List of people who wrote the postmortem (GitHub handles work fine)</li>
<li><strong>Status</strong> – Current status of the postmortem, e.g. “Complete, action items in progress”</li>
<li><strong>Summary</strong> – A one-sentence summary of the incident, usually something like “Service X was down for N minutes due to Y”</li>
<li><strong>Impact</strong> – The incident’s impact on customers and, if known, revenue or reputation, e.g. “Users were unable to do X while service Y was unavailable from 09:29 to 10:00 UTC”</li>
<li><strong>Root Causes</strong> – A list of causes that have contributed to the incident (I’ll cover root causes in part 2 of this article)</li>
<li><strong>Trigger</strong> – What triggered the outage? e.g. “Merging pull request X which started the rollout of broken software Y”</li>
<li><strong>Resolution</strong> – The action(s) that mitigated and resolved the outage, e.g. “Disabling feature X helped to mitigate the problem. Rolling back to version Y resolved it.”</li>
<li><strong>Detection</strong> – How the problem was noticed, e.g. “Pingdom detected that service X was down and paged on-call via PagerDuty”</li>
<li><strong>Action Items</strong> – A list of actions taken (with links to GitHub issues) to mitigate or resolve the incident, and to prevent it from recurring</li>
<li><strong>Lessons Learned</strong> – What went well? What went wrong? And what was sheer luck?</li>
<li><strong>Timeline</strong> – A detailed timeline of the events related to the incident</li>
<li><strong>Supporting Information</strong> – Additional graphs, screenshots, command output, etc.</li>
</ul>
<p>Postmortems are a collaborative effort thriving on feedback. Make sure to share first drafts internally with your team. Once the review is complete, share the postmortem with as many people as possible.</p>
<p>Always keep in mind: <strong>Failure isn’t a disaster, it’s a learning opportunity for the whole company.</strong></p>
<hr />
<p>That’s the end of part 1. <a href="/on-finding-root-causes">In part 2</a>, I’m going to dive into the wonderful world of <em>root causes</em>, probably the most important – and most difficult – element in conducting a postmortem.</p>
Systems blindness and how we deal with it · 2016-09-07 · https://sharpend.io/systems-blindness-and-how-we-deal-with-it

<p><img src="/assets/images/systems-blindness.jpg" alt="" /></p>
<p>In what must have been an impulse purchase, I bought a paperback copy of Daniel Goleman’s book “Focus: The Hidden Driver of Excellence” a couple of weeks ago. I’m still only halfway through this hard-to-follow mishmash of ideas, all supposedly related to the overall theme of focus in a distracted world. There’s this one chapter in the book, however, that stands out to me. The chapter is titled “System Blindness” and it provides enough insights – food for thought – that I don’t regret reading the book. Here’s what I learned.</p>
<h2 id="mental-models-and-outages">Mental models and outages</h2>
<p>Systems are invisible to our eyes. We try to understand them indirectly through mental models and then perform actions based on these models.</p>
<p>As I wrote before, <a href="/simplicity-a-prerequisite-for-reliability">simplicity is a prerequisite for reliability</a>. The more complex a system, the more difficult it is to build a mental model of the system, and the harder it becomes to operate and debug it.</p>
<p>As a matter of fact, the majority of outages are self-inflicted. We’re thinking about a change we’re going to make, but we don’t necessarily anticipate the negative consequences it might have on the system as a whole. We push a bad configuration or deploy a buggy Docker image and all of a sudden the website goes down. It has happened to all of us. I, for one, have certainly caused <a href="https://twitter.com/mlafeldt/status/772769070555037696">my fair share of outages</a>.</p>
<p>For lack of a better word, I always used to refer to <a href="/unintended-consequences">“unintended consequences”</a> whenever I talked about these unexpected drawbacks that go along with complexity.</p>
<p>Then came Goleman’s book, which introduced me to the term <strong>systems blindness</strong>.</p>
<h2 id="what-is-systems-blindness">What is systems blindness?</h2>
<blockquote>
<p>Systems blindness is the main thing we struggle with in our work. What we think of as “side effects” are misnamed. […] In a system there are no side effects – just effects, anticipated or not. What we see as “side effects” simply reflect our flawed understanding of the system. In a complex system […] cause and effect may be more distant in time and space than we realize.</p>
</blockquote>
<p>According to Goleman, one of the worst results of systems blindness occurs when we implement a strategy to fix a problem but ignore the involved system dynamics. Many problems are, unfortunately, too macro or micro for us to notice directly.</p>
<p>One example he gives is that of building more and wider roads to avoid traffic jams, which will eventually lead to even more traffic as people will take advantage of the better travel connections and move further away from urban areas. Another example is global warming. Energy and climate are a system. <em>Everything</em> that we’re doing is part of the healing of that system. It’s a systemic problem; climate meetings and agreements can only do so much.</p>
<p>Likewise, everything that we’re doing affects the success and failure of our software systems. When something doesn’t work as expected, I want you to remember this lesson: <strong>There are no side effects, just effects that result from our flawed understanding of the system.</strong></p>
<h2 id="illusion-of-explanatory-depth">Illusion of explanatory depth</h2>
<p>Another phenomenon related to systems blindness is the “illusion of explanatory depth”.</p>
<p>We often believe we understand how something works when in reality our understanding is superficial at best. In our industry, this illusion becomes apparent when trying to explain <em>in depth</em> how technology X works, where X might be: Kubernetes, the TCP/IP stack, Linux syscalls, AES encryption, consensus over Raft, the list goes on and on.</p>
<p>And even if someone has comprehensive knowledge of certain technologies, there’s still the challenge of grasping the dynamics – the feedback loops – of the larger system in which they’re embedded.</p>
<p>Distributed systems are hard for a reason.</p>
<h2 id="patterns-and-rules">Patterns and rules</h2>
<p>To some degree, biology is to blame for our imperfect systems understanding:</p>
<blockquote>
<p>[In contrast to self-awareness and empathy], there seems to be no dedicated network or circuitry in the brain that gives us a natural inclination toward systems understanding. We learn how to read and navigate systems through the remarkable general learning talents of the neocortex.</p>
</blockquote>
<p>In other words, systems thinking – <a href="http://www.umsl.edu/~sauterv/analysis/bees/">the cure for systems blindness</a> – is a skill that must be learned, just like reading or programming. (Goleman notes that computer games can teach us how to experiment with complex systems.)</p>
<p>At the same time, we humans excel at “detecting and mapping the patterns and order that lie hidden within the chaos of the natural world” because our very survival depends on it.</p>
<blockquote>
<p>We live within extremely complex systems, but engage them lacking the cognitive capacity to understand or manage them completely. Our brain has solved this problem by finding means to sort through what’s complicated via simple decision rules [e.g. trusting other people]</p>
</blockquote>
<p>Our built-in pattern detector is able to simplify complexity into manageable decision rules.</p>
<p>This is a major reason why we, especially in IT, are obsessed with data (Big Data, anyone?). We strive to make the workings of our production systems visible by gathering <em>and</em> curating enough data points, like metrics and logs, that the dynamics of these systems become palpable. Armed with the myriad of supporting tools available today, we look for meaningful patterns within that data – knowing where to focus in a system is key – and take actions based on these patterns.</p>
<p><strong>It is this observability that, in the absence of perfect understanding, helps us deal with the complexity of our systems.</strong></p>
<p>I would like to thank Goleman for making me realize this connection.</p>
The Obvious, the Easy, and the Possible · 2016-08-24 · https://sharpend.io/the-obvious-the-easy-and-the-possible

<p><img src="/assets/images/buckets.jpg" alt="" /></p>
<p>Last year, when the show was still running, I used to binge-listen to <em>Work In Progress</em>, the podcast with Jason Fried (founder and CEO of Basecamp) and Nathan Kontny (CEO of Highrise).</p>
<p><a href="https://www.youtube.com/watch?v=9UELVcyTJK4">In one of my favorite episodes</a>, Fried talks about the idea of thinking about product development and interface design in terms of three buckets: the obvious, the easy, and the possible.</p>
<p>In this article, I’m going to explain what this idea is all about and why I think it’s so powerful – even if you’re more of a programmer person and don’t think of yourself as someone who can design web applications.</p>
<h2 id="the-three-buckets">The three buckets</h2>
<p>Whenever you build a product or feature, you have to figure out what matters – and to what degree – and what does not. You then focus development efforts appropriately.</p>
<p>According to Fried, you have to understand which things go in which bucket by asking yourself these three questions:</p>
<ul>
<li>
<p><strong>What needs to be obvious?</strong> The thing(s) people do all the time – the core of the product – should be obvious. On Twitter, for example, it’s obvious how to send a tweet thanks to the big blue <em>Tweet</em> button in the upper right corner. Not everything can be obvious, of course, because screen real estate, attention span, etc. are limited resources. It’s your job to decide.</p>
</li>
<li>
<p><strong>What should be easy?</strong> The things people do frequently, but not always, should be easy. It can be hard to know the difference between the two, though. Byword, <a href="/blog/writing-apps-for-osx">my writing app of choice</a>, makes it easy to export a Markdown document to PDF via the <em>File</em> menu. In a writing app, exporting isn’t something people do all the time, yet it’s done often enough that it should be easy to accomplish.</p>
</li>
<li>
<p><strong>What should be possible?</strong> The things people do sometimes – or rarely – should at least be possible. On GitHub, it’s possible (but not obvious or particularly easy) to create a new OAuth token via <em>Settings -> Personal access tokens -> Generate new token -> Confirm password -> Enter token details</em>. As there tend to be way more things in this bucket than in the other ones, you should ask yourself whether making something possible at all is the right choice.</p>
</li>
</ul>
<h2 id="command-line-interfaces">Command-line interfaces</h2>
<p>You might be wondering why I, a systems guy who’s mostly into web infrastructure, care about design principles. Indeed, the only interfaces I normally “design” are command-line interfaces – the primary means of interaction with the bulk of server software.</p>
<p>But even though designing a web application and creating a CLI are two different beasts, the same principles apply. When developing a command-line tool, you too have to figure out:</p>
<ul>
<li>
<p><strong>What needs to be obvious?</strong> It’s best practice to provide commands for the most common operations. For instance, the Go tool makes it clear that <code class="language-plaintext highlighter-rouge">go build</code> will compile source code whereas <code class="language-plaintext highlighter-rouge">go test</code> will run tests. Most Go programmers use both many times a day, justifying the existence of these commands.</p>
</li>
<li>
<p><strong>What should be easy?</strong> You should offer command-line options (or sub-commands) for actions users do frequently. The <code class="language-plaintext highlighter-rouge">go get</code> command, for example, downloads and installs packages along with their dependencies. When passed the <code class="language-plaintext highlighter-rouge">-d</code> option, it will skip the installation step, providing a convenient way to fetch all of a project’s dependencies.</p>
</li>
<li>
<p><strong>What should be possible?</strong> You may allow users to enable advanced, experimental, or dangerous features via environment variables or lengthy command-line options. Setting the <code class="language-plaintext highlighter-rouge">GO15VENDOREXPERIMENT</code> variable in Go 1.5, for example, will tell the Go tool to resolve dependencies in <code class="language-plaintext highlighter-rouge">vendor/</code> directories. For other use cases, it might be enough to output data in a structured format that’s easy for other tools to process.</p>
</li>
</ul>
<h2 id="conclusion">Conclusion</h2>
<p>No matter if you’re creating a web application, command-line tool, or any other user-facing product, thinking deeply about what needs to be obvious, easy, or possible can mean the difference between building something that merely gets the job done and something that’s a joy to use.</p>
<h1>How Complex Web Systems Fail - Part 2</h1>
<p>2016-08-10 · <a href="https://sharpend.io/how-complex-web-systems-fail-part-2">sharpend.io/how-complex-web-systems-fail-part-2</a></p>
<p><img src="/assets/images/complex-2.jpg" alt="Photo by Frankie McKenzie" /></p>
<p>In his influential paper <a href="http://web.mit.edu/2.75/resources/random/How%20Complex%20Systems%20Fail.pdf">How Complex Systems Fail</a>, Richard Cook shares 18 brilliant observations on the nature of failure in complex systems. <a href="/how-complex-web-systems-fail-part-1">Part 1 of this article</a> was my attempt to translate the first nine of his observations into the context of web systems, i.e., the distributed systems behind modern web applications. In this second and final part, I’m going to complete the picture and cover the other half of Cook’s paper. So let’s get started with observation #10!</p>
<h3 id="10-all-practitioner-actions-are-gambles">10. All practitioner actions are gambles</h3>
<p>Cook notes that all actions we take in response to an accident are just <em>gambles</em>. There are things we believe we know (e.g., because we built the system in a such-and-such way), but conversely, there are also things we don’t know (and even <a href="https://en.wikipedia.org/wiki/There_are_known_knowns">ones we don’t know we don’t know</a>). The overall complexity of our web systems always poses unknowns. We can’t eliminate uncertainty – the guessing of what might be wrong and what might fix it.</p>
<p>As we learned in part 1 of this article, it’s impossible to correctly assess human performance after an accident due to cognitive errors like hindsight bias (see observation #8). A similar but distinct phenomenon is <a href="https://en.wikipedia.org/wiki/Outcome_bias">the outcome bias</a>, well illustrated by Cook:</p>
<blockquote>
<p>That practitioner actions are gambles appears clear after accidents; in general, post hoc analysis regards these gambles as poor ones. But the converse: that successful outcomes are also the result of gambles; is not widely appreciated.</p>
</blockquote>
<h3 id="11-actions-at-the-sharp-end-resolve-all-ambiguity">11. Actions at the sharp end resolve all ambiguity</h3>
<p>More often than not, companies don’t have a clear direction when it comes to “the relationship between production targets, efficient use of resources, economy and costs of operations, and acceptable risks of low and high consequence accidents”, as Cook states. I would even go so far as to say that, in the absence of hard numbers, decisions are made following someone’s gut feeling.</p>
<p>This ambiguity is resolved by actions <em>at the sharp end</em> of the system, successful or not. After a disaster has struck in production, we’ll know, for example:</p>
<ul>
<li>Management’s response to failure</li>
<li>What went well, what went wrong</li>
<li>If we need to hire another Site Reliability Engineer</li>
<li>Whether we should invest in employee training or better equipment</li>
</ul>
<p>In other words, we’re forced to <em>think and decide</em>.</p>
<p>Once again, we need to be cautious of hindsight bias and its friends, and never “ignore the other driving forces, especially production pressure” after an accident has occurred.</p>
<h3 id="12-human-practitioners-are-the-adaptable-element-of-complex-systems">12. Human practitioners are the adaptable element of complex systems</h3>
<p>It’s <em>people</em> that keep web systems up and running by incrementally improving them – adapting them to new circumstances – so that they can survive in production.</p>
<p>The paper lists the following adaptations as examples:</p>
<ul>
<li>Restructuring the system in order to reduce exposure of vulnerable parts to failure.</li>
<li>Concentrating critical resources in areas of expected high demand.</li>
<li>Providing pathways for retreat or recovery from expected and unexpected faults.</li>
<li>Establishing means for early detection of changed system performance in order to allow graceful cutbacks in production or other means of increasing resiliency.</li>
</ul>
<p>It’s surprisingly straightforward to translate this list into best practices in the field of web operations: decoupling of system components, capacity planning, graceful error handling, periodic backups, monitoring, code instrumentation, canary releases, and so on.</p>
<h3 id="13-human-expertise-in-complex-systems-is-constantly-changing">13. Human expertise in complex systems is constantly changing</h3>
<blockquote>
<p>Complex systems require substantial human expertise in their operation and management. This expertise changes in character as technology changes but it also changes because of the need to replace experts who leave.</p>
</blockquote>
<p>Furthermore, Cook writes that a complex system will always “contain practitioners and trainees with varying degrees of expertise”. Problems arise when knowledge isn’t spread equally in the team(s) responsible for the production stack.</p>
<p>In my experience, <em>pair programming</em> is a very effective way to share knowledge (yes, even in web operations). This is especially true when a legacy system is involved and your pairing partner happens to know more about it than they like to admit…</p>
<h3 id="14-change-introduces-new-forms-of-failure">14. Change introduces new forms of failure</h3>
<p>As a matter of fact, even deliberate changes to web systems will often have <a href="/unintended-consequences">unintended negative consequences</a>. There’s a high rate of change and often a variety of processes leading to those changes. This makes it hard – if not impossible – to fully understand how all the bits and pieces resonate with each other under different conditions. Put another way, web systems are largely intractable, which is a major reason why outages are both unavoidable and unpredictable.</p>
<p>Cook adds to this what I consider one of the most useful insights I gained from his paper, worth quoting in length:</p>
<blockquote>
<p>The low rate of overt accidents in reliable systems may encourage changes, especially the use of new technology, to decrease the number of low consequence but high frequency failures. These changes maybe actually create opportunities for new, low frequency but high consequence failures. [Because they] occur at a low rate, multiple system changes may occur before an accident, making it hard to see the contribution of technology to the failure.</p>
</blockquote>
<h3 id="15-views-of-cause-limit-the-effectiveness-of-defenses-against-future-events">15. Views of “cause” limit the effectiveness of defenses against future events</h3>
<p>The statement that “post-accident remedies for human error are usually predicated on obstructing activities that can cause accidents” reminds me, more than anything else, of <a href="https://en.wikipedia.org/wiki/Security_theater">airport security theater</a>, which also does little to prevent further accidents.</p>
<p>Cook urges us to <em>not</em> increase the coupling of our web systems in a knee-jerk reaction to failure:</p>
<blockquote>
<p>Instead of increasing safety, post-accident remedies usually increase the coupling and complexity of the system. This increases the potential number of latent failures and also makes the detection and blocking of accident trajectories more difficult.</p>
</blockquote>
<h3 id="16-safety-is-a-characteristic-of-systems-and-not-of-their-components">16. Safety is a characteristic of systems and not of their components</h3>
<p>Chaos theory tells us that small causes – involving human action or not – can have large effects. <em>Everything is connected</em>. The paper says:</p>
<blockquote>
<p>Safety is an emergent property of systems; it does not reside in a person, device or department of an organization or system. Safety cannot be purchased or manufactured; it is not a feature that is separate from the other components of the system.</p>
</blockquote>
<p>Consider this example: <a href="/chaos-engineering-a-shift-in-mindset">Learning to embrace failure</a> – a prerequisite for reliability – requires a fundamental shift in the mindset of managers and employees, if not whole companies. We can’t build reliable web systems by merely improving the codebase.</p>
<h3 id="17-people-continuously-create-safety">17. People continuously create safety</h3>
<p>“Failure free operations”, as we learned today, “are the result of activities of people who work to keep the system within the boundaries of tolerable performance [on a moment by moment basis]”.</p>
<p>Most of these activities are well-known processes, probably documented in a runbook, such as reverting a bad deployment. Sometimes, however, it requires “novel combinations or de novo creations of new approaches” to repair a broken system. In my experience, the latter is particularly the case with <a href="http://blog.scalyr.com/2015/09/irreversible-failures-lessons-from-the-dynamodb-outage/">irreversible failures</a>, where you can’t simply undo the action that caused it.</p>
<h3 id="18-failure-free-operations-require-experience-with-failure">18. Failure free operations require experience with failure</h3>
<p>This last point is a topic near and dear to my heart.</p>
<p>When failure is the exception rather than the rule, <a href="/complacency-the-enemy-of-resilience">we risk becoming complacent</a>.</p>
<blockquote>
<p>Recognizing hazard and successfully manipulating system operations to remain inside the tolerable performance boundaries requires intimate contact with failure.</p>
</blockquote>
<p>For me, the perfect embodiment of this idea is <em>Chaos Engineering</em>, a discipline based on the realization that proactively triggering failures in a system – through intentional actions at the sharp end – is the best way to prepare for disaster.</p>
<p>If you want to learn more about Chaos Engineering, here are some links for you to check out:</p>
<ul>
<li><a href="/chaos-engineering-101">Chaos Engineering 101</a></li>
<li><a href="/blog/chaos-monkey-for-fun-and-profit/">Talk: Chaos Monkey for Fun and Profit</a></li>
<li><a href="/a-little-story-about-amazon-ecs-systemd-and-chaos-monkey">A Little Story about Amazon ECS, systemd, and Chaos Monkey</a></li>
</ul>
<hr />
<p>What better way to end this article than to leave you with Cook’s responses to <a href="https://medium.com/@MBWyHI1QkE66/thanks-mathias-glad-to-know-that-you-found-the-thoughts-provocative-47eed48af5db">part 1</a> and <a href="https://medium.com/@MBWyHI1QkE66/great-article-mathias-7e0714d4bbdf">part 2</a>, in which he provides additional context and links to his Velocity talks. Highly recommended for further study!</p>
<h1>How Complex Web Systems Fail - Part 1</h1>
<p>2016-07-27 · <a href="https://sharpend.io/how-complex-web-systems-fail-part-1">sharpend.io/how-complex-web-systems-fail-part-1</a></p>
<p><img src="/assets/images/complex-1.jpg" alt="" /></p>
<p>There’s this one paper that keeps popping up on my radar. I think it’s about time I give it the attention it deserves. I’m talking about <a href="http://web.mit.edu/2.75/resources/random/How%20Complex%20Systems%20Fail.pdf">How Complex Systems Fail</a> by Richard Cook. This seminal paper, published in 2000, covers 18 sharp observations on the nature of failure in complex medical systems. The nice thing about these observations is that most of them hold true for complex systems in general, including our beloved <em>web systems</em>.</p>
<p>Distributed web-based systems are inherently complex. They’re composed of many moving parts – web servers, databases, load balancers, CDNs, routers, and a lot more – working together to form an intricate whole.</p>
<p>In this article, which is part 1 of 2, I’ll go through the first half of Cook’s observations, one by one, and try to translate them into the context of web systems. (<a href="/how-complex-web-systems-fail-part-2">In part 2</a>, I’ll cover the other half.)</p>
<h3 id="1-complex-systems-are-intrinsically-hazardous-systems">1. Complex systems are intrinsically hazardous systems</h3>
<p>This is certainly true for safety-critical systems in industries like medicine, transportation, or construction where errors can mean the difference between life and death. While most web systems fortunately don’t put our lives at risk, the general response to failures is the same: creating defense mechanisms against potential hazards inherent in those systems. Which brings us to the next point…</p>
<h3 id="2-complex-systems-are-heavily-and-successfully-defended-against-failure">2. Complex systems are heavily and successfully defended against failure</h3>
<p>We put countermeasures in place – backup systems, monitoring, DDoS protection, runbooks, GameDay exercises, etc. – because we dread the consequences of failure, such as service outages and data loss. These measures are supposed to “provide a series of shields that normally divert operations away from accidents”. And luckily, they’re successful most of the time.</p>
<h3 id="3-catastrophe-requires-multiple-failures--single-point-failures-are-not-enough">3. Catastrophe requires multiple failures – single point failures are not enough</h3>
<p>Cook writes:</p>
<blockquote>
<p>Overt catastrophic failure occurs when small, apparently innocuous failures join to create opportunity for a systemic accident. Each of these small failures is necessary to cause catastrophe but only the combination is sufficient to permit failure.</p>
</blockquote>
<p><em>Most</em> failure trajectories are successfully blocked by the aforementioned defenses or by the system operators themselves.</p>
<p>Later in this article, you’ll learn why there’s no such thing as a single root cause.</p>
<h3 id="4-complex-systems-contain-changing-mixtures-of-failures-latent-within-them">4. Complex systems contain changing mixtures of failures latent within them</h3>
<p>High complexity ensures there are multiple flaws – bugs – present at any given moment. Operators have to deal with ever-changing failures due to “changing technology, work organization, and efforts to eradicate failures”. Anyone who’s worked on a larger software project knows this is true. At some point, someone did something and it had an <a href="/unintended-consequences">unintended consequence</a>.</p>
<p>According to Cook, we don’t – and can’t – fix all latent bugs because of “economic cost but also because it is difficult before the fact to see how such failures might contribute to an accident”. We’re prone to think of these individual defects as “minor factors during operations”. However, as we just learned, several of these supposedly minor factors can lead to catastrophe.</p>
<h3 id="5-complex-systems-run-in-degraded-mode">5. Complex systems run in degraded mode</h3>
<p>A consequence of the preceding observation is that “complex systems run as broken systems”. Most of the time, they continue to work thanks to redundancies – database replicas, server auto scaling, etc. – and thanks to knowledgeable operators who fix problems as they arise.</p>
<p>But at some point <em>systems will fail</em>. It’s inevitable.</p>
<p>A postmortem might find that “the system has a history of prior ‘proto-accidents’ that nearly generated catastrophe” and that operators should have recognized the degradation in system performance before it was too late. However, that’s an oversimplified view. We need to realize, instead, that “system operations are dynamic, with components (organizational, human, technical) failing and being replaced continuously”. Attribution is not that simple, as you’ll see in a minute.</p>
<h3 id="6-catastrophe-is-always-just-around-the-corner">6. Catastrophe is always just around the corner</h3>
<blockquote>
<p>Disaster can occur at any time and in nearly any place. The potential for catastrophic outcome is a hallmark of complex systems. It is impossible to eliminate the potential for such catastrophic failure; the potential for such failure is always present by the system’s own nature.</p>
</blockquote>
<p>Just because there are no problems now doesn’t mean it’s going to stay that way. Sooner or later, any complex system will fail. That’s why operators should never get too comfortable.</p>
<p>As I wrote in the past, <a href="/complacency-the-enemy-of-resilience">complacency is the enemy of resilience</a>. The longer you wait for disaster to strike in production – merely hoping that everything will be okay – the less likely you are to handle emergencies well, both at a technical and organizational level.</p>
<h3 id="7-post-accident-attribution-to-a-root-cause-is-fundamentally-wrong">7. Post-accident attribution to a root cause is fundamentally wrong</h3>
<p>In complex systems, such as web systems, <a href="http://www.kitchensoap.com/2012/02/10/each-necessary-but-only-jointly-sufficient/">there is no root cause</a>. Instead, accidents require multiple contributors, each necessary but only jointly sufficient. In the words of Cook:</p>
<blockquote>
<p>Indeed, it is the linking of these causes together that creates the circumstances required for the accident. Thus, no isolation of the ‘root cause’ of an accident is possible.</p>
</blockquote>
<p>One of the reasons we tend to look for a single, simple cause of an outcome is that the failure is too complex to hold in our heads. Thus we oversimplify without really understanding the failure’s nature and then “blame specific, localized forces or events for outcomes”.</p>
<h3 id="8-hindsight-biases-post-accident-assessments-of-human-performance">8. Hindsight biases post-accident assessments of human performance</h3>
<p>The key point made here:</p>
<blockquote>
<p>Hindsight bias remains the primary obstacle to accident investigation, especially when expert human performance is involved.</p>
</blockquote>
<p>Wikipedia has a good explanation of <a href="https://en.wikipedia.org/wiki/Hindsight_bias">hindsight bias</a>:</p>
<blockquote>
<p>Hindsight bias, also known as the knew-it-all-along effect […] is the inclination, after an event has occurred, to see the event as having been predictable, despite there having been little or no objective basis for predicting it.</p>
</blockquote>
<p>Which tells us that it’s <em>impossible</em> to accurately assess human performance after an accident, e.g. when doing a postmortem. Still, many companies continue to blame people for mistakes when they should really blame – and fix – their <em>broken processes</em>.</p>
<h3 id="9-human-operators-have-dual-roles-as-producers-and-as-defenders-against-failure">9. Human operators have dual roles: as producers and as defenders against failure</h3>
<p>Operators actually have not one but <em>two</em> roles, each with its own demands. On the one hand, they operate the system so that it can do what it’s supposed to do. On the other hand, they defend the system against failures. According to Cook, this poses the following problem:</p>
<blockquote>
<p>Outsiders rarely acknowledge the duality of this role. In non-accident filled times, the production role is emphasized. After accidents, the defense against failure role is emphasized. At either time, the outsider’s view misapprehends the operator’s constant, simultaneous engagement with both roles.</p>
</blockquote>
<p>This duality reminds me, in a way, of today’s Site Reliability Engineers, who are responsible for ensuring that services are available and fast enough, and who also improve the software and systems behind those services. This dual role is, in fact, at the heart of SRE. I’m glad our industry has started to embrace this idea.</p>
<hr />
<p>That’s the end of part 1. Of course, I encourage you to read the <a href="http://web.mit.edu/2.75/resources/random/How%20Complex%20Systems%20Fail.pdf">original treatise</a> to get the whole picture. I found a lot of value in it – and so might you.</p>
<p><a href="/how-complex-web-systems-fail-part-2">Continue with part 2 of this article</a>.</p>
<h1>Toil: A Word Every Engineer Should Know</h1>
<p>2016-07-13 · <a href="https://sharpend.io/toil-a-word-every-engineer-should-know">sharpend.io/toil-a-word-every-engineer-should-know</a></p>
<p><img src="/assets/images/toil.jpg" alt="" /></p>
<p><em>Toil</em>.</p>
<p>I’ve stumbled upon this word a few times already, mostly in the context of automation or system administration. I knew it meant something negative, something to be avoided. But to be honest, it was only through writing this article – a useful technique for closing knowledge gaps, by the way – that I figured out what the word actually means in the world of engineering.</p>
<p>Here’s what I’ve learned.</p>
<h2 id="what-exactly-is-toil">What exactly is toil?</h2>
<p>For research, I’ve consulted two of my favorite technical books. <a href="http://the-cloud-book.com/">The Cloud book</a> mentions the term a couple of times in regard to automation and lists ways to assess and limit the amount of toil. <a href="https://landing.google.com/sre/book.html">The SRE book</a>, the primary source for this article, even has an entire chapter on toil and how to eliminate it.</p>
<p>The SRE book defines toil as follows:</p>
<blockquote>
<p>In SRE, we want to spend time on long-term engineering project work instead of operational work. Because the term operational work may be misinterpreted, we use a specific word: toil. […] Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.</p>
</blockquote>
<p>More precisely, the following work can be considered toil:</p>
<ul>
<li>Work such as manually running a script (even if the script itself automates some task)</li>
<li>Work that is performed over and over again</li>
<li>Work that could just as well be accomplished by a machine (human judgment isn’t essential)</li>
<li>Work that is interrupt-driven and reactive, like pager alerts, rather than strategy-driven and proactive</li>
<li>Work that does not <em>permanently</em> improve your service once completed</li>
<li>Work that scales up linearly with service size, traffic volume, or user count</li>
</ul>
<h2 id="why-is-toil-a-bad-thing">Why is toil a bad thing?</h2>
<p>Toil is not a bad thing per se. There’ll always be some amount of toil – unavoidable grunt work – that you need to take care of as an engineer. You might even look forward to these quick wins from time to time. That’s fine. There’s a problem, however, if toil makes up the <em>majority</em> of your work.</p>
<p>Here are some of the reasons why too much toil is harmful, again taken from the SRE book:</p>
<ul>
<li>Too much toil leads to burnout, boredom, and discontent.</li>
<li>Manual work and firefighting will prevail at the expense of shipping new features.</li>
<li>Too much toil creates confusion about what SRE actually entails.</li>
<li>Other (development) teams may start expecting SREs to take on even more toil.</li>
<li>Current or future teammates are more likely to look for another job, esp. if they were promised project work.</li>
<li>Your career will stagnate if you spend too little time on <em>long-term engineering projects</em>.</li>
</ul>
<p>I think the last aspect is particularly important. But what exactly qualifies as engineering work?</p>
<h2 id="engineering-work-defined">Engineering work defined</h2>
<blockquote>
<p>Engineering work is novel and intrinsically requires human judgment. It produces a permanent improvement in your service, and is guided by a strategy. It is frequently creative and innovative, taking a design-driven approach to solving a problem – the more generalized, the better. Engineering work helps your team or the SRE organization handle a larger service, or more services, with the same level of staffing.</p>
</blockquote>
<p>Examples include:</p>
<ul>
<li>Creating automation scripts, tools, or frameworks</li>
<li>Adding service features for scalability and reliability</li>
<li>Modifying infrastructure code to make it more robust</li>
<li>Configuring and tuning production systems (servers, load balancers, etc.)</li>
<li>Setting up monitoring and alerting</li>
<li>Writing documentation that has long-term value</li>
<li>Consulting on architecture, design and putting systems into production</li>
</ul>
<h2 id="the-5050-rule">The 50/50 rule</h2>
<p>Without a doubt, knowing what work falls into which category – toil or long-term engineering work – is useful. Another thing you need to know is how to split your valuable time between the two.</p>
<p>Fortunately, the SRE book has a pragmatic answer for this too:</p>
<blockquote>
<p>Our SRE organization has an advertised goal of keeping operational work (i.e., toil) below 50% of each SRE’s time. At least 50% of each SRE’s time should be spent on engineering project work that will either reduce future toil or add service features. Feature development typically focuses on improving reliability, performance, or utilization, which often reduces toil as a second-order effect.</p>
</blockquote>
<p>From now on, I’m going to strive for spending, on average, at least 50% of my time on engineering work. Furthermore, it’s time for me to stop using the phrase “monkey work” because, as we learned today, there exists a much better, more empathetic word.</p>
<h1>If it hurts, do it more often</h1>
<p>2016-06-30 · <a href="https://sharpend.io/if-it-hurts-do-it-more-often">sharpend.io/if-it-hurts-do-it-more-often</a></p>
<p><img src="/assets/images/lift.jpg" alt="" /></p>
<p>There’s a good chance that regular exercise will help you become a healthier person. The more often you go to the gym and lift weights (or whatever you like to do), the more strength you will gain, and as a result, the easier it becomes to repeat the same exercise. This is possible because your body will slowly but gradually adapt to the stress of exercise.</p>
<p>While that’s obviously a simplified explanation – there’s a lot more to be said about adaptation and exercising the right way without getting injured – this is generally true:</p>
<p>Practice something long enough and you will get better at it. Or putting it in a more catchy way: <strong>If it hurts, do it more often.</strong> Eventually, the pain will fade away.</p>
<p>The great thing about this principle is that it’s not limited to sports – you can apply it to any useful activity that can be done frequently:</p>
<ul>
<li>Do you consider yourself to be a mediocre speaker? Keep giving presentations at local user groups until you feel more comfortable.</li>
<li>Do you have trouble writing blog posts? Make it a habit to <a href="/blog/write-every-day">write 250 words a day</a> and the pain will vanish (mostly).</li>
<li>A different kind of example: Entrepreneurs are supposed to wear many hats. By spending time in customer service, they can learn what it takes to provide great support – and only then consider hiring for the position.</li>
</ul>
<p>Now let’s see how this idea applies to software development, using continuous integration as an example.</p>
<h2 id="the-pain-of-continuous-integration">The pain of continuous integration</h2>
<p>In his post, <a href="http://martinfowler.com/bliki/FrequencyReducesDifficulty.html">Frequency Reduces Difficulty</a>, Martin Fowler writes:</p>
<blockquote>
<p>Most programmers learn early on that integrating their work with others is a frustrating and painful experience. The natural human response, therefore, is to put off doing it for as long as possible.</p>
</blockquote>
<p>He continues:</p>
<blockquote>
<p>[…] if you do it more frequently, you can drastically reduce the pain. And this is what happens with Continuous Integration – by integrating every day, the pain of integration almost vanishes. It did hurt, so you did it more often, and now it no longer hurts.</p>
</blockquote>
<p>But why is doing <em>painful</em> things over and over a good idea?</p>
<p>According to Fowler, there are three main reasons, all of them manifested in agile thinking:</p>
<ul>
<li>
<p><strong>Decomposition.</strong> Before you can execute tasks more frequently, you need to decompose large tasks into smaller chunks to make them easier to handle. Today we mainly talk about decomposing applications into microservices. However, breaking complex systems down into smaller classes or objects has been a best practice of object-oriented programming for decades.</p>
</li>
<li>
<p><strong>Feedback.</strong> In software development, <a href="/fast-feedback-is-everything">fast feedback is important</a> in order to adjust more quickly to changes. Fast feedback involves reducing the time between modifying code and getting test results. Being able to do something more frequently, like integrating code changes, leads to faster feedback.</p>
</li>
<li>
<p><strong>Automation.</strong> As I touched on earlier, practice – together with reflection – helps you improve any skill. Having to redo a time-consuming task many times makes you want to automate it because you know that <em>reliable</em> automation increases speed and reduces errors. After feeling the pain for some time, you’re in a better position to automate the task based on newly acquired knowledge.</p>
</li>
</ul>
<p><strong>So pain doesn’t necessarily have to be a bad thing. It can lead to better systems that are decomposed, automated, and easier to change and reason about.</strong></p>
<h2 id="our-staging-environment">Our staging environment</h2>
<p>We, the members of Jimdo’s Werkzeugschmiede team, also like to feel the pain before doing something about it.</p>
<p>For example, before investing time in setting up a proper staging environment, we had experienced how difficult it was to test changes prior to deploying them to production. Later, when staging was in place, we learned the hard way that our automation wasn’t as reliable as we thought. We ignored the problem for a while – again feeling the pain – until we finally decided to rebuild staging from scratch once a week, thereby ironing out any flaws in our automation.</p>
<p>There are still other things that hurt us in different ways. Let’s do those things more often and see where it leads us…</p>
A Little Story about Amazon ECS, systemd, and Chaos Monkey2016-06-13T00:00:00+02:00https://sharpend.io/a-little-story-about-amazon-ecs-systemd-and-chaos-monkey<p>Today I want to share a little story. The story is about the challenge of making one component of our core infrastructure more resilient to failures.</p>
<p>Before jumping to the meat of it, however, it will be helpful to have an understanding of what the infrastructure stack in question does and how it works.</p>
<h2 id="a-quick-look-at-wonderland">A quick look at Wonderland</h2>
<p>I’m currently working as an Infrastructure Toolsmith in Jimdo’s Werkzeugschmiede team. In brief, our goal is to provide a platform (PaaS) where all Jimdo developers can deploy and run their Dockerized applications with as little friction as possible.</p>
<p>Our platform, which we internally call <em>Wonderland</em>, utilizes the Amazon EC2 Container Service (ECS). ECS is essentially a cluster scheduler that maps a set of work (batch jobs or long-running services) to a set of resources (EC2 instances). To connect instances to an ECS cluster, one has to install and run a daemon called <em>ecs-agent</em> on them. ecs-agent will then start Docker containers on behalf of ECS. Wonderland, in turn, talks to the ECS API whenever a user triggers a deployment.</p>
<p>There are actually more moving pieces involved, but that’s the gist of it.</p>
<h2 id="what-can-go-wrong">What can go wrong</h2>
<p>For our setup to work, it’s important that the following conditions are met on each cluster instance:</p>
<ul>
<li>ecs-agent is running properly to get instructions from ECS.</li>
<li>Docker daemon is running properly to start containers on behalf of ecs-agent.</li>
<li>/var/lib/docker is backed by a functioning Amazon EBS volume with enough storage for Docker.</li>
</ul>
<p>According to Murphy’s law, <strong>anything that can go wrong will go wrong</strong>. Here are some of the problems we had to deal with in the past:</p>
<ul>
<li>Deploying a newer (and broken) version of ecs-agent that would disconnect from ECS after a while. We did a rollback. These days, we do more testing before updating ecs-agent in production.</li>
<li>Docker daemon becoming unresponsive. After lots of debugging, we found out this was due to the Datadog agent collecting disk usage metrics per container. After disabling the corresponding “container_size” metric, the problem was gone.</li>
<li>EBS volume running out of inodes (“no space left on device”), which was caused by Docker containers with many small files in them. The fix was to switch the underlying filesystem from EXT4 to XFS.</li>
<li>Formatting of EBS volume failed with “mkfs.xfs: /dev/xvdb appears to contain an existing filesystem (xfs)”. Without the mounted device, Docker quickly filled up the small root filesystem. Only happened once, though. We ignored it.</li>
</ul>
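<p>The inode exhaustion in particular is easy to miss because <code class="language-plaintext highlighter-rouge">df -h</code> still shows free space. A quick check might look like this (a sketch – on a cluster instance you’d point it at /var/lib/docker):</p>

```shell
# Print inode usage (the IUse% column of `df -i`) for the filesystem
# backing a given path. A value near 100% means creating new files
# fails with "no space left on device" even though `df -h` still
# reports free blocks.
inode_usage() {
  df -i "$1" | awk 'NR==2 {print $5}'
}

echo "inodes used on /: $(inode_usage /)"
```

<p>Switching the filesystem to XFS made this failure mode go away for us, since XFS allocates inodes dynamically instead of reserving a fixed number at mkfs time.</p>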
<h2 id="self-healing-infrastructure">Self-healing infrastructure</h2>
<p>Of course, we want our infrastructure to be resilient to failures. If an EC2 instance becomes unhealthy – if it’s unable to run Docker containers for whatever reason – it should be replaced without any manual intervention.</p>
<p>To achieve this, all of our cluster instances are managed by an EC2 auto scaling group. The associated load balancer is configured to send an HTTP request to each instance to perform a health check (the specific endpoint is <code class="language-plaintext highlighter-rouge">:51678/v1/metadata</code>). If ecs-agent does not respond with “200 OK” for some time, the instance will be terminated and a new one will be started immediately.</p>
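<p>You can reproduce the load balancer’s probe by hand. Here’s a minimal sketch of such a check – the hostname and timeout are made up, while the port and path are the ones mentioned above:</p>

```shell
# Probe the ecs-agent introspection endpoint the way the load balancer
# does: -f makes curl exit non-zero on HTTP errors, --max-time guards
# against a hung agent.
check_ecs_agent() {
  curl -sf --max-time 5 "http://${1:-localhost}:51678/v1/metadata" >/dev/null
}

if check_ecs_agent some-cluster-instance; then
  echo "agent healthy"
else
  echo "agent unreachable or unhealthy"
fi
```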
<p>That’s the theory. In practice, monitoring ecs-agent alone is not enough, as the endpoint in question will happily report that everything is fine even if, for example, /var/lib/docker isn’t writable…</p>
<p>Now what?</p>
<h2 id="systemd-to-the-rescue">systemd to the rescue</h2>
<p>For the cluster instances, we decided to use CoreOS, a container-focused Linux distribution optimized for large-scale deployments. Like most modern Linux distributions, CoreOS uses systemd as its init system.</p>
<p>These are the systemd unit files we use that are of interest:</p>
<ul>
<li>format-ephemeral.service – formats the EBS volume using mkfs.xfs</li>
<li>var-lib-docker.mount – mounts the EBS volume to /var/lib/docker</li>
<li>docker.service – starts the Docker daemon (this unit ships with CoreOS)</li>
<li>ecs-agent.service – starts ecs-agent</li>
</ul>
<p>From the beginning, we had requirement dependencies in place (via <a href="https://www.freedesktop.org/software/systemd/man/systemd.unit.html#Requires=">Requires=</a>) to ensure units are started and stopped at the right time. For example, var-lib-docker.mount requires format-ephemeral.service, and ecs-agent.service requires docker.service to work.</p>
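<p>Expressed as a unit file, that part of the dependency chain might look like this (a simplified sketch, not our exact unit; note that <code class="language-plaintext highlighter-rouge">Requires=</code> alone doesn’t order units, which is why it’s usually paired with <code class="language-plaintext highlighter-rouge">After=</code>):</p>

```ini
# var-lib-docker.mount -- only mounted once the volume is formatted
[Unit]
Requires=format-ephemeral.service
After=format-ephemeral.service

[Mount]
What=/dev/xvdb
Where=/var/lib/docker
Type=xfs
```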
<p>We want to take advantage of the fact that <em>systemd will stop ecs-agent</em> if any of its dependencies fails, which would cause the unhealthy instance to be replaced. Self-healing infrastructure for the win!</p>
<h2 id="the-missing-link">The missing link</h2>
<p>The attentive reader will already have noticed that – until recently – we had been missing one important dependency in the chain: ecs-agent, which provides our health check endpoint, <em>absolutely</em> requires the EBS volume to be formatted and read-write mounted to /var/lib/docker. Without the volume, the cluster instance can’t work reliably.</p>
<p>I tried to remedy this fact by adding <code class="language-plaintext highlighter-rouge">Requires=var-lib-docker.mount</code> to ecs-agent.service. Some testing showed, however, that the agent won’t be stopped if the mount point is unmounted at runtime. Luckily, one of my colleagues suggested giving <a href="https://www.freedesktop.org/software/systemd/man/systemd.unit.html#BindsTo=">BindsTo=</a> a try instead:</p>
<blockquote>
<p>Configures requirement dependencies, very similar in style to <code class="language-plaintext highlighter-rouge">Requires=</code>, however in addition to this behavior, it also declares that this unit is stopped when any of the units listed suddenly disappears. Units can suddenly, unexpectedly disappear if a service terminates on its own choice, a device is unplugged or a mount point unmounted without involvement of systemd.</p>
</blockquote>
<p>Which turned out to be what we needed. With <code class="language-plaintext highlighter-rouge">BindsTo=</code> in place, I executed <code class="language-plaintext highlighter-rouge">umount /var/lib/docker</code> and <code class="language-plaintext highlighter-rouge">systemctl stop var-lib-docker.mount</code>, and in both cases ecs-agent was stopped by systemd – and the instance was terminated!</p>
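<p>In unit-file terms, the fix boils down to a single directive in ecs-agent.service (again a sketch – the real unit has more settings):</p>

```ini
# ecs-agent.service (excerpt)
[Unit]
# Unlike Requires=, BindsTo= also stops this unit when the mount
# point disappears without systemd's involvement.
BindsTo=var-lib-docker.mount
Requires=docker.service
After=var-lib-docker.mount docker.service
```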
<p>So all is well. Or is it?</p>
<h2 id="chaos-monkey-and-xfs">Chaos Monkey and XFS</h2>
<p>To be honest, neither unmounting the filesystem nor stopping the mount unit is a good way to simulate an EBS failure. I knew how to do better, though.</p>
<p><a href="/chaos-monkey-for-fun-and-profit">As a big fan of Chaos Monkey</a>, I couldn’t resist using it for a quick chaos experiment: What would happen if, instead of unmounting the EBS volume gracefully, it was <em>forcefully detached</em> at runtime? I assumed that the result would be the same: systemd would stop ecs-agent.</p>
<p>To validate my assumption, <a href="https://github.com/mlafeldt/chaosmonkey">I used a little command-line tool</a> I’ve developed to trigger chaos events via the Chaos Monkey REST API:</p>
<figure class="highlight"><pre><code class="language-sh" data-lang="sh"><span class="nv">$ </span>chaosmonkey <span class="nt">-endpoint</span> <span class="s2">"</span><span class="nv">$WONDERLAND_ENDPOINT</span><span class="s2">"</span> <span class="se">\</span>
<span class="nt">-username</span> <span class="s2">"</span><span class="nv">$WONDERLAND_USER</span><span class="s2">"</span> <span class="nt">-password</span> <span class="s2">"</span><span class="nv">$WONDERLAND_PASS</span><span class="s2">"</span> <span class="se">\</span>
<span class="nt">-group</span> wonderland-crims-CrimsAutoScalingGroup <span class="se">\</span>
<span class="nt">-strategy</span> DetachVolumes</code></pre></figure>
<p>The “DetachVolumes” chaos strategy will force-detach all EBS volumes from a randomly selected EC2 instance of the given auto scaling group. As a result, EBS disk I/O will fail.</p>
<p>The result was surprising, at least to me.</p>
<p><a href="https://github.com/systemd/systemd/issues/3468">You can read up on the details here</a>. In a nutshell, systemd did <em>not</em> terminate ecs-agent because of XFS. When it detects an error, the Linux XFS driver – by design – won’t unmount the filesystem, in order to keep it consistent (reading or writing data will result in I/O errors). Hence, systemd correctly reported that the device was still mounted, leaving us with a broken system that won’t be able to start new Docker containers…</p>
<h2 id="conclusion">Conclusion</h2>
<p>Unfortunately, we cannot rely on our current health check to correctly detect EBS failures. To make up for this, we probably need to provide our own monitoring daemon for more sophisticated checks. This is something we wanted to avoid because, as the saying goes, the best code is no code at all.</p>
<p>I myself have learned a lot about systemd, EBS, and XFS. Even though our infrastructure isn’t perfect yet, we’re more confident in our system now than we were prior to the experiments. It’s not the first time <a href="/chaos-engineering-101">Chaos Engineering</a> has proved our assumptions wrong.</p>
<p><strong>Still, there’s much to learn from failure.</strong></p>
Simplicity: A Prerequisite for Reliability2016-05-31T00:00:00+02:00https://sharpend.io/simplicity-a-prerequisite-for-reliability<p><img src="/assets/images/simplicity.jpg" alt="" /></p>
<p>I started reading the <a href="https://landing.google.com/sre/book.html">Site Reliability Engineering book</a>, commonly referred to as “the SRE book”. In this collection of essays, members of Google’s SRE team write about how they run their production systems. The book contains a short chapter on a topic near and dear to my engineering heart: <strong>simplicity</strong> and the role it plays in SRE. In this article, I’m going to summarize the lessons from that chapter and add some thoughts of my own.</p>
<h2 id="stability-vs-agility">Stability vs. agility</h2>
<p>SRE is about keeping agility and stability of production systems in balance. Running a reliable service is easy – because it’s frozen. But: running an <em>agile</em> reliable service is hard. The challenge is to change an application that is running correctly, while keeping the service running correctly.</p>
<p>Reliable processes can increase developer agility. Reliable production rollouts, for example, make it easier to link changes to bugs. This in turn allows developers to focus on more important things, such as functionality and performance of their software.</p>
<p>This separation of concerns is, in fact, one of the reasons why cluster managers like Kubernetes exist. Developers can talk to the Kubernetes API to deploy their application containers in a simple and reliable way without having to worry about cluster nodes, host OS/kernel, or underlying hardware.</p>
<h2 id="accidental-vs-essential-complexity">Accidental vs. essential complexity</h2>
<p>Software should behave predictably and accomplish its goals without too many surprises (that is, outages in production). The number of surprises directly correlates with the amount of unnecessary complexity found in a project. It’s therefore crucial to think about accidental complexity and essential complexity:</p>
<blockquote>
<p>Accidental complexity relates to problems which engineers create and can fix, [whereas] essential complexity is caused by the problem to be solved, and nothing can remove it<br />
– Fred Brooks in his seminal “No Silver Bullet” essay</p>
</blockquote>
<p>SRE teams should push back when accidental complexity is introduced into the systems they’re in charge of. Besides, they should <em>continue</em> to eliminate complexity over time.</p>
<p>As a matter of fact, it can be hard to know what is essential complexity and what is accidental complexity. From my experience, you can successfully avoid unnecessary complexity by:</p>
<ul>
<li>Constantly asking yourself what it is you want to accomplish</li>
<li>Thinking deeply about the problem domain</li>
<li>Saying “no” by default</li>
<li>Outsourcing work to service providers, if possible</li>
</ul>
<p>(I’m aware that these points need more elaboration, most likely in the form of another article.)</p>
<h2 id="code-is-a-liability">Code is a liability</h2>
<p>It’s poor practice to comment out unused code, or worse, to gate it with a feature flag. Code that has no purpose is a major source of distraction and confusion. Today’s version control systems make it easy to revert any changes; there’s no reason not to remove dead code and other bloat. <a href="https://gettingreal.37signals.com/ch10_Less_Software.php">Less code means less complexity</a>, which means fewer bugs, which means fewer unexpected outcomes in production. Deleting many lines of code – sometimes hundreds or thousands at once – is indeed very satisfying.</p>
<p>At the same time, we should think twice before adding new features. Once again, this comes down to essential vs. accidental complexity and saying “no” to many things in order to focus on the core problem.</p>
<h2 id="minimal-apis">Minimal APIs</h2>
<p>Writing clear, minimal APIs is key to managing simplicity in software systems. Smaller APIs with fewer methods and arguments are not only easier to understand and test, they also allow us to put more effort into comprehending the actual problem we set out to solve.</p>
<p>As you can see, there’s a pattern involved here: less is more.</p>
<h2 id="modularity-and-decoupling">Modularity and decoupling</h2>
<p>Many concepts of object-oriented programming (OOP) also apply to the design of distributed systems. For instance, both involve breaking problems up into small, manageable components. No surprise then that both benefit from loose coupling – the ability to update parts of a system in isolation – as an effective method for increasing developer agility and system stability. A decoupled system, in particular, reduces the probability of <a href="/unintended-consequences">unintended consequences</a>.</p>
<p>The idea of modularity also extends to APIs. API versioning allows developers to upgrade individual components to a newer version in a controlled way, rather than forcing all teams to upgrade in “big bang” fashion. What appears to be an extra burden at first, actually allows introducing new features and deprecating old ones in a safe manner.</p>
<p>Similar to how all code should have a purpose, any part of a distributed system – microservices, binaries, etc. – should be responsible for solving one particular well-defined problem. Those parts are connected by clearly defined interfaces (API, CLI, etc). This, in fact, is the <a href="https://en.wikipedia.org/wiki/Unix_philosophy">Unix philosophy</a> at work.</p>
<h2 id="small-releases">Small releases</h2>
<p>Nobody likes to review pull requests that are too long. It’s hard to measure the impact of many code changes at a time, even more so if the changes are unrelated. The same is true for releases. It’s difficult to understand the impact (e.g. on performance) when deploying dozens of unrelated changes to a system at the same time.</p>
<p>Simple releases performed in smaller batches of changes are easier to measure – and to revert, if necessary.</p>
<h2 id="conclusion">Conclusion</h2>
<p>I want to end this article, quite appropriately, with one of my favorite quotes:</p>
<blockquote>
<p>Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.<br />
– Kernighan and Plauger in “The Elements of Programming Style”</p>
</blockquote>
<p>The more complex a system, the more difficult it is to build a mental model of the system, and the harder it becomes to operate and debug it. As Edsger W. Dijkstra put it, “Simplicity is prerequisite for reliability”. It’s important to put a lot of thought into simplifying our designs while still providing the required functionality. Systems tend to become more complex over time – the earlier you start simplifying as outlined here, the better off you will be.</p>
Fast Feedback Is Everything2016-05-19T00:00:00+02:00https://sharpend.io/fast-feedback-is-everything<blockquote>
<p>As an engineer, you should constantly work to make your feedback loops shorter in time and/or wider in scope. – <a href="https://twitter.com/KentBeck/status/531964254946328576">@KentBeck</a></p>
</blockquote>
<p>As software engineers, we spend a lot of our time writing code. Whether we are implementing new features, fixing nasty bugs, or doing boring maintenance work, there is always some code we either create from scratch or try to modify for the better. When developing code, we need certainty that our changes work as intended. That’s why we write tests after (or before) the fact. As a result, a huge part of our daily work inevitably comes down to these steps:</p>
<ol>
<li>Write some code.</li>
<li>Write a test that defines the desired behavior.</li>
<li>Run the test to see if it’s successful.</li>
<li>Go back to 1. or 2. if the test fails. Rinse, repeat.</li>
</ol>
<p>(For the sake of this article, let’s ignore the tedious test-first vs. test-last vs. TDD debate.)</p>
<p>In order to write an actual program, you have to run through these steps – this loop – again and again. But here’s the thing: constantly jumping back and forth between programming and testing comes at a price. It doesn’t just slow you down in terms of time spent; the <em>context switching</em> involved also drains your mental energy, which might ultimately destroy your productivity.</p>
<p>For this reason, I believe that the following statement is so important – and I’m not tired of repeating it whenever I get the chance:</p>
<p><strong>When it comes to programming and testing, fast feedback is everything.</strong></p>
<p>Fast feedback involves reducing the time between changing code (or tests) and getting test results to a minimum. The faster this feedback loop, the more productive you will be.</p>
<p>There are a couple things you can do to shorten the feedback loop. Certainly the first technique coming to mind is isolated testing, which involves eliminating (slow) external dependencies like databases. Besides trying to implement faster tests, however, you can also optimize the way you <em>run</em> them.</p>
<h2 id="running-tests-the-fast-way">Running tests the fast way</h2>
<p>At first glance, <em>how</em> you run tests might not seem to have a big impact on the feedback loop. If a test takes a minute to finish, does it really matter if we can shave off a second or two by tweaking the running step?</p>
<p><em>Yes, it does.</em> Seconds add up over time. Each additional step requires a little more brain power and incurs a significant context-switching cost.</p>
<p>I don’t like wasting my time on work that can easily be avoided. If there’s a way to minimize the cost of context switching, I’m more than happy to add it to my toolbox. By following these three steps, I’ve managed to run tests faster and, more importantly, become more productive as a result.</p>
<ol>
<li>
<p><strong>Figure out how to execute individual tests.</strong> During development, don’t run the entire test suite each time you change a bit of code. Aside from the fact that running all tests is often very slow, it’s always better (and faster!) to first get feedback on local code changes before integrating with other code. Reducing the scope by testing a small subset of code in isolation is not only faster, it also helps you find bugs, and it’s a must-have for TDD. (It goes without saying that you or your continuous integration system should run all the tests at some point.)</p>
</li>
<li>
<p><strong>Write a test runner.</strong> This is optional and depends on your test framework/setup. For example, RSpec – one of the best frameworks for testing in Ruby – already allows you to execute a specific test file or even a single test case in that file. Unfortunately, it’s not always that easy. Sometimes you need to execute additional setup/teardown tasks; other times, running tests on a package level may be the best you can do. That’s where a test runner comes in handy. In its most basic form, a test runner is a shell script that takes a single argument – the filename of the test you’re currently working on – and does everything required to run the test. I usually store this script as <code class="language-plaintext highlighter-rouge">script/test</code> in every project where I need it.</p>
</li>
<li>
<p><strong>Run tests using a keyboard shortcut.</strong> For fast feedback, it’s important to not leave your editor while hacking on code. Instead, configure your editor of choice to execute tests when pressing a combination of keys on your keyboard. At a minimum, set up a shortcut to run the test currently open in your editor by passing its filename directly to the respective testing tool – or a custom test runner. It’s also useful to have a shortcut for running the test case under the cursor, which will further narrow the focus of your testing.</p>
</li>
</ol>
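<p>And for step 3, the Vim side of things can be as small as a single mapping (a sketch that assumes a project-local <code class="language-plaintext highlighter-rouge">script/test</code> runner; other editors have equivalent mechanisms):</p>

```vim
" Run the test file open in the current buffer with <leader>t.
" % expands to the current file name when used in a :! command.
nnoremap <leader>t :!script/test %<CR>
```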
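<p>To make step 2 more concrete, here’s a sketch of what a minimal <code class="language-plaintext highlighter-rouge">script/test</code> might look like – the rspec and go commands are placeholders for whatever your project actually needs:</p>

```shell
# Map a single test file to the command that runs it. Set DRY_RUN=1
# to print the command instead of executing it.
run_test() {
  case "$1" in
    *_spec.rb) set -- bundle exec rspec "$1" ;;
    *_test.go) set -- go test "./$(dirname "$1")" ;;
    *) echo "don't know how to run: $1" >&2; return 1 ;;
  esac
  if [ -n "$DRY_RUN" ]; then echo "$@"; else "$@"; fi
}

DRY_RUN=1 run_test spec/models/user_spec.rb
# prints: bundle exec rspec spec/models/user_spec.rb
```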
<h2 id="examples">Examples</h2>
<p>Let me give you three practical examples. All of them are somehow related to infrastructure automation, both because it’s an area where rapid feedback matters all the more and because it’s what I do for a living. You will see that I’m a fan of Vim, but it should be straightforward to achieve the same with other editors as well. Here we go:</p>
<ul>
<li>
<p>rspec-puppet is a test framework that lets you write RSpec tests for Puppet code. When I started working at Jimdo in 2013, it wasn’t possible to run individual tests by simply pointing RSpec at a test file in our codebase. One reason is the unusual way test fixtures are handled in the Puppet world. To remedy this, I wrote a <a href="https://gist.github.com/mlafeldt/dc695074416bec4cb7a3">test runner script</a>. Together with <a href="https://github.com/Jimdo/vim-spec-runner">vim-spec-runner</a>, a Vim plugin that automatically sets up keyboard shortcuts for running tests, we had everything in place for testing our Puppet code at the whim of a keystroke.</p>
</li>
<li>
<p>I primarily developed <a href="https://github.com/mlafeldt/chef-runner">chef-runner</a> for use with Vim. Instead of jumping back and forth between editing a Chef recipe and running the painfully slow <code class="language-plaintext highlighter-rouge">vagrant provision</code> command, I wanted to be able to change code and get immediate feedback without having to leave the editor. chef-runner’s ability to rapidly provision a machine with just a single Chef recipe – the file currently open in Vim – made this possible. There’s no Vim plugin; the setup is as simple as <a href="https://github.com/mlafeldt/chef-runner/wiki/Vim">sticking a one-liner</a> in your Vim configuration.</p>
</li>
<li>
<p>chef-runner used to be a 100-line shell script before I decided to rewrite it in Go. Go comes with first-class testing support. The <code class="language-plaintext highlighter-rouge">go test</code> command is used to run tests (<code class="language-plaintext highlighter-rouge">_test.go</code> files) and report test results. However, the tool itself can only run tests for one or more packages based on their import paths; it cannot handle arbitrary <code class="language-plaintext highlighter-rouge">_test.go</code> files. These days, I use the <code class="language-plaintext highlighter-rouge">:GoTest</code> command from the excellent <a href="https://github.com/fatih/vim-go">vim-go</a> plugin to execute package tests for a specific source file.</p>
</li>
</ul>
<h2 id="wrapping-up">Wrapping up</h2>
<p>Fast feedback plays an important role in software development. Optimizing the way we run tests is one effective method to shorten the feedback loop and get things done.</p>
<p>Sometimes all it takes is a tiny shell script.</p>
<p><em>Acknowledgment: The ideas presented in this article were heavily inspired by the excellent <a href="https://www.destroyallsoftware.com/">Destroy All Software screencasts</a> by Gary Bernhardt.</em></p>
Chaos Monkey for Fun and Profit2016-05-04T00:00:00+02:00https://sharpend.io/chaos-monkey-for-fun-and-profit<p><img src="/assets/images/chaos-monkey.jpg" alt="" /></p>
<p>This article will pick up where <a href="/chaos-engineering-101">Chaos Engineering 101</a> left off and cover a slightly more advanced principle of Chaos Engineering: <strong>automating chaos experiments</strong>.</p>
<p>The core idea of Chaos Engineering, as we recall, is to inject failures proactively in a controlled manner in order to gain confidence in our systems. Chaos Engineering enables us to verify that things behave as we expect – and to fix them if they don’t.</p>
<p>In Chaos Engineering 101, I argued that you don’t need to automate chaos experiments when you’re just getting started. I still think that manual testing, which can be as simple as terminating a process with the <code class="language-plaintext highlighter-rouge">kill</code> command, is the easiest way to get familiar with the concept of fault injection and to gradually establish <a href="/chaos-engineering-a-shift-in-mindset">the right mindset</a>. At the end of the article, I explained how Jimdo runs GameDay events, which are typically based on manual fault injection as well.</p>
<h2 id="the-next-level-of-chaos">The next level of chaos</h2>
<p><a href="https://principlesofchaos.org/">The Principles of Chaos Engineering</a>, as formulated by Netflix, currently list four advanced principles of chaos. The document says:</p>
<blockquote>
<p>The [advanced] principles describe an ideal application of Chaos Engineering […] The degree to which these principles are pursued strongly correlates to the confidence we can have in a distributed system at scale.</p>
</blockquote>
<p>The one principle we’re interested in today is described as follows:</p>
<blockquote>
<p>Running experiments manually is labor-intensive and ultimately unsustainable. Automate experiments and run them continuously. Chaos Engineering builds automation into the system to drive both orchestration and analysis.</p>
</blockquote>
<p><strong>In other words, the Principles suggest automating experiments (that used to be manual) to run continuously in order to further increase confidence in our systems.</strong></p>
<p>Fortunately, Netflix does not only tell us what to do; they also gave us a mighty tool for putting theory into practice: <strong>Chaos Monkey</strong>.</p>
<h2 id="chaos-monkey">Chaos Monkey</h2>
<p>Netflix went the extra mile and built several autonomous agents, so-called “monkeys”, for injecting failures and creating different kinds of outages in an automated manner. Latency Monkey, for example, induces artificial delays in API calls to simulate service degradation, whereas Chaos Gorilla is programmed to take down an entire AWS availability zone. Together these monkeys form the <em>Simian Army</em>.</p>
<p>Chaos Monkey is the most famous member of Netflix’s Simian Army. In fact, it’s the first and, to date, the only monkey of its kind that is publicly available. Broadly speaking, Chaos Monkey randomly terminates EC2 instances in AWS. Here is a more thorough description from <a href="http://techblog.netflix.com/2011/07/netflix-simian-army.html">the Netflix blog</a>:</p>
<blockquote>
<p>[Chaos Monkey is] a tool that randomly disables our production instances to make sure we can survive this common type of failure without any customer impact. The name comes from the idea of unleashing a wild monkey with a weapon in your data center (or cloud region) to randomly shoot down instances and chew through cables – all the while we continue serving our customers without interruption.</p>
</blockquote>
<p>The post continues:</p>
<blockquote>
<p>By running Chaos Monkey in the middle of a business day, in a carefully monitored environment with engineers standing by to address any problems, we can still learn the lessons about the weaknesses of our system, and build automatic recovery mechanisms to deal with them. So next time an instance fails at 3 am on a Sunday, we won’t even notice.</p>
</blockquote>
<p>Next, I’ll show you how to run your very own Chaos Monkey.</p>
<h2 id="the-simian-army---docker-edition">The Simian Army - Docker Edition</h2>
<p>I spent a couple of hours dockerizing the Simian Army, a Java application with dozens of settings, to make it as simple as possible to use Chaos Monkey. The result is <strong><a href="https://github.com/mlafeldt/docker-simianarmy">a highly configurable Docker image</a></strong> which, I hope, provides a sound basis for automating chaos experiments.</p>
<p>As an example, this command will start a Docker container running the Simian Army and instruct Chaos Monkey to consider all auto scaling groups (ASGs) in the given AWS account for termination:</p>
<figure class="highlight"><pre><code class="language-sh" data-lang="sh">docker run <span class="nt">-it</span> <span class="nt">--rm</span> <span class="se">\</span>
<span class="nt">-e</span> <span class="nv">SIMIANARMY_CLIENT_AWS_ACCOUNTKEY</span><span class="o">=</span><span class="nv">$AWS_ACCESS_KEY_ID</span> <span class="se">\</span>
<span class="nt">-e</span> <span class="nv">SIMIANARMY_CLIENT_AWS_SECRETKEY</span><span class="o">=</span><span class="nv">$AWS_SECRET_ACCESS_KEY</span> <span class="se">\</span>
<span class="nt">-e</span> <span class="nv">SIMIANARMY_CLIENT_AWS_REGION</span><span class="o">=</span><span class="nv">$AWS_REGION</span> <span class="se">\</span>
<span class="nt">-e</span> <span class="nv">SIMIANARMY_CALENDAR_ISMONKEYTIME</span><span class="o">=</span><span class="nb">true</span> <span class="se">\</span>
<span class="nt">-e</span> <span class="nv">SIMIANARMY_CHAOS_ASG_ENABLED</span><span class="o">=</span><span class="nb">true</span> <span class="se">\</span>
mlafeldt/simianarmy</code></pre></figure>
<p>This example is safe to run, as Chaos Monkey operates in dry-run mode by default. It’s a good way to get a feel for the application without taking any risk.</p>
<p>The second example is more realistic and could very well be your first chaos experiment to run continuously. This time, Chaos Monkey will randomly terminate instances of the auto scaling groups tagged with a specific key-value pair:</p>
<figure class="highlight"><pre><code class="language-sh" data-lang="sh">docker run <span class="nt">-it</span> <span class="nt">--rm</span> <span class="se">\</span>
<span class="nt">-e</span> <span class="nv">SIMIANARMY_CLIENT_AWS_ACCOUNTKEY</span><span class="o">=</span><span class="nv">$AWS_ACCESS_KEY_ID</span> <span class="se">\</span>
<span class="nt">-e</span> <span class="nv">SIMIANARMY_CLIENT_AWS_SECRETKEY</span><span class="o">=</span><span class="nv">$AWS_SECRET_ACCESS_KEY</span> <span class="se">\</span>
<span class="nt">-e</span> <span class="nv">SIMIANARMY_CLIENT_AWS_REGION</span><span class="o">=</span><span class="nv">$AWS_REGION</span> <span class="se">\</span>
<span class="nt">-e</span> <span class="nv">SIMIANARMY_CALENDAR_ISMONKEYTIME</span><span class="o">=</span><span class="nb">true</span> <span class="se">\</span>
<span class="nt">-e</span> <span class="nv">SIMIANARMY_CHAOS_ASG_ENABLED</span><span class="o">=</span><span class="nb">true</span> <span class="se">\</span>
<span class="nt">-e</span> <span class="nv">SIMIANARMY_CHAOS_ASGTAG_KEY</span><span class="o">=</span>chaos_monkey <span class="se">\</span>
<span class="nt">-e</span> <span class="nv">SIMIANARMY_CHAOS_ASGTAG_VALUE</span><span class="o">=</span><span class="nb">true</span> <span class="se">\</span>
<span class="nt">-e</span> <span class="nv">SIMIANARMY_CHAOS_LEASHED</span><span class="o">=</span><span class="nb">false</span> <span class="se">\</span>
mlafeldt/simianarmy</code></pre></figure>
<p>Note that this command will actually <em>unleash</em> the monkey. But don’t worry: you still need to tag your ASGs accordingly for any instances to be killed.</p>
<p>There are many more configuration settings you can pass to the Docker image, including ones to control frequency, probability, and type of terminations. Also, you can (and should) configure Chaos Monkey to send email notifications about terminations. I encourage you to <a href="https://github.com/mlafeldt/docker-simianarmy#readme">read the documentation</a> to learn more.</p>
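<p>To give a rough idea, here’s a sketch of what such settings might look like. The variable names below follow the image’s <code class="language-plaintext highlighter-rouge">SIMIANARMY_*</code> naming scheme, but treat them as illustrative and verify them against the README before relying on them:</p>
<figure class="highlight"><pre><code class="language-sh" data-lang="sh"># Illustrative settings only -- check the README for exact names.
# Lower the termination probability and enable email notifications.
docker run -it --rm \
  -e SIMIANARMY_CHAOS_ASG_ENABLED=true \
  -e SIMIANARMY_CHAOS_ASG_PROBABILITY=0.5 \
  -e SIMIANARMY_CHAOS_NOTIFICATION_GLOBAL_ENABLED=true \
  -e SIMIANARMY_CHAOS_NOTIFICATION_SOURCEEMAIL=chaos@example.com \
  mlafeldt/simianarmy</code></pre></figure>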
<p>As always, it’s a good idea to <em>start small</em>. I strongly recommend testing Chaos Monkey in a staging environment before unleashing it in production.</p>
<p>This article isn’t meant to be a comprehensive guide on operating Chaos Monkey in production. However, I want to at least mention that <em>observability</em>, through monitoring and other means, is very important when it comes to chaos experiments, even more so when they’re automated. We want to eliminate customer impact as quickly as possible.</p>
<h2 id="manual-vs-automated-fault-injection">Manual vs. automated fault injection</h2>
<p>Now that running Chaos Monkey is only a single command away, should we stop manual testing altogether?</p>
<p>The answer is <em>hell, no!</em></p>
<p>Chaos Monkey is a useful tool to discover weaknesses in your systems caused by various kinds of instance failures, failures that you’d otherwise have to inject manually (or by developing your own tools).</p>
<p>GameDay events, on the other hand, bring the whole team together to think about failure modes and conduct chaos experiments, which is ideal to transfer knowledge and foster a shared mindset. It might also be the only way to test more complex scenarios that are hard or impossible to automate.</p>
<p>That being said, both manual and automated fault injection are valuable – and both have limitations. I will continue to apply these two approaches and share my experience with you.</p>
Bring your tools with you2016-04-19T00:00:00+02:00https://sharpend.io/bring-your-tools-with-you<p><img src="/assets/images/tools.jpg" alt="" /></p>
<p>Last year I joined Jimdo’s Werkzeugschmiede team as an Infrastructure Toolsmith. In brief, our goal is to provide a platform (PaaS) where all Jimdo developers can deploy and run their Dockerized applications with as little friction as possible. I’m impressed with what the team has built so far, and to be honest, also a bit surprised by just how much work goes into building and running such a platform. At the same time, helping the project move forward in the right direction has been a great opportunity for me to learn and experiment with new technologies.</p>
<p>Our platform, which we internally call <em>Wonderland</em>, utilizes Amazon ECS to run Docker containers on a managed cluster of EC2 instances. For the cluster instances, we decided to use CoreOS, a container-focused Linux distribution optimized for large-scale deployments. These are the basic building blocks of our microservices infrastructure. Everything else – from providing a deployment API to integrating with external services for metrics and logging – is our job. (We also have a chat bot called Alice, and of course, it’s not the only reference to Lewis Carroll’s novel in our stack.)</p>
<p>As is usually the case with non-trivial software, there’s always something – be it ever so small – <a href="/unintended-consequences">that doesn’t work as expected</a>. The system is misbehaving in some way and you want to figure out what the heck is going on. In that situation, you instinctively fire up your favorite debugging tools, the ones you trust and are comfortable with, no matter the circumstances. The only problem: the tools you need so desperately might <em>not</em> be available in the environment you’re supposed to scrutinize. Now what?</p>
<p>You might be tempted to run <code class="language-plaintext highlighter-rouge">apt-get install</code> or <code class="language-plaintext highlighter-rouge">yum install</code> to get the programs you want, even on a production server. What could possibly go wrong? Depending on your infrastructure, the answer is either “not much” (after debugging, you throw away the cluster instance and it will be replaced automatically) or “a lot” (you keep using the now tainted system – a unique snowflake among your servers).</p>
<p>With the rise of container technology, and Docker in particular, there’s another appealing option centered around one simple idea: <strong>you bring your tools with you.</strong></p>
<h2 id="coreos-toolbox">CoreOS Toolbox</h2>
<p>As I said, something always goes wrong; our platform is no exception. For example, one time we had to debug an internal Go application that routes all Docker container output from our cluster instances to Papertrail. For some mysterious reason, the application would stop sending any logs out of the blue. (We have since moved on to using the <code class="language-plaintext highlighter-rouge">fluentd</code> logging driver, which turned out to be simpler and more reliable.)</p>
<p>In order to find out whether log data was being sent at all, I checked if <code class="language-plaintext highlighter-rouge">tcpdump</code> or <code class="language-plaintext highlighter-rouge">ngrep</code> were available under CoreOS. Unfortunately, the answer was no. Now I could have installed the missing tools inside the Docker container running the Go application (via <code class="language-plaintext highlighter-rouge">docker exec</code>). However, altering the system you want to observe is generally a bad idea as it may cause certain bugs to disappear or change their behavior (so-called Heisenbugs), making it hard to isolate the actual problem.</p>
<p>That’s one of the reasons why CoreOS prevents you from modifying the data in <code class="language-plaintext highlighter-rouge">/usr</code> by mounting it as read-only. This is where system binaries live and you’re not supposed to mess with them. Given that the root filesystem is writable, you could still install binaries to <code class="language-plaintext highlighter-rouge">/opt/bin</code>, for instance. However, doing so in an ad hoc way without using something like <code class="language-plaintext highlighter-rouge">coreos-cloudinit</code> would add another unique snowflake to your server farm. I’m afraid we’re going round in circles…</p>
<p>Of course, I wouldn’t write this article if there wasn’t some delightful solution: CoreOS comes with a helpful little script called <strong><a href="https://github.com/coreos/toolbox">toolbox</a></strong>, which will launch a container specifically for the purpose of bringing in your favorite command-line tools.</p>
<p>By default, the <code class="language-plaintext highlighter-rouge">toolbox</code> command will give you a container based on <code class="language-plaintext highlighter-rouge">fedora:latest</code>, but you may also use your own custom Docker image that comes with the tools you need. The spawned container will have full system privileges allowing you to inspect anything running on CoreOS – including other containers – with <code class="language-plaintext highlighter-rouge">ngrep</code>, <code class="language-plaintext highlighter-rouge">tcpdump</code>, and friends.</p>
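<p>If you want <code class="language-plaintext highlighter-rouge">toolbox</code> to use your own image, you can configure it via <code class="language-plaintext highlighter-rouge">~/.toolboxrc</code>, which the script sources on startup. A minimal sketch (the image name is a placeholder):</p>
<figure class="highlight"><pre><code class="language-sh" data-lang="sh"># ~/.toolboxrc -- sourced by the toolbox script on startup
TOOLBOX_DOCKER_IMAGE=example/debug-tools
TOOLBOX_DOCKER_TAG=latest</code></pre></figure>
<p>The next invocation of <code class="language-plaintext highlighter-rouge">toolbox</code> will then drop you into a privileged container based on that image, with the host filesystem mounted inside (under <code class="language-plaintext highlighter-rouge">/media/root</code>).</p>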
<h2 id="byot">BYOT</h2>
<p>Toolbox itself might not be the most exciting engineering achievement out there. Still, it’s a good example of the <strong>bring-your-own-tools (BYOT) pattern</strong>. Now that containers are all the rage and many companies are starting to embrace new ways of running services in production, I think it makes sense to also adopt the same technology in other areas like debugging.</p>
<p>I, for my part, like the idea of having my tools at my disposal whenever I need them.</p>
Unintended Consequences2016-04-06T00:00:00+02:00https://sharpend.io/unintended-consequences<p>Here are three seemingly unrelated stories:</p>
<ul>
<li>
<p>Last week, Turkish President Erdoğan demanded the deletion of a German satirical video mocking him for his controversial actions restricting freedom of speech and other human rights in his country. The German government did not comply with his request. In fact, Erdoğan achieved the exact opposite: his reaction to the video inadvertently drew <em>further</em> public attention to it, making it all the more popular in the rest of the world. (I guess the president has never heard of <a href="https://en.wikipedia.org/wiki/Streisand_effect">Barbra Streisand</a>.)</p>
</li>
<li>
<p>Group chat is great for getting quick feedback, managing crises that require immediate attention (such as outages), and sharing funny pictures of cats. At the same time, group chat can lead to mental fatigue, fear of missing out, and an ASAP culture that is all about <em>now</em>. The companies behind popular chat tools like Slack or HipChat are unlikely to highlight these <a href="https://m.signalvnoise.com/is-group-chat-making-you-sweat/">negative consequences</a>.</p>
</li>
<li>
<p>Viagra was originally developed to lower blood pressure. Nearly any drug has side effects. Some of them are negative, others are beneficial. Viagra, the little blue pill, turned out to be the first effective treatment against erectile dysfunction – a totally unexpected outcome.</p>
</li>
</ul>
<p>What do these three stories have in common? They are examples of <strong>unintended consequences</strong>.</p>
<h2 id="not-what-you-thought">Not what you thought</h2>
<p>Unintended consequences, sometimes also called unanticipated consequences, are outcomes that are not the ones foreseen or intended by a purposeful action. In other words, you do X to achieve Y, but what you get instead, or in addition, is Z.</p>
<p><a href="https://en.wikipedia.org/wiki/Unintended_consequences">There are three types</a> of unintended consequences, and I already gave you one example for each type in the stories above:</p>
<ul>
<li><em>Unexpected benefit</em>: An unforeseen positive outcome occurring alongside (or instead of) the intended one. Viagra falls into this category. Its surprising discovery can be attributed in large part to luck.</li>
<li><em>Unexpected drawback</em>: A negative, unexpected disadvantage occurring in addition to the desired effect. Group chat, while being a great invention, also suffers from unintended consequences such as mental exhaustion.</li>
<li><em>Perverse result</em>: An effect contrary to what was originally intended. Erdoğan tried to wipe an unpleasant video from the internet – and it backfired big time.</li>
</ul>
<p>Now you might wonder how any of this fits into the overall theme of running production systems. Trust me, it matters more than you might think – but read on.</p>
<h2 id="the-challenge-of-web-systems">The challenge of web systems</h2>
<p>In web systems, or any complex software system for that matter, we have to deal with unintended consequences all the time. While we do enjoy unexpected benefits from time to time – fixing bug X accidentally solves problem Y as well – it’s the unexpected drawbacks that are more apparent, and thus more interesting. In web systems, outages are probably the most common manifestation of unintended consequences.</p>
<p><strong>The majority of outages are self-inflicted.</strong> At some point, someone did something and it had an unintended consequence. You push a bad configuration or deploy a buggy Docker image and all of a sudden the website goes down. It has happened to all of us. I, for one, have certainly caused my fair share of outages, despite being a very careful person.</p>
<p>But why is it that deliberate changes to web systems will often have negative consequences?</p>
<p>The challenge is that web systems are inherently complex. These systems are composed of many moving parts that work together to form an intricate whole. There’s a high rate of change and often a variety of processes leading to those changes. This makes it hard – if not impossible – to fully understand how all the bits and pieces resonate with each other under different conditions. Put another way, <a href="https://queue.acm.org/detail.cfm?id=2353017">web systems are largely intractable</a>, which is a major reason why outages – be they self-inflicted or not – are both unavoidable and unpredictable.</p>
<p>(Then there’s also the butterfly effect, which says that small causes can have large effects, but I won’t go into that here.)</p>
<h2 id="decouple-all-the-things">Decouple all the things</h2>
<p>So the trouble is that we’re thinking about a change we’re going to make, but we don’t necessarily anticipate the negative consequences it might have on the system as a whole. In software engineering, there’s a term for this: <em>coupling</em>. We fail to anticipate difficulties because we don’t think about how coupled a piece of software is to the rest of the system.</p>
<p>If we want to build reliable systems – systems that minimize the risk of self-inflicted outages – we have to remove the coupling. <strong>We have to decouple all the things.</strong></p>
<p>A decoupled system allows changes to be made to any one component without having an effect on any other component. By isolating each individual piece, we no longer have to keep all these complicated models in our head. Instead, we only have to know the internals of the one component we want to modify (and the interfaces it uses). This in turn reduces the probability of unintended consequences.</p>
<h2 id="example-kubernetes">Example: Kubernetes</h2>
<p>A prime example of decoupling at work is <a href="http://kubernetes.io/">Kubernetes</a>, the container cluster manager from Google.</p>
<p>Kubernetes makes it easy to build reliable distributed systems by enabling people to create <em>decoupled distributed systems</em>. Developers can, for example, talk to the Kubernetes API to deploy their application containers, and they can do so without having to worry about cluster nodes, host OS/kernel, or underlying hardware. This way, Kubernetes allows us to decouple operations and separate concerns in terms of teams.</p>
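<p>To make this concrete, here’s a sketch of what such a deployment might look like from a developer’s point of view – the image name is a placeholder, and the exact flags depend on your <code class="language-plaintext highlighter-rouge">kubectl</code> version:</p>
<figure class="highlight"><pre><code class="language-sh" data-lang="sh"># Run three replicas of an application container somewhere on the
# cluster -- no need to know (or care about) the underlying nodes.
kubectl run my-app --image=example/my-app:1.0 --replicas=3

# Expose the replicas behind a single stable endpoint.
kubectl expose deployment my-app --port=80</code></pre></figure>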
<p><strong>Kubernetes is worth a look if you want fewer outages and other surprises in your ops life.</strong></p>
<p>For more on this topic, I highly recommend watching <a href="https://www.oreilly.com/learning/how-container-clusters-like-kubernetes-change-operations">this presentation by Brendan Burns</a>. He inspired me to think about unintended consequences and write about them here.</p>
Complacency: The Enemy of Resilience2016-03-22T00:00:00+01:00https://sharpend.io/complacency-the-enemy-of-resilience<p><img src="/assets/images/complacency.jpg" alt="" /></p>
<p>Being on-call can be a nerve-wracking experience. Something might break any minute. It’s unpredictable. Yet, you’re in charge when a server explodes (and they like to explode in the middle of the night). Knowing this might freak you out. At least that’s what happened to me when I was on-call for the first couple of times after joining Jimdo. As a bloody beginner, I was afraid of what <em>might</em> happen. I was afraid of that buzzing alert sound to ever go off. Would I be able to handle the situation with the little training I had? All I wanted was to survive whatever on-call shift I was on. It stressed me out.</p>
<p>Before moving to the cloud, every engineer who participated in our on-call rotation had to deal with a relatively large number of incidents over time – most of them minor in severity but high on the annoyance scale (<em>lighttpd backend overloaded</em> anyone?). Due to the nature of our architecture, we often experienced cascading failures – failures that spread from one part of the system to another, leading to a good deal of alerts for us to handle. Once a server was in production, we tried to touch it as little as possible – unless something was broken, of course. And things did break all the time.</p>
<p>That was three years ago. A lot has changed since then. We finished migrating our core infrastructure from bare metal to the cloud. AWS is doing an amazing job when it comes to reliability. This and the fact that we had to redesign our software for the cloud – and yes, we no longer use lighttpd – led us to where we are today: outages are rare, our customers are happier, and we get a lot more sleep.</p>
<p>So all is good now, right?</p>
<h2 id="when-failure-is-the-rule">When failure is the rule</h2>
<p>It is beyond question that having too many alerts – and fighting too many fires at the same time – is a real problem, as it might lead to <a href="https://en.wikipedia.org/wiki/Alarm_fatigue">alert fatigue</a>, sleep deprivation, and a good amount of frustration. In other words, plenty of stressful situations that used to keep us busy and distracted us from our actual work.</p>
<p>However, the fact that we were dealing with failure on a regular basis also had some advantages:</p>
<ul>
<li>Everyone knew, for the most part, what to do when shit hit the fan. We all had experience with the most common failure modes. Like a well-oiled machine, we followed pretty much the same procedures when it came to incident communication, documentation, escalation, etc.</li>
<li>Due to the sheer amount of alerts, we knew that our monitoring and alerting systems worked in principle (leaving aside the fact that some alerts were questionable). These days, alerts are so rare that I sometimes wonder if PagerDuty is broken.</li>
<li>Last, but most importantly, we never believed that our system was particularly resilient to failures. On the contrary, we were always wary of what might happen next.</li>
</ul>
<h2 id="when-failure-is-the-exception">When failure is the exception</h2>
<p>On the other end of the spectrum, when failure is the exception rather than the rule, you risk losing every single advantage I outlined above. As John Allspaw put it in his article, <a href="https://queue.acm.org/detail.cfm?id=2353017">Fault Injection in Production</a>:</p>
<blockquote>
<p>If a system has a period of little or no degradation, there is a real risk of it drifting toward failure on multiple levels, because engineers can become convinced – falsely – that the system is experiencing no problems because it is inherently safe.</p>
</blockquote>
<p>Just because there are no problems now doesn’t mean it’s going to stay that way. Sooner or later, any complex system will fail. It’s inevitable. That’s why you should never get too comfortable.</p>
<p>There’s a word for this very condition: <em>complacency</em>. Merriam-Webster <a href="http://www.merriam-webster.com/dictionary/complacency">defines complacency</a> as “self-satisfaction especially when accompanied by <strong>unawareness of actual dangers or deficiencies</strong>”.</p>
<p>Complacency is the enemy of resilience. The longer you wait for disaster to strike in production – merely hoping that everything will be okay – the less likely you are to handle emergencies well, both at a technical and organizational level.</p>
<h2 id="how-to-fight-complacency">How to fight complacency</h2>
<p>Ultimately, our goal is to be aware of actual dangers and to prepare for them accordingly. Learning from outages after the fact is important (and that’s what postmortems are for), but it shouldn’t be the only method for acquiring operational knowledge. Allspaw agrees, as he continues:</p>
<blockquote>
<p>[…] building resilient systems requires experience with failure, and that we want to anticipate and confirm our expectations surrounding failure <em>more</em> often, not <em>less</em> often. Shying away from the effects of failure in a misguided attempt to reduce risk will result in poor designs, stale recovery skills, and a false sense of safety.</p>
</blockquote>
<p>This observation may come as a surprise for those who believe that systems should never fail, but waiting for things to break in production is <em>not</em> an option. We should rather inject failures proactively in a controlled manner in order to prepare for the worst and gain confidence in our systems. This is the core idea behind <a href="/chaos-engineering-101">Chaos Engineering</a> and related practices like GameDay exercises. Again, Allspaw summed it up best:</p>
<blockquote>
<p>[…] failure-inducing exercises can serve as “vaccines” to improve the safety of a system—a small amount of failure injected to help the system learn to recover. It also keeps a concern about failure alive in the culture of engineering teams, and it keeps complacency at bay.</p>
</blockquote>
<p>These days, we’re running GameDay exercises periodically at Jimdo – making sure we never get too comfortable. We know that, once again, <a href="/chaos-engineering-a-shift-in-mindset">embracing failure</a> is key to building resilient systems.</p>
How to succeed at infrastructure automation2016-03-07T00:00:00+01:00https://sharpend.io/how-to-succeed-at-infrastructure-automation<p>I care a lot about infrastructure automation and the art of turning infrastructure into code. I find pleasure in using <em>and</em> developing build, test, and deployment systems. It’s what I get paid to do every day and <a href="/infrastructure-automation-by-example">can’t stop doing after work</a>. In a sense, I’m obsessed with automation.</p>
<p>For an engineer like me, the questions of <em>what</em> to automate and <em>how</em> to go about it are of particular interest. Alas, I don’t always have the right answer to those questions. That’s when things go wrong and mistakes happen.</p>
<p>Fortunately, mistakes are also a wonderful opportunity to learn – and to eventually succeed.</p>
<h2 id="screwing-up">Screwing up</h2>
<p>I’ve been automating tasks of one kind or another for about a decade now. It goes without saying that I made some rookie mistakes along the way. I don’t mean blunders like pushing buggy code or executing commands in the wrong environment (oops). These things are unavoidable. What I mean are more fundamental problems that go beyond mere technical matters.</p>
<p>Let me share three personal anecdotes about screwing up in one way or another:</p>
<p>I once spent three full days writing automated tests for a shell script that performs backups. I wanted to change the script – a beast of hundreds of lines of messy code – without breaking the backup process. After wasting a couple of hours trying to tame the test framework to do what I wanted it to do (how hard could it be?), I already knew that this wasn’t going to end well. While I eventually managed to write the tests, they ended up being very brittle and verbose, adding little confidence. Even worse – and hindsight bias aside – at no time did those three days feel like adding any value. It was just my ego pushing me.</p>
<p>The first configuration management system <a href="/blog/learning-chef">I learned to use</a> – and still the one I like the most – is Chef. I have a thing for Chef because it’s based on Ruby, which happens to be one of my favorite programming languages. When I started working at Jimdo in 2013, however, I suddenly had to use Puppet. Instead of coming to grips with Puppet and accepting the situation as it was, I spent my first weeks at the new gig ranting about how bad this Puppet thing is and how Chef would magically solve all problems. Of course, this didn’t change anything. I knew that sooner or later I had to learn Puppet if I wanted to work on Jimdo’s infrastructure.</p>
<p>The last mistake is the worst of the three. <em>I failed to deliver</em> because I didn’t care enough about a project as a whole. The project’s goal was to add proper monitoring to our cron jobs. For this, we developed <a href="https://github.com/Jimdo/periodicnoise">a tool in Go</a> that would allow us to wrap cron jobs and send results to Nagios. It worked out pretty well. The problem: building that tool was only a small but by far the most interesting part of the project. Rather than wrapping up the remaining tasks, I was too busy learning more about the shiny technologies we explored at that time (Go and AWS). This isn’t the full story; suffice it to say I wasn’t part of the team that completed the project later on.</p>
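<p>For illustration, the basic pattern looks like this in a crontab – the wrapper binary and flags shown here are made up for the example, not the actual interface of our tool:</p>
<figure class="highlight"><pre><code class="language-sh" data-lang="sh"># Before: failures go unnoticed unless someone reads root's mail.
0 3 * * *  /usr/local/bin/backup-database

# After: a (hypothetical) wrapper reports start, success, and
# failure of the job to the monitoring system.
0 3 * * *  cronwrap --timeout 2h /usr/local/bin/backup-database</code></pre></figure>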
<h2 id="the-three-rules-of-infrastructure-automation">The three rules of infrastructure automation</h2>
<p>It goes without saying that I don’t want to repeat any of those mistakes. To achieve that, I’ve turned the lessons I learned over the years into three simple rules. I call them, appropriately, <em>The three rules of infrastructure automation</em>. I’m convinced that by following these rules, I’ve been able to improve my work and become a more successful engineer.</p>
<p>The three rules are:</p>
<ol>
<li>
<p><strong>Don’t blindly automate all the things.</strong> Take a step back and evaluate if a task is actually worth the effort. <em>Feel the pain</em> before doing something about it. Ignore the problem for a while; it might not be that big of an issue. If you still decide to invest in automation, you will know exactly where the pain points are. Ask for feedback when you’re stuck. Always be willing to adjust and, more importantly, have the courage to stop what you’re doing and move on.</p>
</li>
<li>
<p><strong>Treat tools as what they are: a means to an end.</strong> Don’t fight over tools or programming languages. Use whatever works for you or your company. In the end, it’s all about creating (business) value. More often than not, tools are not the reason why we fail to reach our goals. Before complaining about software, remember that certain design decisions probably made sense to the people at the time they built it. Be open to learning new things.</p>
</li>
<li>
<p><strong>Take ownership of your work.</strong> First and foremost, do the work and deliver what you promise (while adhering to the first two rules). Get both the interesting <em>and</em> the boring tasks done. Yes, automation isn’t always fun. That’s just how it is. Accept it and move forward. Taking ownership also means continuing to care. Fix bugs. Help others who rely on your work. Don’t automate and forget.</p>
</li>
</ol>
<p>I’ve learned to live my professional life by the above rules. They provide me with the right mental framework for approaching automation, if not programming in general. These rules were and continue to be a tremendous help to me. They may help you too.</p>
Chaos Engineering: A Shift in Mindset2016-02-23T00:00:00+01:00https://sharpend.io/chaos-engineering-a-shift-in-mindset<p><img src="/assets/images/chaos-engineering-mindset.jpg" alt="" /></p>
<p>Last time <a href="/chaos-engineering-101">I introduced you to Chaos Engineering</a>, a discipline based on the idea that proactively triggering failures in a system is the best way to prepare for disaster.</p>
<p>My goal was to show you how to get started running your own chaos experiments, mostly on a technical level. I left out some things by necessity. For example, I only briefly mentioned that it would be a good idea to introduce a company to the concept of Chaos Engineering by starting small, rather than wreaking havoc on production from the get-go. I took it as given that chaos experiments are endorsed by everyone. Of course, that’s not the case.</p>
<p>Chaos Engineering, or engineering in general, doesn’t happen in a vacuum. There are people involved. Chaos Engineering isn’t merely a technical solution to the problem of building resilient systems. For it to work at all, it requires a fundamental <em>shift in the mindset</em> of managers and employees, if not whole companies.</p>
<h2 id="failure-is-a-hard-sell">Failure is a hard sell</h2>
<p>For many companies, failure is unacceptable. Management considers outages, big or small, to be intolerable. They believe that systems should never fail. And they spend a lot of money in trying to make this dream a reality (while never getting there). Most of their managers, especially those of non-technical departments, will argue that chaos experiments offer little benefit and carry a substantial risk.</p>
<p>In these companies, people are more likely to cover up a mistake than deal with it openly because they’re afraid of being punished for failure. This blame culture is seriously toxic. Not only does it hurt morale, it also stifles collaboration and innovation, depriving teams of the opportunity to learn from mistakes together.</p>
<p>And it’s not just managers. Engineers who try to avoid emergencies altogether also find it difficult to stand by as their systems break in ways they can’t possibly imagine. After all, a deployment pipeline running automated tests ought to be enough, right?</p>
<p>Introducing Chaos Engineering – forcing systems to fail – seems to be impossible in such circumstances. Indeed, it would take more than a technological shift. It would need <em>a change in culture.</em></p>
<h2 id="learning-to-embrace-failure">Learning to embrace failure</h2>
<p>Luckily, there are other companies, like Etsy or Stripe, that have learned to embrace failure. They realize that failure is unavoidable, whether we’re developing software or managing people or doing something else entirely. They understand that the key to building resilient systems is to accept that failure is a part of life. Unsurprisingly, these companies are the first to adopt Chaos Engineering or similar practices, and therefore are better equipped to manage outages and other surprises once they <em>do</em> happen.</p>
<p>I know this because I’m privileged to be working in <a href="https://www.jimdo.com/">a place</a> where failure isn’t a disaster, but a learning experience. I wouldn’t go so far as to say that failure is part of our company culture – not yet – but engineers are trusted to do the right thing, with plenty of room for (chaos) experiments. Besides trying to anticipate where something might break in the future, we’re also having blameless postmortems on outages and accidents as a means to learn from past events.</p>
<p>In fact, it happened more than once that I was on-call when something bad happened which I eventually had to escalate to my boss. In the end, we always fixed problems by working together as a team. The inevitable postmortems that followed helped us understand how accidents actually happened – without blaming individuals – and how we might better prepare for the future.</p>
<h2 id="how-to-make-the-case-for-chaos-experiments">How to make the case for chaos experiments</h2>
<p>Again, I know I’m in a privileged position. You might not be so lucky. The challenge is to convince people – the <em>right</em> people – of the value of Chaos Engineering. You want them to see how much there is to learn from failure and that the benefits, such as increased confidence in the system, outweigh the costs.</p>
<p>If you’re having a hard time selling the idea of Chaos Engineering, here are a couple of things that may or may not work for you:</p>
<ul>
<li>Look out for outages and other failures. They can give you the ammunition you need to get the approval for some smaller chaos experiments.</li>
<li>Try to find a sponsor, someone in power who understands the value of resilience testing and is willing to promote it throughout the company.</li>
<li>Suggest experimenting in a non-production environment first (as mentioned before). Even though there are downsides to this – some behaviors only show up in production, and repairing a broken test environment is often a lot of work – you can expect less pushback.</li>
<li>Make your systems fault-tolerant to the best of your ability <em>before</em> running chaos experiments. I hope this is obvious.</li>
<li>Mitigate risk as much as possible and communicate your efforts. Make sure to carefully review experiments, monitor them closely, and have experts at hand who can revert changes should things go wrong.</li>
<li>Keep stakeholders in the loop. Announce experiments early enough. They can be risky and disruptive. You don’t want to lose trust before having a real chance to earn it.</li>
<li>Share insights with other teams. Help them get started with their own experiments. More teams are likely to join once they see the results.</li>
<li>If successful, <em>tell everyone</em> how Chaos Engineering was instrumental in improving resilience. Advertise it as much as you can!</li>
</ul>
<p>I hope this helps.</p>
<p>Building resilient systems is an ongoing, iterative process. Making a compelling case for Chaos Engineering is a good first step in changing the prevailing mindset that failure is a bad thing.</p>
<h2 id="further-reading">Further reading</h2>
<p>This piece was, to a great extent, inspired by these ACM articles:</p>
<ul>
<li><a href="https://queue.acm.org/detail.cfm?id=2371297">Resilience Engineering: Learning to Embrace Failure</a> – a discussion with Jesse Robbins, Kripa Krishnan, John Allspaw, and Tom Limoncelli</li>
<li><a href="https://queue.acm.org/detail.cfm?id=2499552">The Antifragile Organization</a> by Ariel Tseitlin</li>
<li><a href="https://queue.acm.org/detail.cfm?id=2371516">Weathering the Unexpected</a> by Kripa Krishnan</li>
<li><a href="https://queue.acm.org/detail.cfm?id=2353017">Fault Injection in Production</a> by John Allspaw</li>
</ul>
<p>I strongly recommend reading all of them if you want to dig deeper.</p>
Chaos Engineering 1012016-02-10T00:00:00+01:00https://sharpend.io/chaos-engineering-101<p><img src="/assets/images/chaos-engineering-101.jpg" alt="" /></p>
<p>Say you’re developing a new web application – the next great thing everybody has been waiting for. After all the work you’ve done, it’s time to finally launch the service to the first customers. Now the hard question:</p>
<p><strong>When do you know that the application is ready for production? More specifically, how can you be sure that the (distributed) system you’ve built is resilient enough to survive use in production?</strong></p>
<p>The truth is: you can never be sure. You don’t know what’s going to happen. There will always be something that can – <a href="https://github.com/danluu/post-mortems">and will</a> – go wrong, from self-inflicted outages caused by bad configuration pushes or buggy images to events that are outside your control like denial-of-service attacks or network failures. No matter how hard you try, you can’t build perfect software (or hardware, for that matter). Nor can the companies you depend on.</p>
<p>We live in an imperfect world. That’s just how it is. Accept it and focus on the things you can control: creating a quality product that is <em>resilient to failures</em>. Build software that is able to cope with both expected and unexpected events; gracefully degrade whenever necessary. As the saying goes, “Hope for the best and prepare for the worst”.</p>
<p>But how? How can you make sure you’re ready for disaster?</p>
<p>The first thing you need to do is to identify problems that could arise in production. Only then will you be able to address systemic weaknesses and make your systems fault-tolerant.</p>
<p>This is where <strong>Chaos Engineering</strong> comes in.</p>
<h2 id="principles-of-chaos-engineering">Principles of Chaos Engineering</h2>
<p>Rather than waiting for things to break in production at the worst time, the core idea of Chaos Engineering is to proactively inject failures in order to be prepared when disaster strikes.</p>
<p>Netflix, a pioneer in the field of automated failure testing (and, by the way, also a great video-streaming service), <a href="https://principlesofchaos.org/">defines Chaos Engineering</a> as follows:</p>
<blockquote>
<p>Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.</p>
</blockquote>
<p>As a Chaos Engineer, you test a system’s ability to survive failures by simulating potential errors (aka failure modes) in a series of controlled experiments. These experiments typically consist of four steps:</p>
<ol>
<li>Define the system’s normal behavior – its “steady state” – based on measurable output like overall throughput, error rates, latency, etc.</li>
<li>Hypothesize about the steady state behavior of an experimental group, as compared to a stable control group.</li>
<li>Expose the experimental group to simulated real-world events such as server crashes, malformed responses, or traffic spikes.</li>
<li>Test the hypothesis by comparing the steady state of the control group and the experimental group. The smaller the differences, the more confidence we have that the system is resilient.</li>
</ol>
<p>Or, to put it in less scientific terms: <strong>intentionally break things, compare measured with expected impact, and correct any problems uncovered this way.</strong></p>
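<p>To make the loop concrete, here is a minimal sketch in Ruby of step 4 – comparing the experimental group against the control group. All metrics, numbers, and the 5% tolerance are invented for illustration:</p>

```ruby
# Hypothetical steady-state metrics; all numbers are invented.
SteadyState = Struct.new(:throughput, :error_rate)

# 1. Measure the system's normal behavior (the control group).
control = SteadyState.new(1000, 0.01)

# 2.+3. Hypothesize that the steady state won't change much, then expose
# the experimental group to a simulated event and measure again.
experimental = SteadyState.new(970, 0.0102)

# 4. Compare both groups; the smaller the difference, the more confidence.
TOLERANCE = 0.05 # accept up to 5% deviation from the control group

throughput_ok = (control.throughput - experimental.throughput).abs <=
                control.throughput * TOLERANCE
errors_ok = experimental.error_rate <= control.error_rate * (1 + TOLERANCE)

puts(throughput_ok && errors_ok ? "hypothesis holds" : "weakness uncovered")
```

<p>In a real experiment, the two structs would of course be filled from your monitoring system rather than hard-coded.</p>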
<p>As an example, let’s say you want to know what happens if, for some reason, your MySQL database isn’t available. You hypothesize that, in this case, your web application would stop serving requests, immediately returning an error instead. To simulate the event, you block access to the database server. Afterwards, however, you observe that the app seems to take forever to respond. After some investigation, you find the cause – a misconfigured timeout – and fix it in a matter of minutes.</p>
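<p>The fix in that story amounts to bounding the wait. A contrived Ruby sketch of the same failure mode – the “database call” just sleeps, and the timeout values are made up:</p>

```ruby
require "timeout"

# Stand-in for a database call that hangs because the server is
# unreachable and no connect/read timeout is configured.
def query_database
  sleep 60 # "forever" from the user's point of view
end

# With an explicit upper bound, the app fails fast and can return a
# proper error instead of appearing frozen.
response =
  begin
    Timeout.timeout(0.1) { query_database }
    "200 OK"
  rescue Timeout::Error
    "503 Service Unavailable"
  end

puts response # prints "503 Service Unavailable"
```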
<p>As this example demonstrates, Chaos Engineering makes for effective resilience testing. Besides, it’s a ton of fun – but don’t just take my word for it; read on.</p>
<h2 id="how-to-get-started">How to get started</h2>
<p>Netflix went the extra mile and built several autonomous agents, so-called “monkeys”, for injecting failures and creating different kinds of outages. For example, Chaos Monkey randomly terminates virtual machines, Latency Monkey induces artificial delays in API calls to simulate service degradation, and Chaos Gorilla is programmed to take down an entire datacenter. Together they form the <a href="https://queue.acm.org/detail.cfm?id=2499552">Simian Army</a>.</p>
<p>As novel as the Simian Army may be, you don’t need continuously running, automated experiments when you’re just getting started. It’s best to introduce a company to Chaos Engineering by starting small.</p>
<p>So rather than wreaking havoc on your production system from day one, start by experimenting in an isolated staging environment (if you don’t have a pre-production environment yet, now would be the perfect time to create one). While the two environments are likely to be different in more or less subtle ways, any resilience testing is better than no resilience testing. Later, when you feel more confident, conduct some of the experiments – or preferably all of them – in production. Remember: Chaos Engineering is focused on <em>controlled failure-injection.</em> You make the rules!</p>
<p>The purpose of our chaos experiments is to simulate disaster conditions. This might sound like a difficult task – and yes, it does require a lot of creativity – but in the beginning it’s easiest to focus on <em>availability</em>, or rather the lack of it: inject failures so that certain pieces of your infrastructure become unavailable. Intentionally terminate cluster machines, kill worker processes, delete database tables, cut off access to internal and external services, etc. This way you can learn a lot about the coupling of your system and discover subtle dependencies you would otherwise overlook.</p>
<p>Later on you might want to simulate other events capable of disrupting steady state, like high latency caused by slow network performance. These experiments are generally harder to pull off and often require special tooling, but the takeaways are worth the extra effort.</p>
<p>Whatever you decide to do, you’ll be surprised how much you can learn from chaos.</p>
<h2 id="plan-execute-measure-adjust">Plan, execute, measure, adjust</h2>
<p>Briefly, here are the steps involved in conducting chaos experiments. This list is based on my own experience participating in <a href="https://queue.acm.org/detail.cfm?id=2371297">GameDay events</a> at Jimdo.</p>
<ol>
<li>Start by planning the experiments. Compile a list of potential failure modes, how you want to simulate them, and what you think the impact will be. I recommend using a spreadsheet.</li>
<li>Pick a date. Inform stakeholders of affected systems, especially if you anticipate any trouble for customers.</li>
<li>Gather the whole team in front of a big screen and go through the experiments together. This is the best way to transfer knowledge and develop a shared mindset.</li>
<li>After each experiment, write down the actual measured impact.</li>
<li>For each discovered flaw, put together a list of countermeasures. Don’t implement them right away! Add any follow-up items to your issue tracker.</li>
</ol>
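<p>The spreadsheet from step 1 really is just a table of failure modes and impacts. A hypothetical sketch of it as data, in Ruby (all failure modes and impacts are invented):</p>

```ruby
# One row per planned experiment, mirroring the spreadsheet columns.
Experiment = Struct.new(:failure_mode, :simulation, :expected_impact,
                        :measured_impact)

gameday = [
  Experiment.new("primary DB down", "stop mysqld on db1",
                 "app serves error page"),
  Experiment.new("cache unavailable", "block port 6379",
                 "higher latency, no errors"),
]

# Step 4: after each experiment, record what actually happened.
gameday[0].measured_impact = "app hangs for 60 seconds" # surprise!
gameday[1].measured_impact = "higher latency, no errors"

# Step 5: everything that didn't match expectations needs follow-up.
followups = gameday.reject { |e| e.measured_impact == e.expected_impact }
followups.each { |e| puts "follow up on: #{e.failure_mode}" }
```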
<p>Make sure to repeat the experiments on a regular basis (at least once every quarter) to detect regressions as well as new problems. Don’t forget to bring your spreadsheet.</p>
<hr />
<p>I hope you enjoyed this introduction to Chaos Engineering – a powerful, if somewhat radical, approach to building resilient systems.</p>
<p>As always, the best way to internalize new concepts is by practice. Therefore, start running your own chaos experiments! It’s well worth it.</p>
<p><em>Update: Since writing this article, my understanding of Chaos Engineering has evolved. Read <a href="/the-discipline-of-chaos-engineering">The Discipline of Chaos Engineering</a> for a more recent take on the topic.</em></p>
Embedding Assets in Go (Codeship)2015-11-05T00:00:00+01:00https://sharpend.io/embedding-assets-in-go<p><img src="/assets/images/gopher-biplane.jpg" alt="Image by Renee French" /></p>
<p>I wrote another article for Codeship: <strong><a href="https://blog.codeship.com/embedding-assets-in-go/">Embedding assets in Go</a></strong></p>
<p>If the story sounds familiar, it’s because the article is based on a <a href="/blog/embedding-assets-in-go">blog post</a> of mine with the same title. I added lots of content, almost doubling the number of words in the process. I also tried to improve the overall reading experience by fixing passages I considered good writing back when I conceived the original piece. (Turns out revising old posts is <em>very</em> similar to refactoring legacy code in the amount of pain you will experience.)</p>
<p>I hope you will enjoy my latest Codeship article.</p>
<p>Have a different or better approach? Reach out to me <a href="https://twitter.com/mlafeldt">on Twitter</a>.</p>
Using Docker to Build Debian Packages (Codeship)2015-08-18T00:00:00+02:00https://sharpend.io/using-docker-to-build-debian-packages<p>TL;DR: <a href="https://blog.codeship.com/using-docker-build-debian-packages/">I wrote an article</a> for the Codeship blog.</p>
<p>About two months ago, <a href="https://twitter.com/manualwise">Manuel Weiss</a> asked me if I would be interested in writing original content for the <a href="http://blog.codeship.com/">Codeship blog</a>. Manuel is one of Codeship’s co-founders; I felt honored that he reached out to me. I was thrilled by the opportunity to guest post on a well-known tech blog, which is something that aligns well with <a href="/blog/writing-goals/">my writing goals</a>. (To be honest, getting paid for my work certainly didn’t hurt either.)</p>
<p>Now all I needed was an engaging topic for the blog’s audience. <a href="https://www.docker.com/">Docker</a> is all the rage these days and I already had <a href="/blog/rethinking-package-building-at-jimdo/">a lot of experience</a> in using it for building Debian packages. And so, unsurprisingly, I decided to write about that. I crafted a short outline and pitched it to their editor. They accepted my idea and I was off writing…</p>
<p>Head over to the Codeship blog and read the final article, <strong><a href="https://blog.codeship.com/using-docker-build-debian-packages/">Using Docker to build Debian packages</a></strong>, to learn how and why I developed a build system for Debian packages at Jimdo.</p>
<p>PS: While I finished the article in time, I made my life much harder than it had to be. Let me give you this piece of advice: working on an article with a deadline and moving to another home should <em>not</em> be done at the same time.</p>
<h2 id="update-reactions">Update: Reactions</h2>
<p>Looks like my first article for Codeship has been well received. It was featured in <a href="http://goto.docker.com/Docker-Weekly-08192015.html">Docker Weekly</a> and <a href="http://email.changelog.com/t/t-E3FCCD1BFEFBDDCD">Changelog Weekly</a>. People also said some nice things about it on Twitter:</p>
<blockquote class="twitter-tweet" data-cards="hidden" lang="en"><p lang="en" dir="ltr">"Using Docker to Build Debian Packages" - via <a href="https://twitter.com/mlafeldt">@mlafeldt</a>. Thank you, Mathias! <a href="http://t.co/UYJsyH1GF7">http://t.co/UYJsyH1GF7</a> <a href="http://t.co/MkOgj3OoEj">pic.twitter.com/MkOgj3OoEj</a></p>— Codeship (@codeship) <a href="https://twitter.com/codeship/status/633666084105768961">August 18, 2015</a></blockquote>
<blockquote class="twitter-tweet" data-cards="hidden" lang="en"><p lang="en" dir="ltr">Proud to have <a href="https://twitter.com/mlafeldt">@mlafeldt</a> publish a post for our blog. I love his writing! "Using <a href="https://twitter.com/docker">@Docker</a> to Build Debian Packages" <a href="http://t.co/leo2g45q3h">http://t.co/leo2g45q3h</a></p>— Manuel Weiss (@manualwise) <a href="https://twitter.com/manualwise/status/633686998235107329">August 18, 2015</a></blockquote>
<blockquote class="twitter-tweet" data-cards="hidden" lang="en"><p lang="en" dir="ltr">Awesome post on building <a href="https://twitter.com/debian">@Debian</a> packages w/ <a href="https://twitter.com/docker">@Docker</a> by <a href="https://twitter.com/mlafeldt">@mlafeldt</a> via <a href="https://twitter.com/codeship">@codeship</a>: <a href="http://t.co/KlzWGX77gM">http://t.co/KlzWGX77gM</a> <a href="https://twitter.com/hashtag/DockerWeekly?src=hash">#DockerWeekly</a> <a href="http://t.co/mhXrVV6ZFI">pic.twitter.com/mhXrVV6ZFI</a></p>— Docker (@docker) <a href="https://twitter.com/docker/status/635094224422203392">August 22, 2015</a></blockquote>
Cook your own packages: Getting more out of fpm (SysAdvent)2014-12-15T00:00:00+01:00https://sharpend.io/cook-your-own-packages<p>When it comes to building packages, there is one particular tool that has grown in popularity over the last years: <a href="https://github.com/jordansissel/fpm">fpm</a>. fpm’s honorable goal is to make it as simple as possible to create native packages for multiple platforms, all without having to learn the intricacies of each distribution’s packaging format (.deb, .rpm, etc.) and tooling.</p>
<p>With a single command, fpm can build packages from a variety of sources including Ruby gems, Python modules, tarballs, and plain directories. Here’s a quick example showing you how to use the tool to create a Debian package of the <a href="https://github.com/aws/aws-sdk-ruby">AWS SDK for Ruby</a>:</p>
<figure class="highlight"><pre><code class="language-sh" data-lang="sh"><span class="nv">$ </span>fpm <span class="nt">-s</span> gem <span class="nt">-t</span> deb aws-sdk
Created package <span class="o">{</span>:path<span class="o">=></span><span class="s2">"rubygem-aws-sdk_1.59.0_all.deb"</span><span class="o">}</span></code></pre></figure>
<p>It is this simplicity that makes fpm so popular. Developers are able to easily distribute their software via platform-native packages. Businesses can manage their infrastructure on their own terms, independent of upstream vendors and their policies. All of this has been possible before, but never with this little effort.</p>
<p>In practice, however, things are often more complicated than the one-liner shown above. While it is absolutely possible to provision production systems with packages created by fpm, it will take some work to get there. The tool can only help you so far.</p>
<p>In this post we’ll take a look at several best practices covering: dependency resolution, reproducible builds, and infrastructure as code. All examples will be specific to Debian and Ruby, but the same lessons apply to other platforms/languages as well.</p>
<h2 id="resolving-dependencies">Resolving dependencies</h2>
<p>Let’s get back to the AWS SDK package from the introduction. With a single command, fpm converts the <code class="language-plaintext highlighter-rouge">aws-sdk</code> Ruby gem to a Debian package named <code class="language-plaintext highlighter-rouge">rubygem-aws-sdk</code>. This is what happens when we actually try to install the package on a Debian system:</p>
<figure class="highlight"><pre><code class="language-sh" data-lang="sh"><span class="nv">$ </span><span class="nb">sudo </span>dpkg <span class="nt">--install</span> rubygem-aws-sdk_1.59.0_all.deb
...
dpkg: dependency problems prevent configuration of rubygem-aws-sdk:
rubygem-aws-sdk depends on rubygem-aws-sdk-v1 <span class="o">(=</span> 1.59.0<span class="o">)</span><span class="p">;</span> however:
Package rubygem-aws-sdk-v1 is not installed.
...</code></pre></figure>
<p>As we can see, our package can’t be installed due to a missing dependency (<code class="language-plaintext highlighter-rouge">rubygem-aws-sdk-v1</code>). Let’s take a closer look at the generated .deb file:</p>
<figure class="highlight"><pre><code class="language-sh" data-lang="sh"><span class="nv">$ </span>dpkg <span class="nt">--info</span> rubygem-aws-sdk_1.59.0_all.deb
...
Package: rubygem-aws-sdk
Version: 1.59.0
License: Apache 2.0
Vendor: Amazon Web Services
Architecture: all
Maintainer: &lt;vagrant@wheezy-buildbox&gt;
Installed-Size: 5
Depends: rubygem-aws-sdk-v1 <span class="o">(=</span> 1.59.0<span class="o">)</span>
Provides: rubygem-aws-sdk
Section: Languages/Development/Ruby
Priority: extra
Homepage: http://aws.amazon.com/sdkforruby
Description: Version 1 of the AWS SDK <span class="k">for </span>Ruby. Available as both <span class="sb">`</span>aws-sdk<span class="sb">`</span> and <span class="sb">`</span>aws-sdk-v1<span class="sb">`</span><span class="nb">.</span>
Use <span class="sb">`</span>aws-sdk-v1<span class="sb">`</span> <span class="k">if </span>you want to load v1 and v2 of the Ruby SDK <span class="k">in </span>the same
application.</code></pre></figure>
<p>fpm did a great job at populating metadata fields such as package name, version, license, and description. It also made sure that the <code class="language-plaintext highlighter-rouge">Depends</code> field contains all required dependencies that have to be installed for our package to work properly. Here, there’s only one direct dependency – the one we’re missing.</p>
<p>While fpm goes to great lengths to provide proper dependency information – and this is not limited to Ruby gems – it does <em>not</em> automatically build those dependencies. That’s our job. We need to find a set of compatible dependencies and then tell fpm to build them for us.</p>
<p>Let’s build the missing <code class="language-plaintext highlighter-rouge">rubygem-aws-sdk-v1</code> package with the exact version required and then observe the next dependency in the chain:</p>
<figure class="highlight"><pre><code class="language-sh" data-lang="sh"><span class="nv">$ </span>fpm <span class="nt">-s</span> gem <span class="nt">-t</span> deb <span class="nt">-v</span> 1.59.0 aws-sdk-v1
Created package <span class="o">{</span>:path<span class="o">=></span><span class="s2">"rubygem-aws-sdk-v1_1.59.0_all.deb"</span><span class="o">}</span>
<span class="nv">$ </span>dpkg <span class="nt">--info</span> rubygem-aws-sdk-v1_1.59.0_all.deb | <span class="nb">grep </span>Depends
Depends: rubygem-nokogiri <span class="o">(>=</span> 1.4.4<span class="o">)</span>, rubygem-json <span class="o">(>=</span> 1.4<span class="o">)</span>, rubygem-json <span class="o">(&lt;&lt;</span> 2.0<span class="o">)</span></code></pre></figure>
<p>Two more packages to take care of: <code class="language-plaintext highlighter-rouge">rubygem-nokogiri</code> and <code class="language-plaintext highlighter-rouge">rubygem-json</code>. By now, it should be clear that resolving package dependencies like this is no fun. There must be a better way.</p>
<p>In the Ruby world, <a href="http://bundler.io/">Bundler</a> is the tool of choice for managing and resolving gem dependencies. So let’s ask Bundler for the dependencies we need. For this, we create a <code class="language-plaintext highlighter-rouge">Gemfile</code> with the following content:</p>
<figure class="highlight"><pre><code class="language-ruby" data-lang="ruby"><span class="c1"># Gemfile</span>
<span class="n">source</span> <span class="s2">"https://rubygems.org"</span>
<span class="n">gem</span> <span class="s2">"aws-sdk"</span><span class="p">,</span> <span class="s2">"= 1.59.0"</span>
<span class="n">gem</span> <span class="s2">"nokogiri"</span><span class="p">,</span> <span class="s2">"~> 1.5.0"</span> <span class="c1"># use older version of Nokogiri</span></code></pre></figure>
<p>We then instruct Bundler to resolve all dependencies and store the resulting .gem files into a local folder:</p>
<figure class="highlight"><pre><code class="language-sh" data-lang="sh"><span class="nv">$ </span>bundle package
...
Updating files <span class="k">in </span>vendor/cache
<span class="k">*</span> json-1.8.1.gem
<span class="k">*</span> nokogiri-1.5.11.gem
<span class="k">*</span> aws-sdk-v1-1.59.0.gem
<span class="k">*</span> aws-sdk-1.59.0.gem</code></pre></figure>
<p>We specifically asked Bundler to create .gem files because fpm can convert them into Debian packages in a matter of seconds:</p>
<figure class="highlight"><pre><code class="language-sh" data-lang="sh"><span class="nv">$ </span>find vendor/cache <span class="nt">-name</span> <span class="s1">'*.gem'</span> | xargs <span class="nt">-n1</span> fpm <span class="nt">-s</span> gem <span class="nt">-t</span> deb
Created package <span class="o">{</span>:path<span class="o">=></span><span class="s2">"rubygem-aws-sdk-v1_1.59.0_all.deb"</span><span class="o">}</span>
Created package <span class="o">{</span>:path<span class="o">=></span><span class="s2">"rubygem-aws-sdk_1.59.0_all.deb"</span><span class="o">}</span>
Created package <span class="o">{</span>:path<span class="o">=></span><span class="s2">"rubygem-json_1.8.1_amd64.deb"</span><span class="o">}</span>
Created package <span class="o">{</span>:path<span class="o">=></span><span class="s2">"rubygem-nokogiri_1.5.11_amd64.deb"</span><span class="o">}</span></code></pre></figure>
<p>As a final test, let’s install those packages…</p>
<figure class="highlight"><pre><code class="language-sh" data-lang="sh"><span class="nv">$ </span><span class="nb">sudo </span>dpkg <span class="nt">-i</span> <span class="k">*</span>.deb
...
Setting up rubygem-json <span class="o">(</span>1.8.1<span class="o">)</span> ...
Setting up rubygem-nokogiri <span class="o">(</span>1.5.11<span class="o">)</span> ...
Setting up rubygem-aws-sdk-v1 <span class="o">(</span>1.59.0<span class="o">)</span> ...
Setting up rubygem-aws-sdk <span class="o">(</span>1.59.0<span class="o">)</span> ...</code></pre></figure>
<p>…and verify that the AWS SDK actually can be used by Ruby:</p>
<figure class="highlight"><pre><code class="language-sh" data-lang="sh"><span class="nv">$ </span>ruby <span class="nt">-e</span> <span class="s2">"require 'aws-sdk'; puts AWS::VERSION"</span>
1.59.0</code></pre></figure>
<p>Win!</p>
<p>The purpose of this little exercise was to demonstrate one effective approach to resolving package dependencies for fpm. By using Bundler – the best tool for the job – we get fine control over all dependencies, including transitive ones (like Nokogiri, see <code class="language-plaintext highlighter-rouge">Gemfile</code>). Other languages provide similar dependency tools. We should make use of language-specific tools whenever we can.</p>
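<p>Bundler can also be driven from Ruby, which is handy if you want to script the conversion. Here is a sketch that turns a lockfile into fpm invocations – the lockfile content is a shortened, made-up example, and the code assumes a Ruby with Bundler available (the default since Ruby 2.6):</p>

```ruby
require "bundler"

# A resolved lockfile as produced by `bundle lock` (shortened example).
lockfile = <<~LOCK
  GEM
    remote: https://rubygems.org/
    specs:
      aws-sdk (1.59.0)
        aws-sdk-v1 (= 1.59.0)
      aws-sdk-v1 (1.59.0)

  PLATFORMS
    ruby

  DEPENDENCIES
    aws-sdk (= 1.59.0)
LOCK

specs = Bundler::LockfileParser.new(lockfile).specs

# Emit one fpm command per resolved gem, pinned to the locked version.
commands = specs.map { |s| "fpm -s gem -t deb -v #{s.version} #{s.name}" }
puts commands
```

<p>Piping those commands through a shell gives you the same result as the <code class="language-plaintext highlighter-rouge">find | xargs</code> one-liner above, without having to download the .gem files first.</p>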
<h2 id="build-infrastructure">Build infrastructure</h2>
<p>After learning how to build all packages that make up a piece of software, let’s consider how to integrate fpm into our build infrastructure. These days, with the rise of the DevOps movement, many teams have started to manage their own infrastructure. Even though each team is likely to have unique requirements, it still makes sense to share a company-wide build infrastructure, as opposed to reinventing the wheel each time someone wants to automate packaging.</p>
<p>Packaging is often only a small step in a longer series of build steps. In many cases, we first have to build the software itself. While fpm supports multiple source formats, it doesn’t know how to build the source code or determine dependencies required by the package. Again, that’s our job.</p>
<p>Creating a consistent build and release process for different projects across multiple teams is hard. Fortunately, there’s another tool that does most of the work for us: <a href="https://github.com/bernd/fpm-cookery">fpm-cookery</a>. fpm-cookery sits on top of fpm and provides the missing pieces to create a reusable build infrastructure. Inspired by projects like <a href="http://brew.sh/">Homebrew</a>, fpm-cookery builds packages based on simple recipes written in Ruby.</p>
<p>Let’s turn our attention back to the AWS SDK. Remember how we initially converted the gem to a Debian package? As a warm up, let’s do the same with fpm-cookery. First, we have to create a <code class="language-plaintext highlighter-rouge">recipe.rb</code> file:</p>
<figure class="highlight"><pre><code class="language-ruby" data-lang="ruby"><span class="c1"># recipe.rb</span>
<span class="k">class</span> <span class="nc">AwsSdkGem</span> <span class="o">&lt;</span> <span class="no">FPM</span><span class="o">::</span><span class="no">Cookery</span><span class="o">::</span><span class="no">RubyGemRecipe</span>
<span class="nb">name</span> <span class="s2">"aws-sdk"</span>
<span class="n">version</span> <span class="s2">"1.59.0"</span>
<span class="k">end</span></code></pre></figure>
<p>Next, we pass the recipe to <code class="language-plaintext highlighter-rouge">fpm-cook</code>, the command-line tool that comes with fpm-cookery, and let it build the package for us:</p>
<figure class="highlight"><pre><code class="language-sh" data-lang="sh"><span class="nv">$ </span>fpm-cook package recipe.rb
<span class="o">===></span> Starting package creation <span class="k">for </span>aws-sdk-1.59.0 <span class="o">(</span>debian, deb<span class="o">)</span>
<span class="o">===></span>
<span class="o">===></span> Verifying build_depends and depends with Puppet
<span class="o">===></span> All build_depends and depends packages installed
<span class="o">===></span> <span class="o">[</span>FPM] Trying to download <span class="o">{</span><span class="s2">"gem"</span>:<span class="s2">"aws-sdk"</span>,<span class="s2">"version"</span>:<span class="s2">"1.59.0"</span><span class="o">}</span>
...
<span class="o">===></span> Created package: /home/vagrant/pkg/rubygem-aws-sdk_1.59.0_all.deb</code></pre></figure>
<p>To complete the exercise, we also need to write a recipe for each remaining gem dependency. This is what the final recipes look like:</p>
<figure class="highlight"><pre><code class="language-ruby" data-lang="ruby"><span class="c1"># recipe.rb</span>
<span class="k">class</span> <span class="nc">AwsSdkGem</span> <span class="o">&lt;</span> <span class="no">FPM</span><span class="o">::</span><span class="no">Cookery</span><span class="o">::</span><span class="no">RubyGemRecipe</span>
<span class="nb">name</span> <span class="s2">"aws-sdk"</span>
<span class="n">version</span> <span class="s2">"1.59.0"</span>
<span class="n">maintainer</span> <span class="s2">"Mathias Lafeldt &lt;mathias@example.com&gt;"</span>
<span class="n">chain_package</span> <span class="kp">true</span>
<span class="n">chain_recipes</span> <span class="p">[</span><span class="s2">"aws-sdk-v1"</span><span class="p">,</span> <span class="s2">"json"</span><span class="p">,</span> <span class="s2">"nokogiri"</span><span class="p">]</span>
<span class="k">end</span>
<span class="c1"># aws-sdk-v1.rb</span>
<span class="k">class</span> <span class="nc">AwsSdkV1Gem</span> <span class="o">&lt;</span> <span class="no">FPM</span><span class="o">::</span><span class="no">Cookery</span><span class="o">::</span><span class="no">RubyGemRecipe</span>
<span class="nb">name</span> <span class="s2">"aws-sdk-v1"</span>
<span class="n">version</span> <span class="s2">"1.59.0"</span>
<span class="n">maintainer</span> <span class="s2">"Mathias Lafeldt &lt;mathias@example.com&gt;"</span>
<span class="k">end</span>
<span class="c1"># json.rb</span>
<span class="k">class</span> <span class="nc">JsonGem</span> <span class="o">&lt;</span> <span class="no">FPM</span><span class="o">::</span><span class="no">Cookery</span><span class="o">::</span><span class="no">RubyGemRecipe</span>
<span class="nb">name</span> <span class="s2">"json"</span>
<span class="n">version</span> <span class="s2">"1.8.1"</span>
<span class="n">maintainer</span> <span class="s2">"Mathias Lafeldt &lt;mathias@example.com&gt;"</span>
<span class="k">end</span>
<span class="c1"># nokogiri.rb</span>
<span class="k">class</span> <span class="nc">NokogiriGem</span> <span class="o">&lt;</span> <span class="no">FPM</span><span class="o">::</span><span class="no">Cookery</span><span class="o">::</span><span class="no">RubyGemRecipe</span>
<span class="nb">name</span> <span class="s2">"nokogiri"</span>
<span class="n">version</span> <span class="s2">"1.5.11"</span>
<span class="n">maintainer</span> <span class="s2">"Mathias Lafeldt &lt;mathias@example.com&gt;"</span>
<span class="n">build_depends</span> <span class="p">[</span><span class="s2">"libxml2-dev"</span><span class="p">,</span> <span class="s2">"libxslt1-dev"</span><span class="p">]</span>
<span class="n">depends</span> <span class="p">[</span><span class="s2">"libxml2"</span><span class="p">,</span> <span class="s2">"libxslt1.1"</span><span class="p">]</span>
<span class="k">end</span></code></pre></figure>
<p>Running <code class="language-plaintext highlighter-rouge">fpm-cook</code> again will produce Debian packages that can be added to an APT repository and are ready for use in production.</p>
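<p>Publishing the resulting <code class="language-plaintext highlighter-rouge">.deb</code> files works with any APT repository manager. As one illustrative sketch using reprepro (the repository path, distribution codename, and package filename below are assumptions, not part of the recipes above):</p>

```shell
# fpm-cookery writes finished packages to the pkg/ directory next to
# the recipe. Add one of them to a reprepro-managed APT repository.
# /srv/apt is an assumed repository location; "stable" is an assumed
# distribution codename configured in its conf/distributions file.
reprepro -b /srv/apt includedeb stable pkg/rubygem-nokogiri_1.5.11-1_amd64.deb
```

Any comparable tool (aptly, a plain `dpkg-scanpackages` setup, etc.) would serve the same purpose.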
<p>Three things worth highlighting:</p>
<ul>
<li>
<p>fpm-cookery is able to build multiple dependent packages in a row (configured by <code class="language-plaintext highlighter-rouge">chain_*</code> attributes), allowing us to build everything with a single invocation of <code class="language-plaintext highlighter-rouge">fpm-cook</code>.</p>
</li>
<li>
<p>We can use the attributes <code class="language-plaintext highlighter-rouge">build_depends</code> and <code class="language-plaintext highlighter-rouge">depends</code> to specify a package’s build and runtime dependencies. When running <code class="language-plaintext highlighter-rouge">fpm-cook</code> as root, the tool will automatically install missing dependencies for us.</p>
</li>
<li>
<p>I deliberately set the <code class="language-plaintext highlighter-rouge">maintainer</code> attribute in all recipes. It’s important to take responsibility for the work we do, and we should make it as easy as possible for others to identify the person or team responsible for a package.</p>
</li>
</ul>
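<p>Putting those three points together, a build-and-verify session might look like this (package filenames are illustrative; the exact names depend on your platform and fpm-cookery's version/iteration defaults):</p>

```shell
# Build the main recipe plus all chained gem recipes in one run.
# Running as root lets fpm-cook install missing build dependencies
# (libxml2-dev, libxslt1-dev, ...) automatically.
sudo fpm-cook

# Inspect one of the resulting packages to confirm the metadata we
# set in the recipe, e.g. the maintainer field.
dpkg-deb --info pkg/rubygem-nokogiri_1.5.11-1_amd64.deb | grep -i maintainer
```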
<p>fpm-cookery provides <a href="https://github.com/bernd/fpm-cookery/wiki/Recipe-Specification">many more attributes</a> to configure all aspects of the build process. Among other things, it can download source code from GitHub before running custom build instructions (e.g. <code class="language-plaintext highlighter-rouge">make install</code>). The <a href="https://github.com/bernd/fpm-recipes">fpm-recipes</a> repository is an excellent place to study some working examples. This final example, a recipe for <a href="https://github.com/postmodern/chruby">chruby</a>, is a foretaste of what fpm-cookery can actually do:</p>
<figure class="highlight"><pre><code class="language-ruby" data-lang="ruby"><span class="c1"># recipe.rb</span>
<span class="k">class</span> <span class="nc">Chruby</span> <span class="o"><</span> <span class="no">FPM</span><span class="o">::</span><span class="no">Cookery</span><span class="o">::</span><span class="no">Recipe</span>
<span class="n">description</span> <span class="s2">"Changes the current Ruby"</span>
<span class="nb">name</span> <span class="s2">"chruby"</span>
<span class="n">version</span> <span class="s2">"0.3.8"</span>
<span class="n">homepage</span> <span class="s2">"https://github.com/postmodern/chruby"</span>
<span class="n">source</span> <span class="s2">"https://github.com/postmodern/chruby/archive/v</span><span class="si">#{</span><span class="n">version</span><span class="si">}</span><span class="s2">.tar.gz"</span>
<span class="n">sha256</span> <span class="s2">"d980872cf2cd047bc9dba78c4b72684c046e246c0fca5ea6509cae7b1ada63be"</span>
<span class="n">maintainer</span> <span class="s2">"Jan Brauer <jan@example.com>"</span>
<span class="n">section</span> <span class="s2">"development"</span>
<span class="n">config_files</span> <span class="s2">"/etc/profile.d/chruby.sh"</span>
<span class="k">def</span> <span class="nf">build</span>
<span class="c1"># nothing to do here</span>
<span class="k">end</span>
<span class="k">def</span> <span class="nf">install</span>
<span class="n">make</span> <span class="ss">:install</span><span class="p">,</span> <span class="s2">"PREFIX"</span> <span class="o">=></span> <span class="n">prefix</span>
<span class="n">etc</span><span class="p">(</span><span class="s2">"profile.d"</span><span class="p">).</span><span class="nf">install</span> <span class="n">workdir</span><span class="p">(</span><span class="s2">"chruby.sh"</span><span class="p">)</span>
<span class="k">end</span>
<span class="k">end</span>
<span class="c1"># chruby.sh</span>
<span class="n">source</span> <span class="s2">/usr/share/chruby/chruby.sh</span></code></pre></figure>
<h2 id="wrapping-up">Wrapping up</h2>
<p>fpm has changed the way we build packages. We can get even more out of fpm by using it in combination with other tools. Dedicated programs like Bundler can help us with resolving package dependencies, which is something fpm won’t do for us. fpm-cookery adds another missing piece: it allows us to describe our packages using simple recipes, which can be kept under version control, giving us the benefits of infrastructure as code: repeatability, automation, rollbacks, code reviews, etc.</p>
<p>Last but not least, it’s a good idea to pair fpm-cookery with <a href="https://www.docker.com/">Docker</a> or <a href="https://www.vagrantup.com/">Vagrant</a> for fast, isolated package builds. This, however, is outside the scope of this article and left as an exercise for the reader.</p>
<h2 id="further-reading">Further reading</h2>
<ul>
<li><a href="http://sysadvent.blogspot.com/2011/12/day-4-guide-to-packaging-systems.html">“A Guide to Package Systems”</a> (SysAdvent 2011, Day 4) – A quick introduction to fpm by its author, <a href="https://twitter.com/jordansissel">Jordan Sissel</a>.</li>
<li><a href="http://sysadvent.blogspot.com/2013/12/day-16-omnibusing-your-way-to-happiness.html">“omnibus’ing your way to happiness”</a> (SysAdvent 2013, Day 16) – An introduction to <a href="https://github.com/opscode/omnibus">Omnibus</a>, a powerful tool that leverages fpm to build full-stack installers.</li>
<li><a href="https://github.com/jordansissel/fpm/wiki">fpm wiki</a> – Comprehensive guides on how to package all sorts of software with fpm.</li>
</ul>
<hr />
<p><em>This article first appeared on the <a href="http://sysadvent.blogspot.com/2014/12/day-15-cook-your-own-packages-getting.html">SysAdvent blog</a> (SysAdvent 2014, Day 15). Thanks to my editor, <a href="http://www.twitter.com/josephkern">Joseph Kern</a>, who did a great job. Also thanks to <a href="http://www.twitter.com/berndahlers">Bernd Ahlers</a>, the author of fpm-cookery, for reviewing drafts of this.</em></p>
Infrastructure automation by example (Practicing Ruby)2013-11-12T00:00:00+01:00https://sharpend.io/infrastructure-automation-by-example