The main idea of Chaos Engineering, we recall, is to trigger failures proactively in a controlled way to gain confidence that our production systems can withstand those failures. Chaos Engineering enables us to verify that our systems behave as we expect – and to fix them if they don’t.
In a previous article, I showed you how to use Chaos Monkey for automating your first chaos experiment. For this purpose, I’ve created a customizable Docker image of the Simian Army (which Chaos Monkey is part of) as a solid foundation for running experiments.
Netflix originally designed Chaos Monkey to terminate EC2 instances randomly during business hours. To that end, the tool comes with plenty of configuration settings to control frequency, probability, type of terminations, and a lot more.
However, you don’t need to automate experiments to run continuously in order to benefit from Chaos Engineering (but doing so may further increase confidence in your systems). At Jimdo, we don’t use Chaos Monkey in a conventional way. In fact, the service is idle most of the time, waiting for instructions from us.
Many people don’t know that, in addition to scheduled instance terminations, Chaos Monkey also supports killing instances on demand via its built-in REST API. So instead of inflicting chaos on your servers at random, e.g. once an hour between 10am and 5pm, it’s on you to decide if and when the monkey will perform a destructive action, without having to follow any imposed schedule.
Want to give it a try? With the aforementioned Docker image, setup is a breeze. This single command will start a Chaos Monkey that is ready to take API requests on port 8080 (don’t worry about scheduled terminations – they’re deactivated in this example):
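The invocation looks something like this – a sketch, where the image name mlafeldt/simianarmy, the region, and the SIMIANARMY_* variables (which the image maps onto simianarmy.properties settings) are assumptions; check the image’s README for the exact names:

```shell
# Start Chaos Monkey with its REST API listening on port 8080.
# The SIMIANARMY_* variables are assumptions based on the image's
# convention of mapping environment variables to simianarmy.properties;
# substitute your own AWS credentials and region.
docker run -d --name chaosmonkey -p 8080:8080 \
  -e SIMIANARMY_CLIENT_AWS_ACCOUNTKEY="$AWS_ACCESS_KEY_ID" \
  -e SIMIANARMY_CLIENT_AWS_SECRETKEY="$AWS_SECRET_ACCESS_KEY" \
  -e SIMIANARMY_CLIENT_AWS_REGION=eu-west-1 \
  mlafeldt/simianarmy
```

Since no termination schedule is configured here, the monkey sits idle until you tell it what to do via the API.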
Afterward, you can talk to the API to trigger and retrieve instance terminations, or “chaos events”, as they are called here. For example, this will give you a list of past chaos events:
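Assuming the monkey from above is listening on localhost:8080, the chaos endpoint documented in the Simian Army wiki is /simianarmy/api/v1/chaos:

```shell
# List all recorded chaos events as JSON
curl -s http://localhost:8080/simianarmy/api/v1/chaos
```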
Triggering failures via the API is a bit more involved, and I won’t go into the details here. Instead, I’d like to promote a handy command-line tool I’ve written for that purpose.
Originally developed for controlled failure injection during GameDays at Jimdo, the tool could also be described as “Chaos Monkey whenever you feel like it”.
Let’s use it to send some commands to the Chaos Monkey we just started with docker run. First, and most important, here’s how to trigger a new chaos event. This will block all network access to a random instance of the given EC2 auto scaling group:
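A sketch of the invocation – the flag names and the group name my-app-asg are placeholders, not the tool’s actual interface, so consult its built-in help for the real flags:

```shell
# Block all network access to a random instance of the given ASG.
# Flag names and the group name are assumptions for illustration.
chaosmonkey -endpoint http://localhost:8080 \
            -group my-app-asg \
            -strategy BlockAllNetworkTraffic
```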
Sometimes it’s also convenient to terminate multiple instances of an auto scaling group, e.g. to test under what conditions a cluster loses quorum. In this example, we’re going to shut down three cluster instances at intervals of 30 seconds:
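One way to sketch this on top of the raw REST API – the group name zookeeper is a placeholder, and the chaosType field is an assumption based on the Simian Army REST documentation:

```shell
# Terminate three instances of the "zookeeper" ASG, 30 seconds apart.
for i in 1 2 3; do
  curl -s -X POST http://localhost:8080/simianarmy/api/v1/chaos \
       -H 'Content-Type: application/json' \
       -d '{"eventType": "CHAOS_TERMINATION",
            "groupType": "ASG",
            "groupName": "zookeeper",
            "chaosType": "ShutdownInstance"}'
  sleep 30
done
```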
It’s also straightforward to list past chaos events as Chaos Monkey keeps track of everything:
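Under the hood, listing events is just a GET against the chaos endpoint. For example, to pull out a few fields per event (the field names are assumptions about the JSON that comes back):

```shell
# Print monkey type, group, and timestamp for each recorded event.
curl -s http://localhost:8080/simianarmy/api/v1/chaos |
  python3 -c 'import json, sys
for e in json.load(sys.stdin):
    print(e.get("monkeyType"), e.get("groupName"), e.get("eventTime"))'
```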
There are a couple more features not shown here for brevity. The AWS integration, for instance, allows you to list the auto scaling groups for a given AWS account and to wipe Chaos Monkey’s state if you want to start over. I encourage you to read the documentation for further details, including installation instructions.
Chaos Monkey at Jimdo
Now that I’ve told you about the REST API and the CLI tool, I want to share how we deploy and run Chaos Monkey at Jimdo. Here are the facts:
- Chaos Monkey is just another service running on Jimdo’s PaaS
- For deployment, we use the Docker image mentioned above
- We run one monkey in production and one in our staging environment
- We use :8080/simianarmy/ for the HTTP health check
- An Nginx auth proxy protects the API endpoint (which is public on our platform)
- For high availability, we deploy two service replicas behind an ELB (this works because we only use the REST API, no scheduled terminations)
- We get Slack notifications for all terminations
- chaosmonkeytool is installed on our bastion hosts, preconfigured and ready to use for chaos experiments
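For the curious, the auth proxy from the list above doesn’t need to be fancy. A minimal sketch – server name, TLS setup, and file paths are placeholders:

```nginx
# Basic-auth proxy in front of the Chaos Monkey REST API
server {
    listen      443 ssl;
    server_name chaosmonkey.example.com;

    location / {
        auth_basic           "Chaos Monkey";
        auth_basic_user_file /etc/nginx/chaosmonkey.htpasswd;
        proxy_pass           http://127.0.0.1:8080;
    }
}
```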
That’s about it.
At this point, you may wonder whether we really need all this complexity to kill some EC2 instances once in a while. Maybe not. It depends on your infrastructure and the type of chaos testing you’re doing. For us, this setup makes sense given the PaaS we have in place and the kind of GameDay exercises we’ve been performing on a regular basis. At the same time, I wouldn’t recommend replacing Chaos Monkey and its many off-the-shelf features with a homegrown shell script. That’s just not sustainable in the long run.
Resilience testing should be second nature to engineers. It’s something we should be doing more often – without fear – and open source tools like Chaos Monkey facilitate this goal. I like the idea of having it at my disposal whenever I need it.