
Today I want to share a little story about the challenge of making one component of our core infrastructure more resilient to failures.

Before jumping to the meat of it, however, it will be helpful to have an understanding of what the infrastructure stack in question does and how it works.

A quick look at Wonderland

I’m currently working as an Infrastructure Toolsmith in Jimdo’s Werkzeugschmiede team. In brief, our goal is to provide a platform (PaaS) where all Jimdo developers can deploy and run their Dockerized applications with as little friction as possible.

Our platform, which we internally call Wonderland, utilizes the Amazon EC2 Container Service (ECS). ECS is essentially a cluster scheduler that maps a set of work (batch jobs or long-running services) to a set of resources (EC2 instances). To connect instances to an ECS cluster, one has to install and run a daemon called ecs-agent on them. ecs-agent will then start Docker containers on behalf of ECS. Wonderland, in turn, talks to the ECS API whenever a user triggers a deployment.

There are actually more moving pieces involved, but that’s the gist of it.

What can go wrong

For our setup to work, it’s important that the following conditions are met on each cluster instance:

  • ecs-agent is running properly to get instructions from ECS.
  • Docker daemon is running properly to start containers on behalf of ecs-agent.
  • /var/lib/docker is backed by a functioning Amazon EBS volume with enough storage for Docker.

According to Murphy’s law, anything that can go wrong will go wrong. Here are some of the problems we had to deal with in the past:

  • Deploying a newer (and broken) version of ecs-agent that would disconnect from ECS after a while. We did a rollback. These days, we do more testing before updating ecs-agent in production.
  • Docker daemon becoming unresponsive. After lots of debugging, we found out this was due to the Datadog agent collecting disk usage metrics per container. After disabling the corresponding “container_size” metric, the problem was gone.
  • EBS volume running out of inodes (“no space left on device”), which was caused by Docker containers with many small files in them. The fix was to switch the underlying filesystem from EXT4 to XFS.
  • Formatting of the EBS volume failed with “mkfs.xfs: /dev/xvdb appears to contain an existing filesystem (xfs)”. Without the mounted device, Docker quickly filled up the small root filesystem. It only happened once, though, so we ignored it.

Self-healing infrastructure

Of course, we want our infrastructure to be resilient to failures. If an EC2 instance becomes unhealthy – if it’s unable to run Docker containers for whatever reason – it should be replaced without any manual intervention.

To achieve this, all of our cluster instances are managed by an EC2 auto scaling group. The associated load balancer is configured to send an HTTP request to each instance to perform a health check (the specific endpoint is :51678/v1/metadata). If ecs-agent does not respond with “200 OK” for some time, the instance will be terminated and a new one will be started immediately.
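
To make this concrete, here is roughly what the check boils down to on each instance (curl merely stands in for the load balancer's HTTP probe against the ecs-agent introspection API):

# A 200 response from the introspection endpoint counts as healthy.
$ curl -sf http://localhost:51678/v1/metadata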

That’s the theory. In practice, monitoring ecs-agent alone is not enough, as the endpoint we use will happily report that everything is fine even if, for example, /var/lib/docker isn’t writable…

Now what?

systemd to the rescue

For the cluster instances, we decided to use CoreOS, a container-focused Linux distribution optimized for large-scale deployments. Like most modern Linux distributions, CoreOS uses systemd as its init system.

These are the systemd unit files of interest in our setup:

  • format-ephemeral.service – formats the EBS volume using mkfs.xfs
  • var-lib-docker.mount – mounts the EBS volume to /var/lib/docker
  • docker.service – starts the Docker daemon (this unit ships with CoreOS)
  • ecs-agent.service – starts ecs-agent

From the beginning, we had requirement dependencies in place (via Requires=) to ensure units are started and stopped at the right time. For example, var-lib-docker.mount requires format-ephemeral.service, and ecs-agent.service requires docker.service to work.
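
As a simplified sketch (not our literal unit files; the device name is just the one from the mkfs error quoted earlier), the dependency-related bits look something like this:

# var-lib-docker.mount (excerpt)
[Unit]
Requires=format-ephemeral.service
After=format-ephemeral.service

[Mount]
What=/dev/xvdb
Where=/var/lib/docker
Type=xfs

# ecs-agent.service (excerpt)
[Unit]
Requires=docker.service
After=docker.service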

We want to take advantage of the fact that systemd will stop ecs-agent if any of its dependencies fails, which would cause the unhealthy instance to be replaced. Self-healing infrastructure for the win!

The attentive reader will already have noticed that – until recently – we had been missing one important dependency in the chain: ecs-agent, which provides our health check endpoint, absolutely requires the EBS volume to be formatted and read-write mounted to /var/lib/docker. Without the volume, the cluster instance can’t work reliably.

I tried to remedy this by adding Requires=var-lib-docker.mount to ecs-agent.service. Some testing showed, however, that the agent won’t be stopped if the mount point is unmounted at runtime. Luckily, one of my colleagues suggested giving BindsTo= a try instead:

Configures requirement dependencies, very similar in style to Requires=, however in addition to this behavior, it also declares that this unit is stopped when any of the units listed suddenly disappears. Units can suddenly, unexpectedly disappear if a service terminates on its own choice, a device is unplugged or a mount point unmounted without involvement of systemd.

Which turned out to be what we needed. With BindsTo= in place, I executed umount /var/lib/docker and systemctl stop var-lib-docker.mount, and in both cases ecs-agent was stopped by systemd – and the instance was terminated!
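
For reference, the relevant part of ecs-agent.service now looks roughly like this (again a simplified excerpt, not the full unit):

# ecs-agent.service (excerpt)
[Unit]
Requires=docker.service
After=docker.service
# BindsTo= makes systemd stop ecs-agent as soon as the mount unit disappears,
# e.g. when /var/lib/docker is unmounted without systemd's involvement.
BindsTo=var-lib-docker.mount
After=var-lib-docker.mount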

So all is well. Or is it?

Chaos Monkey and XFS

To be honest, neither unmounting the filesystem nor stopping the mount unit is a good way to simulate an EBS failure. I knew how to do better, though.

As a big fan of Chaos Monkey, I couldn’t resist using it for a quick chaos experiment: What would happen if, instead of unmounting the EBS volume gracefully, it was forcefully detached at runtime? I assumed that the result would be the same: systemd would stop ecs-agent.

To validate my assumption, I used a little command-line tool I’ve developed to trigger chaos events via the Chaos Monkey REST API:

$ chaosmonkey -endpoint "$WONDERLAND_ENDPOINT" \
    -username "$WONDERLAND_USER" -password "$WONDERLAND_PASS" \
    -group wonderland-crims-CrimsAutoScalingGroup \
    -strategy DetachVolumes

The “DetachVolumes” chaos strategy will force-detach all EBS volumes from a randomly selected EC2 instance of the given auto scaling group. As a result, EBS disk I/O will fail.
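
Under the hood, a force-detach is just an EC2 API call; per volume, it is roughly equivalent to the following AWS CLI invocation (the volume ID here is a made-up example):

# Force-detach one EBS volume regardless of what the instance is doing with it.
$ aws ec2 detach-volume --volume-id vol-0123456789abcdef0 --force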

The result was surprising, at least to me.

You can read up on the details here. In a nutshell, systemd did not terminate ecs-agent because of XFS. When the Linux XFS driver detects a fatal error, it shuts the filesystem down – by design – to keep the on-disk data consistent, but it does not unmount it; any further reads or writes simply fail with I/O errors. Hence, systemd correctly reported that the device was still mounted, leaving us with a broken system that won’t be able to start new Docker containers…

Conclusion

Unfortunately, we cannot rely on our current health check to correctly detect EBS failures. To make up for this, we would probably need to provide our own monitoring daemon for more sophisticated checks. This is something we wanted to avoid because, as the saying goes, the best code is no code at all.
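
If we ever do write such a check, it could start out as simple as verifying that /var/lib/docker is actually writable. A minimal sketch, not something we run in production:

# Hypothetical extra health check: a forced XFS shutdown (or a detached EBS
# volume) would make this write fail with an I/O error.
$ touch /var/lib/docker/.healthcheck && rm /var/lib/docker/.healthcheck \
    && echo healthy || echo unhealthy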

I myself have learned a lot about systemd, EBS, and XFS. Even though our infrastructure isn’t perfect yet, we’re more confident in our system now than we were prior to the experiments. It’s not the first time Chaos Engineering has proved our assumptions wrong.

Still, there’s much to learn from failure.
