From zero to staging and back

Update (November 2016): There’s now an updated and revised version of this article.

The primary use of a staging environment is to test all installation/configuration/migration scripts and procedures before they are applied to the production environment. This ensures that all major and minor upgrades to the production environment will be completed reliably without errors, in minimum time.

The first major task I took on after joining the Werkzeugschmiede team at Jimdo was building a staging environment for Wonderland, our in-house PaaS for microservices. Having a pre-production environment for testing prior to deploying to production is invaluable. It’s a safety net that makes the whole deployment process a lot less scary. We knew that it would give us more confidence to experiment, fix bugs, and implement new features. Given all these benefits, creating a staging environment for Wonderland was long overdue. Here’s how we did it.

Pair programming

From the beginning, we’ve employed pair programming. Building a staging environment that mirrors production as closely as possible, and doing this with Paul, who knows Wonderland inside out, was a great way to learn about the system and its components. I was able to ask questions when something was unclear and, at the same time, contribute my own ideas whenever I felt like it. This way, we created a fast feedback loop that not only helped me find my way through Wonderland, but also taught me about its creators and their modus operandi.

All in all, I highly recommend pairing for onboarding new team members – even if you don’t have the luxury of building a production-like environment from scratch.

One VPC per environment

Wonderland’s infrastructure is hosted on AWS. Rather than using a single VPC for both production and staging, we agreed to operate a dedicated VPC per environment (via separate AWS accounts). This setup is an effective way to isolate environments from one another. Most importantly, it prevents changes made in staging – whether intentionally or by mistake – from affecting production. Other advantages of having different VPCs are:

  • Finer access control at the VPC level
  • An easier understanding of the (cloned) infrastructure
  • Simpler automation code with fewer exceptions

Working with multiple AWS accounts makes credential management a bit more involved, though. That’s why we’re using awsenv, a tool by my coworker Knut, to quickly switch between accounts.
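
To give a rough idea of what switching accounts involves, here is a minimal sketch in Python. It is not how awsenv works internally; it only relies on the AWS_PROFILE environment variable that the AWS CLI and SDKs read. The profile names and both helper functions are hypothetical.

```python
import os
import subprocess

# Hypothetical mapping from environment name to a named profile in ~/.aws/config.
PROFILES = {"prod": "wonderland-prod", "stage": "wonderland-stage"}

STATIC_CREDENTIAL_VARS = (
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_SESSION_TOKEN",
)


def env_for(environment, base_env=None):
    """Return a copy of the process environment pinned to one AWS account."""
    if environment not in PROFILES:
        raise ValueError(f"unknown environment: {environment!r}")
    env = dict(os.environ if base_env is None else base_env)
    # Drop static credentials that might otherwise take precedence over the profile.
    for key in STATIC_CREDENTIAL_VARS:
        env.pop(key, None)
    env["AWS_PROFILE"] = PROFILES[environment]
    return env


def run_in(environment, command):
    """Run a command (given as a list of arguments) against one environment."""
    return subprocess.run(command, env=env_for(environment), check=True)
```

With this in place, something like run_in("stage", ["make", "stage"]) would run the build with staging credentials only, which makes it harder to touch production by accident.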

Besides AWS, we also created separate accounts for all third-party services, like Papertrail and Datadog. If we do it, we do it right.

Automate all the things

We spent a lot of time automating the setup of our staging environment. We managed to get to the point where we were able to run make stage in our repository and Ansible would take care of everything, from bootstrapping our ECS cluster to provisioning Jenkins – our central state enforcer – to running essential Docker containers.

To achieve that, we took the existing Ansible playbooks and CloudFormation templates for production and adapted them for use in staging. This meant that we had to:

  • Replace many hardcoded parameters like URLs and secrets
  • Implement all missing automation steps (some prod resources had been created by hand in the AWS console)
  • Address any issues that came up along the way (two words: eventual consistency)
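
Eventual consistency typically shows up as a resource that was just created but is not yet visible to the next step. One common way to deal with it, sketched here in Python rather than taken from our playbooks, is to poll with an increasing delay. The function name and its parameters are assumptions made for this example.

```python
import time


def wait_until(check, attempts=5, delay=1.0, backoff=2.0, sleep=time.sleep):
    """Call check() until it returns a truthy value or attempts run out.

    The delay between attempts grows by the backoff factor each time.
    """
    for attempt in range(1, attempts + 1):
        result = check()
        if result:
            return result
        if attempt < attempts:
            sleep(delay)
            delay *= backoff
    raise TimeoutError(f"condition not met after {attempts} attempts")
```

A caller would pass a function that looks up the freshly created resource, so that the following step only runs once the lookup succeeds.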

Even after sorting this out, there’s only one way to find out whether our automation actually works as expected: creating staging from scratch, again and again.

Destroy all the things

In addition to make stage, we also implemented the inverse operation, make destroy-stage, to completely destroy staging. This boils down to deleting all CloudFormation stacks as well as all other resources created by Ansible – in reverse order of creation. Tearing down CloudFormation stacks is usually straightforward. However, shelling out to the AWS CLI in Ansible can quickly lead to dependencies that are hard to remove. And even if Ansible does provide a specific AWS module, there’s no guarantee that the “absent” state is implemented properly (I’m looking at you, iam module).
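
The "reverse order of creation" rule can be sketched in a few lines of Python. This is an illustration of the idea only, not our actual Makefile or playbooks; the step names and the destroy callback are hypothetical.

```python
# Hypothetical step names, listed in the order a build would create them.
CREATION_ORDER = ["vpc", "ecs-cluster", "jenkins", "containers"]


def teardown_plan(creation_order):
    """Return the resources in the order they should be destroyed."""
    return list(reversed(creation_order))


def destroy_all(creation_order, destroy):
    """Destroy every resource, last created first.

    An exception raised by destroy() stops the run, so anything that other
    resources still depend on is left untouched.
    """
    destroyed = []
    for name in teardown_plan(creation_order):
        destroy(name)
        destroyed.append(name)
    return destroyed
```

Keeping a single list as the source of truth for both directions means a newly added step is torn down automatically in the right place.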

Once make destroy-stage did what we wanted it to do, we were able to bootstrap staging from scratch. This in turn allowed us to verify that our infrastructure code produces the correct results when starting from a clean system.

To further automate things, I created one Jenkins job in prod to automatically destroy staging every Friday night and another one to rebuild it on Monday morning. This way, we’ll gain even more confidence in our code and, as a nice side effect, save a bit of money.
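
The resulting weekly window can be expressed as a small Python check, for example to guard other jobs that need staging. The exact hours are assumptions made for this sketch, and it ignores the time the rebuild itself takes.

```python
from datetime import datetime

# Assumed times, as (weekday, hour) with Monday as weekday 0.
REBUILD_AT = (0, 6)   # Monday 06:00
DESTROY_AT = (4, 22)  # Friday 22:00


def staging_should_exist(now):
    """Return True between the Monday rebuild and the Friday teardown."""
    position = (now.weekday(), now.hour)
    return REBUILD_AT <= position < DESTROY_AT
```

Called with datetime.now(), this returns False over the weekend and True during the working week.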

Some drawbacks

While I’m quite happy with what we’ve achieved so far, there’s still room for improvement, namely:

  • Having a single staging environment for all members of the Werkzeugschmiede team means that we need to coordinate testing. This might lead to problems once our team consists of more than three engineers. Nobody likes to wait.

  • If it were possible, I would like to rebuild production from scratch, too. Wonderland has been growing over time. To be honest, I’d like to trust the production VPC more than I do now.

  • It currently takes about 90 minutes to rebuild staging. We could speed up the process, for example by prebaking AMIs for our cluster instances. Again, nobody likes to wait.