Automated testing: 5 lessons from Atlassian’s Kubernetes team on testing infrastructure as code

Today, the infrastructure world is moving faster than ever. It's easy to provision hundreds of servers in seconds and have them online and usable within minutes. With all of that power comes the ability to iterate faster – and the potential for more bugs, more incidents, and more stress.

You've felt the pain of an incident. Maybe you felt it as a customer, waiting not-so-patiently for a key service to come back online so that you can finish that urgent task. Or maybe you felt it as an engineer, scrambling to fix a problem with your own service before customers hurt your feelings on Twitter. Incidents are painful, and we all want to avoid them.

They can strike all the way down the stack – from the front end, to the service code, right down to the infrastructure supporting the application. Modern applications have moved to more service-oriented architectures, and with that comes an increased surface area for incidents coming from the infrastructure layer. Developers ideally don't have to pay attention to any of the infrastructure layers below them, but with that comes the trust that everything will "just work". When those layers don't work, everyone above them feels the pain – right down to the end-user customer.

Automated software testing handles a lot of the bug-catching responsibility for software applications. The tests aren't going to catch every issue, nor are they expected to, but they do catch a lot of obvious, silly, or overt errors.

What handles this responsibility for infrastructure?

Infrastructure is in an interesting place right now. We've got a million different tools for managing infrastructure at scale, a trend known widely as Infrastructure as Code. For most teams, the days of managing your own hardware are long gone. Now you can have a configuration file which describes your entire infrastructure setup – how many servers you need, what size they are, and how to set them up – and then have your preferred tool make that setup a reality.

In this article I'm going to talk about the other side of declarative infrastructure – not how to do it, but how to test it. Continuously.

A quick introduction

Before we get started, I think it's important to give a background on my team and the infrastructure we manage. I work on the Atlassian Kubernetes team, nicknamed KITT (Kubernetes Infrastructure Technology Team) in a reference to the hit TV show Knight Rider. Our team's mission is to become the one-stop shop for Containerisation at Atlassian, building and maintaining a platform that brings fast, cheap and resilient containers to internal developers.

Our customers are the internal developers of Atlassian. They want to run services or jobs with a variety of requirements, and we run Kubernetes clusters to support them. Currently we manage more than 22 individual Kubernetes clusters to achieve this, ranging in size from small development clusters to very large production clusters. The configuration we use for these clusters is very custom. We don't use a managed cloud provider solution like EKS or GKE because, when we started working on Kubernetes, those offerings either didn't exist or lacked the features we needed.

We deploy to all of our clusters multiple times a day. In under 15 minutes, we can deploy a merged PR out to all of these clusters via our CICD pipeline. Because of this constant influx of changes we need to be careful about picking up errors before we deploy. To address this problem we recently designed and implemented a testing pipeline to test this highly-customised Kubernetes infrastructure on each commit to master. The pipeline works well and we've caught real issues with it before they hit Atlassian developers.

Read on to find out the most important things we learned from that journey.

1. Understand what to test – and why

The use cases for the infrastructure you manage can vary heavily. You might manage one server running a very specific service, or you might manage an entire compute platform (like my team!). Most likely you manage something in between. Whatever the case, it's extremely important that what you test is representative of what you say your platform can do for your users.

If you claim your platform can do something, you should test that it can currently back that claim up.

Before you start writing your tests, you're going to want to list the core promises that your service or platform makes to its users. If you're not sure where to start, consider your service or platform's Service-Level Agreements (SLAs) and its key Service-Level Objectives (SLOs).

For example, this is what it looks like for one of my team's Kubernetes clusters:

  • Users can schedule a container to the cluster using Kubernetes YAML
  • Users can access an HTTP server running in their container via the cluster ingress ALB
  • Users can retrieve logs from their container via the company logging portal
  • Containers in the cluster can resolve cluster internal DNS names, company internal DNS names, and external DNS names
  • Containers can assume AWS IAM roles
  • Containers are denied network access to specific network resources

Then, you want to expand these core promises into some of the domain specific knowledge that you have about your system. What you'll end up with is a slightly more exhaustive list, including some system components that are a few levels of abstraction away from the use cases, but directly impact the success of those use cases. For example:

  • Can apply updates to the cluster via tooling
  • Cluster nodes are all in a Ready state
  • System containers are up and working correctly
  • Can schedule a basic container to the cluster
  • A container can resolve cluster internal, company internal, and external DNS names
  • Container logs are sent to the company logging portal
  • Container HTTP servers are accessible via cluster ingress ALB
  • Containers can assume an AWS IAM role via AWS libraries
  • Container network traffic is blocked to specific network resources
  • Metrics are being collected about containers in the system

For each of these statements you can create an automated test. These are the things that you care about.
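To make this concrete, here's a minimal sketch of what one of these checks could look like as an automated test – in this case, "a container can resolve cluster internal DNS names". It assumes the official Kubernetes Python client and a kubeconfig pointing at the test cluster; the image, namespace, and DNS name are placeholders rather than anything from our actual pipeline.

```python
# Minimal sketch of one automated check: a container can resolve
# cluster-internal DNS names. Names and image are placeholders.
import time

from kubernetes import client, config

def test_container_can_resolve_cluster_dns():
    config.load_kube_config()
    core = client.CoreV1Api()

    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(name="dns-smoke-test", namespace="default"),
        spec=client.V1PodSpec(
            restart_policy="Never",
            containers=[
                client.V1Container(
                    name="dns-check",
                    image="busybox:1.36",  # placeholder image
                    command=["nslookup", "kubernetes.default.svc.cluster.local"],
                )
            ],
        ),
    )
    core.create_namespaced_pod(namespace="default", body=pod)
    try:
        # Wait for the pod to finish; success means DNS resolution worked.
        for _ in range(60):
            phase = core.read_namespaced_pod("dns-smoke-test", "default").status.phase
            if phase in ("Succeeded", "Failed"):
                break
            time.sleep(5)
        assert phase == "Succeeded", f"DNS check pod ended in phase {phase}"
    finally:
        core.delete_namespaced_pod("dns-smoke-test", "default")
```

The same shape – create a resource, wait for an observable result, clean up – works for most of the promises in the list above.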

Scoping your tests

An important part of understanding what to test is knowing what you shouldn't be testing. If you're too specific with your tests then you'll spend a lot of time updating and maintaining them, and they probably won't be providing much extra value.

Infrastructure setups can be very complex, with lots of moving parts and interconnected systems. To be efficient with your testing, you shouldn't be looking to reproduce all of the end-to-end scenarios that the systems support. Instead, rely on the integration and end-to-end testing that is (hopefully) done for each of these components before new versions are released.

Your focus should be on enumerating the end-to-end scenarios that your system provides to its users and looking to test those. It's more important to know that there's a problem with your infrastructure than exactly which component caused it, and far simpler too. Once you know that there's a problem with the overall workflow then you can use your knowledge of the system to dig in and find the cause. The most important part is that this detection happens before it impacts a customer.

2. New tools, new thinking

Declarative infrastructure tooling provides a massive increase in efficiency compared to older styles of infrastructure management. In order to set up cloud infrastructure in the past you would need to do some (or all) of the following:

  • Procure hardware to run everything on
  • Manage your own network hardware – switches, routers, cables, and configuration files
  • Manually install an OS on at least one of your systems to give you DHCP, NTP, TFTP, and other basic necessities for managing your servers and networking hardware
  • Use some kind of distributed tool to boot your hardware into usable images (e.g. PXE), or the more commonly seen "manually install an OS on every server"
  • Use some kind of tool to configure your hardware, or the more commonly seen "write a lot of custom configuration scripts you run via SSH"
  • Install some kind of monitoring system, usually based on rudimentary health checks, to ensure your fleet of snowflakes doesn't melt

Setting up an environment like this was painful. It took a long time to build and a lot of effort to maintain. For more complicated setups you had to develop custom solutions to allow you to scale without the ops overhead becoming unbearable. Automated testing for such an environment just didn't make sense. You had to manually install and run your changes on your servers, so you would also just manually verify them. If a change was particularly complex to verify, you might also have a script to run after installation to make sure everything was okay.

Fast forward to today – things are a lot simpler. Thanks to the power of public clouds, now all you need is the configuration tool. Proper tooling can take care of every other step associated with managing your own infrastructure. Let's look at AWS as an example:

Hardware? EC2.

Network? VPC.

Operating System? AMI.

DHCP, NTP, etc? AWS provides everything out of the box.

System configuration? cloud-init.

Monitoring? CloudWatch.

You get all of this by talking to one set of APIs. Better yet, declarative infrastructure tooling allows you to define how you want your infrastructure to look and the tooling will worry about how to call the APIs for you. After running the tool you end up with the exact infrastructure setup you asked for.
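To give a sense of what "talking to one set of APIs" looks like under the hood, here's a rough boto3 sketch of launching a configured server imperatively – roughly what a declarative tool does for you once you've described the desired state. The AMI ID, instance type, and cloud-init payload are placeholders.

```python
# Rough sketch: the imperative equivalent of what declarative tooling does for you.
# AMI ID, instance type, and cloud-init script are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

cloud_init = """#cloud-config
packages:
  - nginx
"""

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # hypothetical AMI
    InstanceType="t3.micro",
    MinCount=1,
    MaxCount=1,
    UserData=cloud_init,               # system configuration via cloud-init
    TagSpecifications=[{
        "ResourceType": "instance",
        "Tags": [{"Key": "Name", "Value": "example-node"}],
    }],
)
print(response["Instances"][0]["InstanceId"])
```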

Now automated testing starts to make a lot more sense.

Where once the norm was manually installing and configuring software on potentially every server, now there is a tool handling this. The increased speed of deployment and reduced operational overhead do two things: first, they increase the velocity of your releases; and second, they free up some of the time previously spent on manual tasks.

In order to maintain confidence with this newfound velocity, you should look to spend some of this free time on developing automated, continuous tests for your infrastructure.

3. Automate your pipeline

The very first thing you need if you want to continuously test your infrastructure is a solid, automated pipeline to build and deploy your changes. You should be able to stand up a new environment from scratch, without any manual steps.

My team didn't follow this lesson in the beginning. We didn't have to do this very often and, because of it, we were inclined to do three or four small things manually when standing up our infrastructure. This won't work if you want to automate your testing properly because any manual tasks can damage the accuracy and maintainability of the testing pipeline.

While it is true that you can bend the rules to mitigate these manual tasks, mocking them out or replacing them with test-specific magic will often undermine the purpose of the tests. Every manual step that you mock out is a point of your infrastructure pipeline that can change or break without you knowing in advance.

That said, let's look at some of the things you'll need for a fully automated pipeline. This section could be an entire blog of its own so I'll keep it brief.

Everything as code

Use a tool that allows you to treat your Infrastructure as Code! I've linked to a few earlier in this blog. Treating your Infrastructure as Code (or config) allows you to focus more on composition and architecture, leaving the dependency management and creation logic to the tooling. You don't want to be managing cloud infrastructure objects yourself – let your tooling do that.

No flakes

You need your pipeline to work reliably. If your automated pipeline isn't consistent, you'll need to fix it. Developers aren't going to want to interact with your tests if they fail to stand up the infrastructure consistently. Although it's not ideal, sometimes you're going to need to put specific wait logic into your pipeline so that things aren't flaky.

For example, when building our testing pipeline we found that if we set up and tore down a cluster a few times in rapid succession, then IAM policies that were being destroyed and re-created could take up to two minutes to be available to the instances that used them. Terraform will error out completely when it tries to use these IAM policies immediately after it creates them, causing a failed test run. To fix this, we unfortunately had to put in a horrible two-minute sleep after creating those resources. It makes the tests slower, but much more reliable.
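If a fixed sleep feels too blunt for your own pipeline, the general pattern is a bounded wait-and-retry around whatever readiness check you can make. This is a generic sketch of that pattern, not our actual Terraform workaround (which really is just a sleep); the profile name and timings are made up.

```python
# Generic wait-and-retry pattern for eventually consistent resources.
# Profile name, timeout, and interval are placeholders.
import time

import boto3

def wait_until(predicate, timeout=180, interval=10):
    """Poll predicate() until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return
        time.sleep(interval)
    raise TimeoutError("resource never became available")

# Example: wait for an IAM instance profile to be visible after creating it.
iam = boto3.client("iam")

def profile_exists():
    try:
        iam.get_instance_profile(InstanceProfileName="example-profile")  # hypothetical name
        return True
    except iam.exceptions.NoSuchEntityException:
        return False

wait_until(profile_exists)
```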

Run in CICD

You want your pipeline to run via a CICD tool. This gives you a standard way to build and deploy your infrastructure, view its current state, and potentially roll back in the case of problems. You can hook up this pipeline to be triggered by whatever you want, including commits to your Infrastructure as Code repositories. This is an absolute prerequisite to being able to run automated testing of your releases.

There are a few main things you have to do to make your pipeline work well with CICD:

  • First, you should package everything up into a container so that you have an explicit environment to run your builds from. It's very frustrating to develop pipeline changes locally against a newer version of a tool, only to find that the older version running in CICD doesn't do what you want.
  • Second, make sure that you manage your CICD setup via config files and not the UI. This makes changing the build and deploy logic much more manageable, and you can keep track of all the changes you've made over time.
  • Finally, make sure that your build has access to everything it would have locally. If your build depends on external services, make sure you can authenticate to them and contact them from the CICD servers. If your build uses secret information, make sure that you can inject those secrets safely and securely into CICD – and make sure that they never get logged. This is a problem that comes up a lot more often than it should, and it can cost a lot of development time in cleanup. A minimal sketch of this kind of fail-fast secret handling follows this list.
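As a small example of that last point, a build step can fail fast when a required secret hasn't been injected, without ever printing its value. This is a hypothetical helper, not our actual pipeline code, and the secret names are just examples.

```python
# Hypothetical helper: pull required secrets from the CICD environment,
# fail fast if any are missing, and never log their values.
import os
import sys

REQUIRED_SECRETS = ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "ARTIFACT_REPO_TOKEN"]  # example names

def load_secrets():
    missing = [name for name in REQUIRED_SECRETS if not os.environ.get(name)]
    if missing:
        # Log only the names of missing secrets, never their values.
        print(f"Missing required secrets: {', '.join(missing)}", file=sys.stderr)
        sys.exit(1)
    return {name: os.environ[name] for name in REQUIRED_SECRETS}
```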

4. Mutually automated destruction

Let's say you can set up your infrastructure with one command. Can you tear it down just as easily?

Infrastructure testing is vastly different to software testing when it comes to tearing down your fixtures. With a software project, teardown might involve cleaning up some fixtures like database connections. There's usually not more than one or two levels of dependency between these fixtures, and because the logic for creating them is usually explicitly present in your code, it's quite easy to reason about how to destroy them. Oftentimes there are no fixtures at all, just carefully created mock objects, which makes things even easier.

With an infrastructure project, creating your environment involves multiple carefully ordered API calls to cloud providers to set up the skeleton of your system. Oftentimes these operations are themselves a convenient abstraction over a myriad of other API calls, managing dependencies in a way that isn't transparent to the end user. When you go to reverse these API calls later, you will often find that running the destroy operations in reverse order fails due to extra dependencies.

If you decide to fix this by mapping out exactly which calls you need to make, then you're tying yourself to your current setup. If you change something in your infrastructure, you'll also need to change it in the custom teardown code. One day you'll forget to do that, and your test build will go red even though the tests themselves passed, simply because you couldn't tear down your environment automatically.

So back to the original question: Can you repeat a build and destroy cycle over and over again without flaking or failing?

If the answer is "no," then you're going to end up frustrated. Every time you need to go into your test environment and manually fix it, you're going to lose a little faith that the tests are worth the effort you're putting in.

This is why having a good way to reliably clean up your test environment is so, so important.

Let's look at some of the ways you might go about tearing down your infrastructure. I'm going to have a particular focus on AWS here, since my team has a lot of experience with it. I have no doubt that equivalent tooling or methods would be available for any of the other public clouds.

The bad

AWS libraries

If you've never tried to tear down a populated VPC full of infrastructure via the AWS CLI before, then you might think it's simple. In the web console, if you delete the VPC it will magically delete everything inside it without complaining about object dependencies. Unfortunately, the CLI won't let you delete a VPC if anything still exists inside it – subnets, security groups, instances, or anything else you can imagine.

You might imagine that it would be feasible to script the destruction of your environment by listing and deleting objects via the AWS CLI until the VPC is empty. Well, it almost is, but with some very important caveats:

  • First, you have to know exactly what objects depend on other objects. For example, Security Groups can't be deleted until the instances using them are terminated. Volumes can't be deleted until they're detached from their instances. Rules, rules, rules.
  • Second, if you want to actually do the deletion according to these dependencies, the logic has to be represented in code that you write. As a test developer you really don't want to know about any of this – not only because it's complicated, but because there's every chance that a dependency will change or a new one will be added. Maintaining this alongside your tests sounds like a bad idea; the sketch just after this list shows why.
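Here's what that hand-rolled, dependency-ordered teardown starts to look like with boto3. It only handles a few resource types – a real environment would also have ENIs, route tables, gateways, and more – and every one of those dependencies is something you'd have to learn about and keep up to date yourself.

```python
# A taste of hand-rolled, dependency-ordered teardown with boto3.
# This only covers a few resource types; a real VPC has many more dependencies.
import boto3

ec2 = boto3.client("ec2")

def empty_and_delete_vpc(vpc_id):
    vpc_filter = [{"Name": "vpc-id", "Values": [vpc_id]}]

    # 1. Instances must go before their security groups can be deleted.
    reservations = ec2.describe_instances(Filters=vpc_filter)["Reservations"]
    instance_ids = [i["InstanceId"] for r in reservations for i in r["Instances"]]
    if instance_ids:
        ec2.terminate_instances(InstanceIds=instance_ids)
        ec2.get_waiter("instance_terminated").wait(InstanceIds=instance_ids)

    # 2. Non-default security groups.
    for sg in ec2.describe_security_groups(Filters=vpc_filter)["SecurityGroups"]:
        if sg["GroupName"] != "default":
            ec2.delete_security_group(GroupId=sg["GroupId"])

    # 3. Subnets, then finally the VPC itself.
    for subnet in ec2.describe_subnets(Filters=vpc_filter)["Subnets"]:
        ec2.delete_subnet(SubnetId=subnet["SubnetId"])
    ec2.delete_vpc(VpcId=vpc_id)
```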

Infrastructure as Code tooling

Infrastructure as Code tools like Terraform support destroying your infrastructure via the same tooling that stands it up. I talk about Terraform because we use it extensively in our stack. In the Terraform model, everything is represented in a dependency graph. Setup involves traversing the graph in one direction, creating or updating each resource's dependencies before the resources that depend on them. Destruction involves traversing the graph in the opposite direction, destroying dependent resources before their dependencies.

This works really well for managing our infrastructure at scale. I've had a really good overall experience with Terraform for this. What it doesn't do quite as well is allow for completely automated teardown of an environment, especially when that environment can break due to misconfiguration (which happens all the time in automated testing).

Terraform is a relatively sensitive tool. It tracks state within a single file called the "state file" and if this file gets out of sync or corrupted then it requires manual intervention to fix. A state file can end up like this for a variety of reasons, but interrupted builds or infrastructure changing in certain ways without those changes being made through Terraform are two of the most common for KITT.

Another shortcoming of using Terraform to reliably tear down infrastructure appears if you use its built-in lifecycle protection. Etcd is a database server that is the lifeblood of a Kubernetes cluster. It stores all of the state for a cluster and as such needs to be treated with care. To ensure that we can never accidentally destroy the AWS instances running etcd, we mark them with the prevent_destroy lifecycle flag. Unfortunately, this means that if we tell Terraform to destroy a cluster, it will fail with an error as soon as it reaches these instances. We also need to delete these instances if we want to clean up the VPC properly – leaving them around isn't an option.

Finally, Terraform destroy is a relatively slow operation. It makes sure to safely destroy everything, and verifies the destruction of a resource before it moves on to destroy objects that the resource depends on. This is great for production infrastructure, but not so great when you want fast test build turnaround.

Given these problems, we had to look elsewhere for a way to destroy our infrastructure reliably within our automated testing pipeline. Thankfully, someone out there had written a tool that helped us get around this problem.

The good

aws-nuke is a tool that can remove all resources from an AWS account. It works in a very simple, yet effective way – it will list everything in the specified AWS account and try to delete it over and over until eventually there is nothing left to list. By default it removes absolutely everything from the account, but you can supply a configuration file to tell it to ignore certain resources. This is very useful for us because it means we don't have to worry about dependencies, and it works as fast as AWS can delete the resources.
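Conceptually, the strategy looks something like the loop below. This is just an illustration of the idea, not aws-nuke's actual implementation.

```python
# Illustration of the "keep deleting until nothing is left" strategy.
# Each entry pairs a function that lists resource IDs with one that deletes by ID.
import time

def nuke(resource_handlers, max_passes=20, pause=15):
    for _ in range(max_passes):
        remaining = 0
        for list_ids, delete_by_id in resource_handlers:
            for resource_id in list_ids():
                remaining += 1
                try:
                    delete_by_id(resource_id)
                except Exception:
                    # Probably a dependency that still exists; retry on a later pass.
                    pass
        if remaining == 0:
            return
        time.sleep(pause)
    raise RuntimeError("some resources could not be deleted")
```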

Filtering is an important part of using aws-nuke effectively. To stop aws-nuke from nuking a particular type of resource you can add it to an excludes list, which blacklists the listed resource types (e.g. S3Object or IAMRole) from being destroyed. More usefully, you can specify a whitelist of resource types in a targets list, allowing you to only nuke resource types that you specify. This makes using aws-nuke a lot safer, since you can guarantee you're only nuking a subset of resources.

You can filter by resource names, tags, and other metadata too. You can specify a type of filter (the most useful is regex) and a property of the resource to filter on, for example the "Name" tag. Then you provide a regex, which if matched will include the resource for deletion. You can also invert this selection, so that if a resource does not match the regex it will be deleted. If you have good tagging or naming conventions for your infrastructure, then it should be easy to only nuke the pieces of infrastructure that you create during your build.

Even though filtering allows running aws-nuke safely inside an AWS account that has other resources in it, I still highly recommend having a separate AWS account for testing your infrastructure. Ideally this account contains nothing important, so that even if aws-nuke nuked every object in the account, nothing of value would be lost.

Not every resource type is supported by aws-nuke yet. Some of them don't have the particular filters you might want either, since these have to be implemented on an object-by-object basis. While making the KITT testing pipeline, I had to make an upstream PR to get some filtering options added for resource types I cared about. Thankfully the dev team were very open to the contribution and it was a fairly painless process.

The end result of using aws-nuke has been fantastic for our team. Out of over 300 test builds we haven't had a single one fail to tear down the infrastructure successfully.

Survivors

So now that we've established a reasonable way to tear down your test infrastructure, let's look at some of the things you might not want to tear down. There are likely some fixtures in your infrastructure setup that you don't have control over, or otherwise wouldn't want to destroy. It's important to identify and manage these exceptions to ensure that they make sense and aren't reducing the value of your tests. Let's talk about some good candidates for exclusion.

Things that take a long time

Especially if they don't add a lot of value to the testing process. For us, attaching a Virtual Private Gateway to our VPC took around 10 minutes of our total build time and didn't really add a lot of value. This made the gateway and the VPC objects great candidates to keep around between test builds. Keeping the test build running fast is important because it gives feedback to developers faster, increases change velocity, and reduces developer frustration.

Things that are hard to automate

As much as we want to automate the creation of everything, sometimes that's not feasible. Maybe another team manages something you rely on and it isn't very teardown friendly (if you're focused on building composable infrastructure you should be relying on other teams to provide things for you!). Maybe there's a component that's expensive to change or create. In these cases it might make sense to import existing resources to your build instead of creating them every time.

For example, KITT uses AWS Certificate Manager (ACM) certificates for the ELBs in our infrastructure. These certs are validated via the existence of Route53 DNS records. Because they don't change very often, and to ease their use and maintenance, we explicitly avoid destroying the DNS zone and ACM certificates between test runs. Importing them is cleaner and quicker than changing our entire infrastructure pipeline to support dynamically provisioning them.

5. Use multiple test environments

The more developers your team has, the more important it is to facilitate parallel development. With software testing you can run your unit and integration tests locally to get instant feedback. Branch builds are another way to verify that the same tests will pass in a CICD pipeline. Neither of these really has a limit on concurrency – you're usually just starting up a copy of your application and running tests against it locally. Testing infrastructure is a bit different.

Infrastructure often depends on building blocks provided by other services to function properly. These dependencies often require explicit mappings or configuration. At the time of writing, each of my team's Kubernetes clusters requires a Virtual Private Gateway connection to be provisioned by Atlassian's Core Networks team. This connection is what allows the cluster to talk to resources on the company network. My team can't automate this step, so we have to cap the parallelism of our tests to accommodate it.

These kinds of scenarios happen a lot more for infrastructure than for software. That's not a bad thing – good infrastructure design will leverage other services as much as possible. However, it does make testing the infrastructure a bit trickier. To get around this problem we implemented multiple test environments which can be built concurrently.

Each test environment is set up in advance like a real cluster. Each has IP space and a Virtual Private Gateway pre-assigned to it, and they can be built and destroyed independently of each other. The trickiest part of this multi-cluster setup is having the tests automatically choose which environment to use for a build.
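How a build claims a free environment will depend on your setup. One simple approach is a lock per environment that a build acquires before it starts and releases after teardown. The sketch below uses a DynamoDB conditional write as that lock; it's a hypothetical illustration, not necessarily how KITT does it, and the table and environment names are made up.

```python
# Hypothetical environment picker: claim the first free test environment
# using a DynamoDB conditional write as a lock. Table and env names are made up.
import boto3
from botocore.exceptions import ClientError

TEST_ENVIRONMENTS = ["test-env-1", "test-env-2", "test-env-3"]

def claim_environment(build_id):
    table = boto3.resource("dynamodb").Table("test-env-locks")
    for env in TEST_ENVIRONMENTS:
        try:
            table.put_item(
                Item={"env_name": env, "build_id": build_id},
                ConditionExpression="attribute_not_exists(env_name)",
            )
            return env  # lock acquired
        except ClientError as err:
            if err.response["Error"]["Code"] != "ConditionalCheckFailedException":
                raise
    raise RuntimeError("no free test environment available")

def release_environment(env):
    boto3.resource("dynamodb").Table("test-env-locks").delete_item(Key={"env_name": env})
```

The build then runs against whichever environment it claimed, and releases the lock as part of its teardown step.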

While running multiple environments like this consumes more resources, it has many benefits. Your developers can now run branch builds or test frequent commits to master immediately, rather than in sequence. It also helps your team scale and accommodate more developers without impacting their velocity. Reducing the amount of time waiting on tests can actually help reduce incidents because fewer changes will back up in the pipeline.

Catch real problems

At this point your basic infrastructure testing pipeline should look something like this:

  1. Manual trigger or commit trigger begins test pipeline
  2. Set up infrastructure
  3. Run tests against infrastructure
  4. Tear down infrastructure
  5. Report testing results

Each of those parts is a lot more complicated than it sounds, but the basic flow is right. A failure in setting up the infrastructure or a failure of any test should result in a failed test build, giving valuable feedback to the developer responsible for the latest changes.
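Stitched together, the whole flow can be driven by a single script in your CICD tool. Here's a hedged sketch of that orchestration; the commands, paths, and aws-nuke flags are indicative placeholders rather than our actual pipeline configuration.

```python
# Sketch of the test pipeline orchestration: set up, test, always tear down.
# Commands and paths are indicative placeholders, not our actual pipeline.
import subprocess
import sys

def run(*cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

def main():
    exit_code = 0
    try:
        # 2. Set up infrastructure
        run("terraform", "init")
        run("terraform", "apply", "-auto-approve")

        # 3. Run tests against infrastructure
        run("pytest", "tests/")
    except subprocess.CalledProcessError as err:
        exit_code = err.returncode
    finally:
        # 4. Tear down infrastructure, even if setup or the tests failed
        run("aws-nuke", "-c", "nuke-config.yml", "--no-dry-run", "--force")

    # 5. Report testing results (the CICD tool reads the exit code)
    sys.exit(exit_code)

if __name__ == "__main__":
    main()
```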

If you run this pipeline every time you commit to master, then you'll end up catching a lot of problems before they have the chance to make it to your customers. Since my team started using the new testing pipeline we've caught over seven issues using the build; issues that all slipped past PR review and manual testing. These range from small configuration mishaps, like entering the wrong name or namespace in some Kubernetes YAML, to much harder-to-diagnose problems like AMI kernel bumps causing existing configuration to misbehave. Hopefully this trend continues and the tests keep saving us the pain of deploying problematic versions of our configuration to live clusters.

We hope this information has inspired you to try out some automated testing for your own infrastructure pipelines. The old method of manually testing and verifying changes isn't fast or reliable enough to keep up with new declarative infrastructure pipelines. If you want to move fast while staying reliable, then automated testing is an incredibly valuable tool to have in your arsenal.