Five tips for CI-friendly Git repos
If you follow Atlassian, you know we're big on continuous integration ("CI") and Git. Separately, sure, but even bigger on the power the two offer in combination. Today I want to share some tips for getting your CI system to interact optimally with your repository, which is where it all begins.
One of the things you often hear about Git is that you should avoid putting large files into your repository: binaries, media files, archived artifacts, and so on. This is because once you add a file, it lives in the repo's history forever, which means every time the repo is cloned, that huge file is cloned along with it. And getting a file out of the repo's history is very tricky; it's the equivalent of performing a lobotomy on your code base. This surgical file extraction rewrites the whole history of the repo, so you no longer have a clear picture of what changes were made and when. All good reasons to avoid large files as a general rule.
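If you want to check whether your repo already has this problem, command-line Git can list the biggest blobs lurking anywhere in the history. A minimal sketch (run from inside any working copy):

```shell
# List the 10 largest blobs in the repo's entire history.
# The path shown for each blob comes from rev-list's object listing.
git rev-list --objects --all |
  git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
  awk '$1 == "blob" {print $3, $4}' |
  sort -rn |
  head -10
```

Each output line is a size in bytes followed by a file path, largest first, so offenders like stray `.jar` or video files stand out immediately.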
But keeping large files out is especially important if you are doing CI.
Each time you build, your CI server has to clone your repo into the working build directory. And if your repo is bloated with a bunch of huge artifacts, it slows that process down and increases the time your developers have to wait for build results.
Ok, fine. But what if your build depends on binaries from other projects or large artifacts? That’s a very common situation, and probably always will be. So the question is: how can we handle it effectively?
A storage system like Artifactory (whose makers offer an add-on for Bamboo), Nexus, or Archiva can help for artifacts that are generated by your team or the teams around you. The files you need can be pulled into the build directory at the beginning of your build, just like the third-party libraries you pull in via Maven or Gradle.
Now you may be thinking, "Oh, I'll just sync my big files to the build server each night so that at build time they only need to be copied across the local disk."
Even though a disk transfer is much faster than a network transfer, I actually recommend against doing this, especially if the artifacts change frequently. In between your nightly syncs, you'll end up building with stale versions of the artifacts. Plus, developers need these files for builds on their local workstations anyway. So overall, the cleanest thing to do is to make artifact download part of the build.
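In practice that can be as simple as a download step at the top of the build script. Everything below is illustrative: the host, repository path, and artifact coordinates are placeholders for whatever your Artifactory, Nexus, or Archiva instance actually serves.

```shell
# Fetch a prebuilt artifact from the artifact repository at build time,
# instead of committing it to Git. The URL and filenames are placeholders.
ARTIFACT_URL="https://artifacts.example.com/libs-release/com/example/widget/1.4.2/widget-1.4.2.jar"
mkdir -p build/libs
# --fail makes curl exit non-zero on HTTP errors, so a missing or
# misnamed artifact fails the build early and loudly.
curl --fail --silent --show-error -o build/libs/widget-1.4.2.jar "$ARTIFACT_URL"
```

If the artifact is a declared dependency rather than a raw file, letting Maven or Gradle resolve it from the same repository manager achieves the same thing with caching thrown in.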
Each time a build runs, your build server clones your repo into the current working directory. As I mentioned before, when Git clones a repo, it clones the repo’s entire history by default. So over time, this operation will naturally take longer and longer. Unless your CI system uses shallow clones.
With a shallow clone, only the current snapshot of your repo is pulled down, not the full history behind it. This can be quite useful for reducing build times, especially when working with large and/or older repositories.
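With command-line Git, this is a single flag; most CI servers expose the same thing as a checkbox or a fetch-depth setting. The URL and branch name below are placeholders:

```shell
# Clone only the tip commit of one branch instead of the full history.
git clone --depth 1 --branch main https://example.com/project.git
cd project

# If a later build step turns out to need the full history
# (a merge, a release tag, etc.), the shallow clone can be
# deepened on the fly:
git fetch --unshallow
```

`git fetch --unshallow` converts the shallow clone into a complete one, which is a handy escape hatch when only some builds need history.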
But let's say your build requires the full repo history: for example, a release build that adds a tag or updates the version in your POM, or a build that merges two branches each time it runs.
Earlier versions of Git required the entire repo history to be present in order to push changes. As of Git 1.9, simple changes to files can be pushed without the entire history present. But merging still requires the full history, because Git needs to look back and find the common ancestor of the two branches. That's going to be a problem if your build uses shallow cloning, which leads me to tip #3.
Caching a copy of the repo on your build agents, and fetching only the latest changes on each build rather than re-cloning from scratch, also makes the cloning operation much faster. Some CI servers actually do this by default.
Note that repo caching only benefits you if you are using agents that persist from build to build. If you create and destroy build agents on EC2 or another cloud provider every time a build runs, repo caching won’t matter because you’ll be working with an empty build directory and will have to pull down a full copy of the repo every time anyway.
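If your CI server doesn't cache repos for you, one way to hand-roll it on a persistent agent is a long-lived bare mirror plus `--reference`. Paths and the URL here are illustrative:

```shell
# One-time setup on the agent: keep a bare mirror as a local cache.
git clone --mirror https://example.com/project.git /var/cache/git/project.git

# At the start of each build: refresh the cache, then clone into the
# build directory, borrowing objects from the cache instead of
# re-downloading them over the network.
git --git-dir=/var/cache/git/project.git fetch --prune
git clone --reference /var/cache/git/project.git \
    https://example.com/project.git build-dir
```

One caveat: a clone made with `--reference` depends on the cache directory staying in place; add `--dissociate` if you want the build clone to copy the borrowed objects and stand on its own.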
Shallow clones plus repo caching, divided by persistent vs. elastic agents, equals an interesting web of factors. Here's a little matrix to help you get strategic about it.
It goes (almost) without saying that running CI on all your active branches is a good idea. But is it a good idea to run all builds on all branches against all commits? Probably not. Here's why.
Let's take Atlassian, for example. We have upwards of 500 developers, each pushing changes to the repo several times a day, mostly to their feature branches. That's a lot of builds. And unless you can scale your build agents instantly and infinitely, it means a lot of waiting in the queue.
One of our internal Bamboo servers houses 935 different build plans. We plugged 141 build agents into this server, and used best practices like artifact passing and test parallelization to make each build as efficient as possible. And still: building each commit was clogging up the works.
Instead of simply setting up another Bamboo instance with another 100+ agents, we stepped back and asked if this was truly necessary. And the answer was no.
We found that a good way to balance testing rigor with resource conservation is to make builds on the dev branches push-button: triggered manually by the developer rather than automatically on every commit. This is where most of the change activity happens, so it's the biggest opportunity for savings. Developers find that it fits naturally into their workflow, and they like the extra control and flexibility it gives them.
For critical branches like master and stable release branches, builds are triggered automatically by polling the repo for changes. Since we use dev branches for all our work-in-progress, the only commits coming into master should (in theory) be dev branches getting merged in. Plus, these are the code lines we release from and make our dev branches from. So it’s really important that we get timely test results against every commit.
Another option is to move away from polling altogether, and have the repo call out to your CI server when a change has been pushed and needs to be built. Typically, this is done by way of a hook in your repository.
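As a sketch, a server-side `post-receive` hook can be just a few lines of shell. The trigger URL and its query parameters below are made up for illustration; real CI servers each expose their own trigger API, so check your server's docs for the actual endpoint.

```shell
#!/bin/sh
# hooks/post-receive on the server-side repo (must be executable).
# Git feeds one "<old-sha> <new-sha> <refname>" line per updated ref
# on stdin; for each one, ping the CI server so it can start a build.
# The trigger URL is hypothetical.
while read oldrev newrev refname; do
  branch=${refname#refs/heads/}
  curl --silent --fail \
    "https://ci.example.com/trigger?repo=project&branch=$branch" \
    > /dev/null \
    || echo "CI trigger failed for $branch" >&2
done
```

Because the hook fires only when someone actually pushes, idle branches cost nothing, which is exactly the advantage discussed below.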
You can do this with whatever tooling you've got, but as it happens, we recently added an integration between Bitbucket Server and Bamboo that makes this extra set-up unnecessary. Once Bamboo and Bitbucket Server are linked on the back end, repo-driven build triggers Just Work™ right out of the box. No hooks or special configs required.
Regardless of tooling, repo-driven triggers carry the advantage of automatically fading into the sunset when the target branch goes inactive. In other words, you'll never waste your CI system's CPU cycles polling hundreds of abandoned branches. Or waste your own time manually turning off branch builds. (Though it's worth noting that Bamboo can easily be configured to ignore branches after X days of inactivity, if you still prefer polling.)
Rubber, meet road
You can implement every tip I've given here with any CI server on the market. But since we're always looking to make best practices easy to practice, we've baked them all into Bamboo so they're dead-simple to set up. Hop into the tour and check out all the goodness.