For the last 3-4 weeks the Jira team has been following a development process where we are allowed to check in…when the build is broken.
Well it turns out that it helps keep the build green.
And here is why…build latency.
I am defining build latency as the time it takes to go from checking in a change, to finding out you broke the build, to checking in another change and knowing that you have fixed the build.
4 Months Ago
About 4 months ago our development process rule was that you must not check in when the build is broken, and you must not check in and go home unless you know the build has passed.
Our unit tests and web functional tests were taking about 7 minutes and 2 1/2 hours to run respectively. Therefore, if someone did break the build, it took at a minimum 2 1/2 hours, but more like 3 1/2 hours, to fix.
You had to find out who broke things, find out why, find the person and then get a fix in. And then you had to wait another 2 1/2 hours to see if the fix worked.
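As a rough sketch of that arithmetic, here is the worst case, assuming the failure only shows up at the end of the first run and the fix needs one more full run to be confirmed. The 60 minutes for finding the culprit and getting a fix in is a made-up figure for illustration, and the function name is hypothetical. In practice the number can be lower when the failing test runs early in the suite.

```python
def worst_case_build_latency(build_minutes, fix_minutes):
    """Minutes from the breaking check-in to a confirmed green build.

    Assumes the failure is only noticed when the first run finishes, and
    that the fix needs one more full run before anyone knows it worked.
    """
    return build_minutes + fix_minutes + build_minutes


# 2 1/2 hour functional test run, plus a hypothetical 60 minutes to find
# out who broke things, find out why, and get a fix checked in.
latency = worst_case_build_latency(build_minutes=150, fix_minutes=60)
print(f"{latency} minutes ({latency / 60:.1f} hours)")
```

The build time is counted twice, which is why shortening the build pays off more than shortening the fix.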
Because of this build feedback latency, we as a team ended up checking in in a very synchronized way (say at 10.00am and then at 3.00pm if one wanted to go home by 5.30pm). And because in a team of 12 people someone is bound to break something at least once a day, we could check in changes at most once, maybe twice, a day.
Because you are only checking in once or twice a day, the changes tend to be larger and hence more likely to break the build. It becomes a bit of a vicious cycle of broken builds.
During this period the average time from breaking the build to it going green again was about 2-4 hours, but sometimes it was days.
People would then break the rules and check in when the build was broken, but this would often just confuse the situation and extend the build latency.
2 Months Ago
The long build times had been bubbling away for a long time and during dev speed week, Chris M and others did the work to split the build into a series of batches.
So instead of running 2000 tests in serial over 2 1/2 hours, we split them into 15 batches and ran them in parallel, using the Bamboo Elastic Cloud support to create 15 remote agents.
This resulted in build run times of between 12 and 40 minutes. (Splitting the tests into equal-sized batches doesn’t mean equal run times, it seems.)
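One way to even out the batch run times, if per-test durations are available from a previous run, is to assign tests by duration rather than by count. The sketch below uses a greedy longest-first assignment. It is not how our batches were actually built, and the test names, durations, and function name are all hypothetical.

```python
import heapq


def balance_batches(durations, batch_count):
    """Greedy longest-first assignment of tests to batches.

    durations: dict of test name -> run time in seconds (hypothetical data).
    Returns one (total_seconds, [test names]) tuple per batch.
    """
    # Min-heap keyed on the batch's total time so far. The batch index
    # breaks ties, so the lists of names are never compared.
    heap = [(0.0, index, []) for index in range(batch_count)]
    heapq.heapify(heap)
    longest_first = sorted(durations.items(), key=lambda kv: kv[1], reverse=True)
    for name, seconds in longest_first:
        # Always give the next-longest test to the currently lightest batch.
        total, index, tests = heapq.heappop(heap)
        tests.append(name)
        heapq.heappush(heap, (total + seconds, index, tests))
    return [(total, tests) for total, _, tests in sorted(heap, key=lambda b: b[1])]


durations = {"TestA": 600, "TestB": 300, "TestC": 200,
             "TestD": 100, "TestE": 30, "TestF": 20}
batches = balance_batches(durations, batch_count=2)
for total, tests in batches:
    print(total, tests)
# 630.0 ['TestA', 'TestE']
# 620.0 ['TestB', 'TestC', 'TestD', 'TestF']
```

Splitting the same six tests three and three in listed order would give batches of 1100 and 150 seconds, so the slowest batch, which is what sets the build time, would take nearly twice as long.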
This reduced the build latency to a more manageable level. Now you found out more quickly that the build was broken, but it was still taking about 1 to 1 1/2 hours to fix a broken build, and the others in the team could not check in during that time.
1 Month Ago
So about a month ago we dropped the rule against checking in when the build is broken.
However, there is a series of caveats.
- When one of the batches breaks, we as a team have to work out who broke it.
We have 15 batches of builds and hence you may have broken 1 or more batches.
- That person is then responsible for fixing the build batch(es) in question.
- The other team members are free to check in while this happens.
- If the build stays broken, or gets more broken (e.g. more and different tests start breaking), then a freeze is put in place to sort it out.
- Jed was appointed build champion to keep an eye on the situation and to chase up people.
The end result is that the build does not stay broken for long and the rest of the team can continue to make small changes to the code base.
Small changes are less likely to break the build and hence it tends to stay green for longer. It's more of a positive feedback system.
We also added another little feature into the functional test framework that also helps.
We keep track of the tests that have failed ourselves instead of relying on JUnit/Bamboo test results. At the end of each test case run, we output a log message that tells us the state of the system.
It looks a bit like this:
===FTC Finished : TestSystemFieldLiterals.testType #2164 of 2361 (91.66%) : Errors 1 (0.042%) : Run time 4.59 seconds : Suite time 12656.18 seconds : HTTP Count 40 : HTTP Time 2732ms : HTTP 100th 1637ms : HTTP 90th 88ms : HTTP 50th 18ms : HTTP Ave 0.07 ms/request : Max Mem 365428736 : Total Mem 365428736 : Free Mem 125712536 1 test(s) failing so far : TestIssueDreams.testIssueLevelSecurity
You can tell by looking at the Bamboo live logs if the build is going to fail. If the log tells you that one or more tests have failed, you can start working on the fix and check it in before the batch build finishes, and hence reduce the build latency.
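If you would rather not eyeball the live log, a line like the one above can be picked apart with a couple of regular expressions. This is only a sketch based on that single sample line, shortened here. The function name is hypothetical, and how several failing tests are separated in the real output is an assumption (commas and/or spaces).

```python
import re

PROGRESS = re.compile(r"===FTC Finished : (\S+) #(\d+) of (\d+)")
FAILING = re.compile(r"(\d+) test\(s\) failing so far : (.*)$")


def parse_ftc_line(line):
    """Return progress and failing tests from one '===FTC Finished' line.

    Returns None if the line is not an FTC summary line.
    """
    progress = PROGRESS.search(line)
    if progress is None:
        return None
    failing = FAILING.search(line)
    # Assumption: multiple failing tests are separated by commas and/or spaces.
    names = re.split(r"[,\s]+", failing.group(2).strip()) if failing else []
    return {
        "test": progress.group(1),
        "done": int(progress.group(2)),
        "total": int(progress.group(3)),
        "failing": [name for name in names if name],
    }


# Shortened version of the sample log line.
sample = ("===FTC Finished : TestSystemFieldLiterals.testType #2164 of 2361 (91.66%)"
          " : Errors 1 (0.042%) : Run time 4.59 seconds : Free Mem 125712536"
          " 1 test(s) failing so far : TestIssueDreams.testIssueLevelSecurity")
result = parse_ftc_line(sample)
print(result["done"], "of", result["total"], "failing:", result["failing"])
# 2164 of 2361 failing: ['TestIssueDreams.testIssueLevelSecurity']
```

A script tailing the live log could raise the alarm the first time the failing list is non-empty, well before the batch finishes.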
Conventional thinking is that a team should not check in while the build is broken. The thinking goes that you will not be able to keep track of the changes your team makes otherwise.
We have found that you can make changes while the build system is broken, within reasonable limits.
And the longer your build latency, the greater the payback of this type of development process.