Jez Humble, author of Continuous Delivery and one of its founding fathers, has an informal survey he likes to give to audiences. It starts with a simple question: "Raise your hand if you do continuous integration." A sea of hands always rises. Then he says, "Put them down if all of the developers on your team don't check into the main code line at least once a day." A large number of hands usually fall. Then he probes: "Put your hands down unless every check-in triggers a build plus a unit test run." More hands go down. And finally: "Put your hands down if the build isn't fixed within ten minutes of going red." Humble reports that more often than not, only a few hands remain raised.
He's trying to illustrate the point that unless your organization has a stable, reliable CI environment and a culture of utilizing it effectively, moving toward CD will prove to be a painful waste of time and resources. So step one of CD is to ensure your implementation of CI is as stable and reliable as you assume.
Continuous delivery requires that the software release infrastructure be used… well… continuously. If the CI infrastructure upon which your CD efforts are built is rushed or thrown together, you run the risk that parts of it will fail. Unlike in traditional release models, if that infrastructure breaks, the impact reaches every part of the company. In a more waterfall release model, there is time and space for the crashy CI server to be rebooted, for artifacts to be moved around because the file server filled up again, or for release engineers to log onto every slave and install that package a developer installed on only one slave during a debugging session. In CD, such problems bust the pipeline for everyone, from developers committing to customers getting the new packages. And such breakages become incredibly obvious to the entire company.
Masters and slaves
The first step to determining the reliability of your CI infrastructure is to look at its constituent parts: the master CI server and the slaves that do the work. Assuming you're building your CD pipelines in-house, answering these questions will give you insight into the current state of your CI world.
For the master:
- Is the machine configuration under automated configuration management (CFEngine, Chef, Puppet, etc.) so it can be rebuilt from scratch in a totally automated fashion? Can this process run self-contained, or does it require downloading bits (plugins, etc.) from external websites? How long would a rebuild take?
- Is the data contained within the CI master available elsewhere? Are items such as build logs (especially for shipped builds), artifacts, test results, and other build metadata recoverable, or would data as fundamental as build numbers be lost if the master disappeared?
- Is there access control to the system configuration for the master server, or can anyone log in and change anything? How about the individual job configurations? When changes are made, is there an audit trail? Are stakeholders notified?
- Are job configurations in version control or specified within the CI tool itself? Are job configuration changes tracked anywhere?
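If job definitions currently live only in the CI tool's own database, a lightweight first step is to snapshot them into version control and diff on a schedule. Here is a minimal sketch in Python, assuming you can already export each job's configuration as text (the job names and configuration contents are hypothetical):

```python
import hashlib


def snapshot_jobs(jobs: dict[str, str]) -> dict[str, str]:
    """Map each job name to a hash of its exported configuration text."""
    return {name: hashlib.sha256(cfg.encode()).hexdigest()
            for name, cfg in jobs.items()}


def diff_snapshots(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """Report jobs added, removed, or changed since the last snapshot."""
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "changed": sorted(n for n in set(old) & set(new) if old[n] != new[n]),
    }
```

Commit the snapshot alongside the exported configs; a nonempty "changed" list with no corresponding commit means someone edited a job directly in the tool's UI, which is exactly the audit trail the questions above are probing for.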
For the slaves:
- Are the slaves under automated configuration management so they can be re-created in an automated fashion for all supported platforms? (Oddly, I often see Linux slave configuration automated, but Windows and Mac slaves are left in various degrees of manual configuration.)
- Who has login access to the slaves? Are slaves considered "dirty" if logged into, and if so, what happens to that slave afterward?
These may seem like simplistic questions, but it is still incredibly common for jobs to fail on certain slaves, yet work on others. Or for developers to demand login access and make configuration changes, which never get re-incorporated into the build environment process for the build/release teams or even developers' local machines.
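Slave drift of this kind can be surfaced mechanically rather than discovered through mysteriously failing jobs. A minimal sketch, assuming each slave can report a package-to-version manifest (the slave names and packages below are made up):

```python
def find_drift(manifests: dict[str, dict[str, str]]) -> dict[str, dict[str, str]]:
    """Given {slave: {package: version}}, return the packages whose presence
    or version differs across slaves, with the per-slave versions."""
    all_packages: set[str] = set()
    for pkgs in manifests.values():
        all_packages.update(pkgs)

    drift = {}
    for pkg in sorted(all_packages):
        # "<missing>" marks slaves where the package is absent entirely.
        versions = {slave: pkgs.get(pkg, "<missing>")
                    for slave, pkgs in manifests.items()}
        if len(set(versions.values())) > 1:
            drift[pkg] = versions
    return drift
```

Running a check like this after every configuration-management run turns "works on slave A, fails on slave B" from a debugging session into a one-line report.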
A general question that applies to both the master and the slaves: who is responsible for maintenance of the machines that make up your CI infrastructure and backups of critical data? Oftentimes, this is a function served by a separate IT team. If so, beware of conflicting requirements. A client of mine once had a frustrating problem where the CI server suffered intermittent failures during the day. It turned out a developer had been "helpfully" taking the server down in his free moments to make backups (killing every running build in the process).
In another situation, the QA and release teams were banding together to burn the midnight oil on a huge release. Just as they started the release process, the CI infrastructure went down. Turns out the IT team was adhering to their published backup and maintenance schedule for the CI servers, but no one had thought to communicate the critical release's schedule. The escalation chain to get the backup process halted and the CI systems back up at 2 am so the company could meet its early morning deadline left everyone in a bad mood. Moral of the stories: no matter who serves these functions in your organization, they need to be part of the communication loop with the CI tool administrators and its users.
Despite sounding absurd, a good gauge of your CI infrastructure's state is knowing the answer to "If we decided to completely switch CI tools, how long would it take us to move all of the configuration, and could we recover all of the metadata, logs, artifacts, and other information from past builds we care about?" If the answer is "a long time" and "no," then there is work to do. (And you'd be surprised how often this actually happens in practice, with changing team members and opinions and new tools.)
Users of cloud-based CI services may assume they needn't worry about any of these issues, but it is common as a company grows or changes for CI functionality to be moved to other cloud infrastructure, perhaps to take advantage of bulk pricing or move to a private cloud. Sometimes, demands for faster builds may necessitate bringing CI infrastructure back in-house or to bare metal… in which case, you're effectively "switching CI tools," and must take all the above into consideration. It pays to know how the cloud-based CI agents are configured.
Baby steps for the QA and release teams
As the teams responsible for your CI infrastructure do their reliability and stability reviews and address any issues, that's a good time for other teams who will help the CD journey to start looking at the state of their worlds.
Since fully automated (unit and integration) testing and quality assessment is a requirement for CD to provide any business value – otherwise, you're just shipping garbage quickly – QA teams can start assessing the work ahead of them. The best methodology I've seen tackles this on two fronts. On one, start acculturating the teams to the necessity of writing an automated unit or integration test for each and every filed defect, beginning with your most critical components. This will start chipping away at the uncertainty and regression risk from bugs that have already been found.
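The "one test per filed defect" habit can look like this sketch, in which the function under test and the defect number are both hypothetical stand-ins:

```python
import unittest


def parse_price(text: str) -> float:
    """Hypothetical function that once crashed on inputs with a
    currency symbol and thousands separator."""
    return float(text.lstrip("$").replace(",", ""))


class TestIssue4217Regression(unittest.TestCase):
    """Pins the fix for (hypothetical) defect #4217:
    parse_price("$1,299.00") raised ValueError."""

    def test_currency_symbol_and_thousands_separator(self):
        self.assertEqual(parse_price("$1,299.00"), 1299.0)

    def test_plain_number_still_works(self):
        self.assertEqual(parse_price("42.50"), 42.5)
```

Naming the test class after the defect ID keeps the link between the bug tracker and the test suite obvious, so a future failure points straight back at the original report.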
On the other front, integration and functional tests can be written that test not for aberrant behaviors, but for the intended ones. Many of these tests may currently be manual or require human intervention, so the focus is on making them fully automated in such a way that no humans are involved in the execution of the tests. In many cases, this requires the QA team to evaluate and communicate test environment requirements to the team responsible for the CI infrastructure or if they operate their own test infrastructure, go through the above operational exercises as well. For certain types of software, dedicating time to the creation of automated "fuzz tests" can also prove very beneficial and provide a lot of value in a CD context.
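A fuzz test in this spirit need not be elaborate: generate random inputs and assert invariants rather than exact outputs. A sketch, with a hypothetical normalize_whitespace function standing in for the code under test:

```python
import random
import string


def normalize_whitespace(text: str) -> str:
    """Hypothetical function under test: collapse all runs of
    whitespace into single spaces and trim the ends."""
    return " ".join(text.split())


def fuzz_normalize(iterations: int = 1000, seed: int = 0) -> None:
    """Throw random strings at the function and check invariants."""
    rng = random.Random(seed)  # fixed seed keeps failures reproducible
    for _ in range(iterations):
        s = "".join(rng.choice(string.printable)
                    for _ in range(rng.randrange(0, 200)))
        out = normalize_whitespace(s)
        # Invariant 1: no crash (reaching this line is the check).
        # Invariant 2: no doubled spaces, no leading/trailing whitespace.
        assert "  " not in out
        assert out == out.strip()
        # Invariant 3: idempotent -- normalizing twice changes nothing.
        assert normalize_whitespace(out) == out
```

Seeding the random generator matters: a fuzz failure in CI is only useful if it can be replayed on a developer's machine.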
When starting to analyze what work may need to be done on your software's build process, looking at the endpoints of the build/release process is particularly useful:
- Is the coupling between the source control system and the CI infrastructure, the root of the CD pipeline, stable? Sounds like a silly question, but in the world of cloud-based source control and many tiny Git repos instead of one monolithic repo, I run into all sorts of failure modes where a commit doesn't reliably kick off a build when one would be expected. In addition to the source control system, are packaged dependencies managed reliably?
- On the other end of the build process, is the final packaging produced entirely consumable, by either customers or the deployment automation? Do those artifacts reside in a place from which deployments to a production environment can easily be automated, or where customers can get at them? What is the artifact retention and management story for those builds?
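The commit-to-build coupling in particular lends itself to a periodic reconciliation check. A sketch, assuming you can list recent commits from the source control system and recent builds (each with the commit it was built from) from the CI tool; the data shapes here are illustrative, not any particular tool's API:

```python
def missing_builds(commits: list[str], builds: dict[str, str]) -> list[str]:
    """Reconcile the VCS log against the CI server.

    commits: recent commit SHAs, oldest first.
    builds:  build id -> commit SHA the build was triggered from.
    Returns the commits that never triggered a build.
    """
    built_shas = set(builds.values())
    return [sha for sha in commits if sha not in built_shas]
```

Run on a schedule, a nonempty result is an early warning that a webhook was dropped or a trigger misconfigured, rather than a mystery discovered days later.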
As a fan of Law and Order, I often dub these questions, collectively, the "Build Chain of Evidence." Environments still exist today where there is no clear way to figure out what commit(s) went into a particular artifact, where the test data that illustrated a critical regression is, or what build that data relates to, and no one can tell whether or not a particular artifact is important and should be kept. A CD pipeline relies on this chain containing all of the important (meta-)data and (obviously) on that "chain of custody" not breaking at any point within the pipeline.
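In code, one link in such a chain can be as simple as a record per build that points backward at commits and forward at artifacts and test results. A sketch with hypothetical field choices; a real system would persist these records somewhere durable, outside the CI master:

```python
from dataclasses import dataclass


@dataclass
class BuildRecord:
    """One link in the 'Build Chain of Evidence': enough to trace an
    artifact back to its sources and forward to its test results."""
    build_number: int
    commits: list       # VCS revisions that went into this build
    artifacts: list     # paths/URLs of the packages produced
    test_results: dict  # suite name -> pass/fail summary
    shipped: bool = False


def commits_for_artifact(records: list, artifact: str):
    """Walk the chain backward: which commits produced a given artifact?"""
    for rec in records:
        if artifact in rec.artifacts:
            return rec.commits
    return None  # the chain of custody is broken for this artifact
```

If a query like commits_for_artifact cannot be answered for an artifact you shipped, that is precisely the broken chain of custody described above.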
While all of these suggestions might sound like good practice, you might ask yourself whether investing so much effort in CI is worth it. "We want CD! Why spend time on CI?!" The answer: because they build on each other, and that old analogy about foundations is relevant when building your "CD house."
That is why investment in making your organization's CI infrastructure rock-solid will not only pay dividends as you work toward CD, but is actually a requirement if you are to build a CD pipeline that won't spring leaks and burst open in times of increased pressure and development flow. Once you have a good CI foundation built, you can start looking at the next steps to move toward CD in your organization. And building that foundation already knowing that it is indeed the foundation – not the house itself – puts you and your team at an advantage.