How to measure, use, and improve DevOps metrics
Like other elements of the DevOps lifecycle, a culture of continuous improvement applies to DevOps metrics. The ability to receive fast feedback at each phase of development, coupled with the skill and authority to act on that feedback, is a hallmark of high-performing teams. In the DevOps book “Accelerate”, the authors note that the four core metrics listed above are supported by 24 capabilities that high-performing software teams adopt. We cover most of these capabilities below (CI/CD, test automation, working in small batches, monitoring, and continuous learning), but it is worth reading “Accelerate” for a deeper dive into the research that supports these practices.
Lead time for changes
High-performing teams typically measure lead times in hours, whereas medium- and low-performing teams measure lead times in days, weeks, or even months.
Test automation, trunk-based development, and working in small batches are key to improving lead time. These practices give developers fast feedback on the quality of the code they commit so they can identify and remediate any defects. Long lead times are almost guaranteed if developers work on large changes that live on separate branches and rely on manual testing for quality control.
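As a rough sketch of how lead time for changes can be measured, the snippet below computes the median hours between commit and deploy. The timestamp pairs are hypothetical; in practice they would come from your version control and CI/CD tooling.

```python
from datetime import datetime
from statistics import median

# Hypothetical (commit_time, deploy_time) pairs; real data would be
# pulled from your VCS history and deployment logs.
changes = [
    (datetime(2024, 1, 8, 9, 0), datetime(2024, 1, 8, 11, 30)),
    (datetime(2024, 1, 8, 10, 15), datetime(2024, 1, 8, 14, 0)),
    (datetime(2024, 1, 9, 8, 45), datetime(2024, 1, 9, 9, 30)),
]

def median_lead_time_hours(pairs):
    """Median elapsed time from commit to deploy, in hours."""
    lead_times = [(deploy - commit).total_seconds() / 3600
                  for commit, deploy in pairs]
    return median(lead_times)

print(f"Median lead time: {median_lead_time_hours(changes):.2f} hours")
```

The median is used rather than the mean so that one unusually slow change does not dominate the metric.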
Change failure rate
High-performing teams have change failure rates in the 0-15 percent range.
The same practices that enable shorter lead times — test automation, trunk-based development, and working in small batches — correlate with a reduction in change failure rates. All these practices make defects much easier to identify and remediate.
Tracking and reporting on change failure rates isn’t only important for identifying and fixing bugs; it also helps ensure that new code releases meet security requirements.
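A minimal sketch of the calculation itself: change failure rate is simply the share of deployments that led to a failure (a rollback, hotfix, or incident). The per-deployment flags below are hypothetical stand-ins for records from your deployment and incident-tracking systems.

```python
def change_failure_rate(deployment_failed_flags):
    """Percentage of deployments that caused a failure in production.

    Each entry is True if that deployment required remediation
    (rollback, hotfix, incident), False otherwise.
    """
    if not deployment_failed_flags:
        return 0.0
    failures = sum(1 for failed in deployment_failed_flags if failed)
    return 100.0 * failures / len(deployment_failed_flags)

# Hypothetical record: 10 deployments, 1 of which failed.
recent_deploys = [False, False, True, False, False,
                  False, False, False, False, False]
print(f"Change failure rate: {change_failure_rate(recent_deploys):.1f}%")
```

A result of 10% would place this team inside the 0-15 percent range cited above.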
Deployment frequency
High-performing teams can deploy changes on demand, and often do so many times a day. Lower-performing teams are often limited to deploying weekly or monthly.
The ability to deploy on demand requires an automated deployment pipeline that incorporates the automated testing and feedback mechanisms referenced in the previous sections, and minimizes the need for human intervention.
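Deployment frequency itself is straightforward to compute once deployments are logged. The sketch below averages deployments per calendar day over an observed window, using hypothetical dates in place of real pipeline records.

```python
from datetime import date

def deploys_per_day(deploy_dates):
    """Average deployments per calendar day across the observed window."""
    span_days = (max(deploy_dates) - min(deploy_dates)).days + 1
    return len(deploy_dates) / span_days

# Hypothetical deployment log: 6 deploys across 3 days.
deploys = [
    date(2024, 1, 8), date(2024, 1, 8), date(2024, 1, 8),
    date(2024, 1, 9), date(2024, 1, 9),
    date(2024, 1, 10),
]
print(f"Deployment frequency: {deploys_per_day(deploys):.1f} per day")
```

Averaging over the whole window (rather than only days with deploys) keeps quiet days from flattering the metric.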
Mean time to recovery
High-performing teams recover from system failures quickly — usually in less than an hour — whereas lower-performing teams may take up to a week to recover from a failure.
The ability to recover quickly from a failure depends on the ability to quickly identify when a failure occurs, then deploy a fix or roll back the changes that caused it. This is usually done by continuously monitoring system health and alerting operations staff in the event of a failure. The operations staff must have the necessary processes, tools, and permissions to resolve incidents.
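To make the metric concrete, here is a minimal sketch of an MTTR calculation: the mean minutes between when a failure was detected and when service was restored. The incident timestamps are hypothetical; real ones would come from your monitoring and incident-management tooling.

```python
from datetime import datetime

def mttr_minutes(incidents):
    """Mean time to recovery in minutes.

    Each incident is a (detected_at, resolved_at) timestamp pair.
    """
    durations = [(resolved - detected).total_seconds() / 60
                 for detected, resolved in incidents]
    return sum(durations) / len(durations)

# Hypothetical incidents: one resolved in 30 minutes, one in 90.
incidents = [
    (datetime(2024, 1, 8, 14, 0), datetime(2024, 1, 8, 14, 30)),
    (datetime(2024, 1, 9, 3, 15), datetime(2024, 1, 9, 4, 45)),
]
print(f"MTTR: {mttr_minutes(incidents):.0f} minutes")
```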
The focus on MTTR is a shift away from the historical practice of focusing on mean time between failures (MTBF). It reflects the increased complexity of modern applications and, with it, an increased expectation of failure. It also reinforces the practice of continuous learning and improvement. Instead of waiting until a deployment is “perfect” to avoid any failure (and thus resetting the old MTBF scoreboard), teams continuously deploy. Instead of placing blame for ruining a “perfect” MTBF record, MTTR encourages blameless retrospectives that help teams improve their upstream processes and tooling.