Incident management in the age of DevOps

Applying principles of open, blameless communication to incident management teams

You can’t rethink how you build, deploy, and operate software without rethinking how you respond to incidents.

In their seminal 2009 talk, “10+ Deploys Per Day: Dev and Ops Cooperation at Flickr,” John Allspaw and Paul Hammond sketched out a vision of a world where developers and IT ops teams work together and ship often. Over the next decade, that vision took shape as the DevOps movement.

DevOps depends on new ways of responding to incidents, so it’s not surprising that incident management got so much attention in Allspaw and Hammond’s talk.

“The important thing to realize is that failure is going to happen,” Hammond said in the talk. “It’s not a question of if, it’s a question of when.”

Unlike ITIL and similar frameworks, DevOps has no “official” document of best practices. But we can generally agree that, at its core, DevOps is about delivering business value to an organization by breaking down organizational silos, increasing transparency, and fostering open communication between developers and IT operations teams.

That same culture of transparency, visibility, and rapid learning extends to incident management.

Why? Because the first and most critical steps in incident management involve understanding what's gone wrong, getting the right people working on the problem, and fostering a blameless culture.

DevOps incident management calls for a culture of open, blameless communication between developers and IT ops teams. It also calls for establishing lightweight processes that improve the reliability of IT services, increase customer satisfaction, and drive business value.

ITIL, by comparison, is a prescribed set of 26 processes, procedures, tasks, and checklists designed to improve specific practices in IT service management. ITIL focuses on service quality and consistency and improving the resilience of systems.

One of the benefits of ITIL is that organizations that want to improve ITSM can begin with templated best practices instead of starting from scratch. And while some believe ITIL is best suited for large enterprises, the framework is flexible enough that smaller companies can pick and choose the processes that make sense for their business and still find value.

One downside to ITIL—if you're in a hurry to make changes to your incident response process—is that it can involve formal change management and an expert consultant, delaying improvements.

For teams who want to get started right away, the DevOps incident management approach will help them come together and realize benefits immediately.

The DevOps incident management process

The DevOps approach to managing incidents isn’t radically different from the traditional steps to effective incident management. What it adds is an explicit emphasis on involving developer teams from the beginning, including in the on-call rotation, and on assigning work based on expertise, not job titles.

1. Detection  
Instead of hoping incidents never happen (because they undoubtedly will), DevOps incident response teams place a high value on preparedness. They work collaboratively to plan their responses to potential incidents by identifying weaknesses in systems. They set up monitoring tools, alert systems, and runbooks that help each member know who to contact when an incident is detected and what to do next.

2. Response  
Rather than making a single on-call engineer responsible for responding to every incident during an on-call shift, DevOps incident management teams designate multiple team members to be available for escalations. If the designated on-call engineer can't resolve an incident independently, there's a runbook ready to act as a guide. The on-call engineer can bring in the right people to assess the impact and severity level of the problem and escalate it to the right responders.

3. Resolution  
When it comes time to resolve an incident, DevOps incident management teams can often do so quickly. As a whole, they're more familiar with the application or system code because they wrote it. With the benefit of advance preparation and good communication systems, they can work together to resolve the incident faster than a third-party response team looking at the code for the first time.

4. Analysis  
DevOps incident management teams close out an incident with a blameless postmortem process. They come together to share information, metrics, and lessons learned, with the goal of continuously improving the resilience of their systems and resolving future incidents quickly and efficiently.

5. Readiness  
Once an incident is resolved, all remediation steps have been completed, and the system is restored, DevOps incident management teams take a step back to assess their readiness for the next incident. They take what they learned in their postmortem process, update their runbooks, and make any necessary adjustments to monitoring tools and alert systems. The DevOps focus on continuous improvement applies to the people and team, not just the technology. After an incident, each team member is better prepared for the next one.
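
The escalation path described in the Response step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Responder` and `EscalationPolicy` names, the expertise tags, and the fallback rule (page the whole escalation pool when nobody matches) are hypothetical and do not come from any particular on-call tool.

```python
from dataclasses import dataclass, field


@dataclass
class Responder:
    name: str
    expertise: set[str]  # hypothetical tags such as "database" or "web"


@dataclass
class EscalationPolicy:
    primary: Responder  # the designated on-call engineer
    escalation_pool: list[Responder] = field(default_factory=list)

    def responders_for(self, affected_area: str, primary_can_resolve: bool) -> list[Responder]:
        """Return who should be working the incident."""
        if primary_can_resolve:
            return [self.primary]
        # Escalate based on expertise, not job title.
        experts = [r for r in self.escalation_pool if affected_area in r.expertise]
        # Assumed fallback: if nobody matches, bring in the whole pool.
        return [self.primary] + (experts or list(self.escalation_pool))


policy = EscalationPolicy(
    primary=Responder("on-call engineer", {"api"}),
    escalation_pool=[
        Responder("database specialist", {"database"}),
        Responder("frontend developer", {"web"}),
    ],
)

escalated = policy.responders_for("database", primary_can_resolve=False)
print([r.name for r in escalated])
```

In this sketch the on-call engineer stays on the incident after escalating, which matches the idea that they bring in the right people rather than hand the problem off.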

Best practices for effective DevOps incident management teams

Adopting a DevOps approach to incident response can lead to improved communication between development and IT operations teams, faster incident response and remediation, and a more resilient system.

Automate processes and workflows
Integrate your service desk, monitoring, ticketing, and chat tools to streamline IT incident alerts and workflows, so the right people are notified with the information they need to start on a resolution. Set up runbooks with pre-defined workflows so people can hit the ground running when an incident hits.

Communicate between teams
Ensure members of your teams can communicate across the organization with real-time chat tools. Use tools that create a record of the incident so anyone can jump in at any time and get up to speed on what's happened and what's being done.

Use the blameless approach
After you've resolved the incident, come together as a team for a blameless postmortem session to review what happened. Avoid finger-pointing and focus on sharing information that helps everyone do their jobs better and contributes to a more reliable system.

Identify and focus on the business bottom line
DevOps incident response is more than a means to better communication; it's a way to ensure developers and operations are working together to deliver real business value. Track metrics such as mean time to detection (MTTD), mean time to repair (MTTR), and mean time between failures (MTBF) to understand your team's rate of improvement.
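
As a rough illustration, the three metrics can be computed from incident timestamps. The sketch below assumes one common set of definitions, and teams do define these differently: MTTD is measured from the start of an incident to its detection, MTTR from detection to resolution, and MTBF as the average gap between one incident's resolution and the next incident's start. The `Incident` record and the sample data are made up for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    started: datetime   # when the failure began
    detected: datetime  # when monitoring or a person noticed it
    resolved: datetime  # when service was restored


def _mean_delta(deltas):
    """Average a non-empty list of timedeltas."""
    return timedelta(seconds=mean(d.total_seconds() for d in deltas))


def mttd(incidents):
    return _mean_delta([i.detected - i.started for i in incidents])


def mttr(incidents):
    return _mean_delta([i.resolved - i.detected for i in incidents])


def mtbf(incidents):
    """Needs at least two incidents; otherwise there is no gap to average."""
    ordered = sorted(incidents, key=lambda i: i.started)
    gaps = [nxt.started - prev.resolved for prev, nxt in zip(ordered, ordered[1:])]
    return _mean_delta(gaps)


# Made-up sample data for illustration only.
incidents = [
    Incident(datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 10), datetime(2024, 1, 1, 11, 0)),
    Incident(datetime(2024, 1, 3, 10, 0), datetime(2024, 1, 3, 10, 20), datetime(2024, 1, 3, 10, 50)),
]

print("MTTD:", mttd(incidents))  # 0:15:00
print("MTTR:", mttr(incidents))  # 0:40:00
print("MTBF:", mtbf(incidents))  # 1 day, 23:00:00
```

Comparing these numbers quarter over quarter, rather than reading any single value in isolation, is what shows a rate of improvement.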

Utilize on-call scheduling to position developers and sysadmins as SREs
On DevOps teams, the lines between developer and sysadmin start to blur, and those responding to the incident often become site reliability engineers (SREs). Still, individuals will likely have specialized knowledge either in the application code or the infrastructure code. Set up your on-call schedule to ensure you've got the right mix of expertise available to respond to incidents.
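
One way to check for the right mix of expertise is to scan the schedule for shifts that leave an area uncovered. The sketch below is a minimal example, assuming a hypothetical schedule format (shift name mapped to responders and their expertise areas) and two required areas, "application" and "infrastructure". Real scheduling tools will have their own data models.

```python
# Assumed requirement: every shift needs both kinds of knowledge.
REQUIRED = {"application", "infrastructure"}


def coverage_gaps(schedule: dict[str, dict[str, set[str]]], required: set[str] = REQUIRED) -> dict[str, set[str]]:
    """Map each under-covered shift to the expertise areas it is missing."""
    gaps = {}
    for shift, responders in schedule.items():
        # Combine the expertise of everyone on the shift.
        covered = set().union(*responders.values())
        missing = required - covered
        if missing:
            gaps[shift] = missing
    return gaps


# Hypothetical two-week rotation.
schedule = {
    "week 1": {"dev A": {"application"}, "sysadmin B": {"infrastructure"}},
    "week 2": {"dev C": {"application"}, "dev D": {"application"}},
}

print(coverage_gaps(schedule))  # {'week 2': {'infrastructure'}}
```

Here week 2 is flagged because both responders know the application code and nobody on the shift covers infrastructure.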
