Incident management for high-velocity teams
Escalation policies for effective incident management
When an incident strikes, the best-case scenario is that your on-call engineer or SRE can resolve it quickly and on their own.
Of course, in the real world, that isn’t always the case. Sometimes resolution calls for a larger team, specialized knowledge, or more senior skills. That is why any organization with more than two tech professionals needs a plan and policy for incident escalation.
What is incident escalation?
Incident escalation is what happens when an employee can’t resolve an incident themselves and needs to hand off the task to a more experienced or specialized employee.
What is an escalation policy?
An escalation policy answers the question of how your organization handles these handoffs. It outlines who should be notified when an incident alert comes in, who an incident should escalate to if the first responder isn’t available, and who should take over if the responder can’t resolve the issue on their own. It also covers how those handoffs should happen: through the service desk, directly from one technician to another, or through an incident management tool.
At first glance, these questions seem simple, but the larger your organization and the more complex your tech ecosystem, the more detail the answers call for. For example, who should be notified when an incident alert comes in may depend not only on who is on call or available, but also on the severity level and duration of the incident.
For some companies, a single on-call person may be the first notified no matter the incident severity. For others, it might make sense to alert a junior developer if the incident is a SEV 3 and notify a more senior person or specialized team if it’s SEV 1.
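As a rough illustration, severity-based notification can be modeled as a simple lookup. This is a minimal sketch, not any particular tool's configuration: the role names and the SEV 2 mapping are assumptions made up for the example.

```python
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1  # most severe
    SEV2 = 2
    SEV3 = 3  # least severe


# Hypothetical routing table: which role is notified first at each severity.
FIRST_RESPONDER_BY_SEVERITY = {
    Severity.SEV1: "senior-on-call",
    Severity.SEV2: "primary-on-call",
    Severity.SEV3: "junior-on-call",
}


def first_responder(severity: Severity) -> str:
    """Return the role that should be notified first for this severity."""
    return FIRST_RESPONDER_BY_SEVERITY[severity]
```

A company that notifies a single on-call person regardless of severity would simply map every level to the same role.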
Similarly, some companies may rely on their first responder to escalate an incident when needed. Others may trigger an automatic escalation to a more senior developer or specialized team if an incident exceeds a certain amount of time or starts impacting a higher number of systems or users.
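An automatic trigger of this kind might be expressed as a pair of thresholds. The sketch below assumes illustrative limits (30 minutes, 1,000 affected users) and hypothetical field names; your own policy would supply the real values.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class IncidentState:
    open_for: timedelta   # how long the incident has been open
    affected_users: int   # current estimate of impacted users


# Illustrative thresholds only; set these from your own policy.
MAX_OPEN = timedelta(minutes=30)
MAX_AFFECTED_USERS = 1000


def should_auto_escalate(state: IncidentState) -> bool:
    """Escalate when the incident runs too long or its impact grows too wide."""
    return state.open_for > MAX_OPEN or state.affected_users > MAX_AFFECTED_USERS
```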
An escalation policy should address not only how your company will escalate incidents and to whom, but also how that process varies with the type of incident, its SEV level, its duration, and its scope.
Incident escalation processes
For companies following ITSM best practices, the service desk is typically at the center of incident escalation. If the first responder can’t resolve an incident, they circle back to the service desk, which escalates the issue to the appropriate next line of defense. Using Jira Service Management, responders can escalate incidents within the incident ticket. Responders have access to workflows to guide the resolution process and can enact automation or customize actions as needed. Designating a severity level can direct responders to the appropriate workflow.
Other companies, like Google, put an SRE in charge of incidents, and that person is responsible for any necessary escalation (as well as for freezing new releases if an incident pushes the team over the acceptable downtime threshold set by their SLA/SLO).
For still other companies, the first responder may be a developer or an incident manager, or there may be multiple first points of contact (especially when an alert comes in for a high-severity incident), and escalation may happen through predefined processes directly within and between those teams.
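The downtime threshold mentioned here is often treated as an error budget: the downtime an availability target allows over a given window. The sketch below shows that arithmetic under assumed numbers (a 99.9% target over 30 days); the function names are hypothetical and this is not a description of Google's tooling.

```python
from datetime import timedelta


def error_budget(slo_target: float, window: timedelta) -> timedelta:
    """Downtime allowed in the window for an availability target such as 0.999."""
    return window * (1.0 - slo_target)


def should_freeze_releases(downtime: timedelta, slo_target: float,
                           window: timedelta) -> bool:
    """True once accumulated downtime has used up the whole budget."""
    return downtime > error_budget(slo_target, window)
```

With a 99.9% target over 30 days, the budget works out to roughly 43 minutes, so 50 minutes of downtime would call for a freeze while 30 minutes would not.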
Whether the process goes through the service desk, is facilitated by an SRE, or happens automatically within your incident tracking systems, there are typically three paths escalation policies follow.
Hierarchical escalation is when an incident is passed to a team or person based on their experience level or seniority within the organization.
For example, the first responder on-call might be a junior developer new to the team. If they can’t resolve an issue, in a hierarchical organization, they pass that issue to a more senior developer. If that developer also can’t resolve the issue, they pass it on again, and so on up the line until the issue is resolved.
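A hierarchical path like this can be sketched as an ordered chain, where each handoff moves one step up. The role names below are placeholders, not a recommended org structure.

```python
from typing import Optional

# Hypothetical chain, ordered from least to most senior.
ESCALATION_CHAIN = [
    "junior-developer",
    "senior-developer",
    "staff-engineer",
    "engineering-manager",
]


def next_in_chain(current: str) -> Optional[str]:
    """Return the next, more senior role, or None at the top of the chain."""
    idx = ESCALATION_CHAIN.index(current)  # raises ValueError for unknown roles
    if idx + 1 < len(ESCALATION_CHAIN):
        return ESCALATION_CHAIN[idx + 1]
    return None
```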
Functional escalation is when an incident is passed to a team or person best equipped to resolve it based on their skills or systems knowledge, not their seniority.
For example, the first responder on-call may be a junior developer from a team that focuses on the back end of product X. If they discover that the core problem appears to be coming from an integration with product Y, they may escalate the incident to another junior developer on the product Y team.
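Functional escalation, by contrast, routes on ownership rather than rank. A minimal sketch, assuming a hypothetical mapping from the component at fault to the team that owns it, with the service desk as a fallback:

```python
# Hypothetical ownership map: component at fault -> team that knows it best.
OWNING_TEAM = {
    "product-x-backend": "product-x-team",
    "product-y-integration": "product-y-team",
}


def functional_escalation_target(component: str,
                                 default: str = "service-desk") -> str:
    """Pick the team that owns the failing component, whatever their seniority."""
    return OWNING_TEAM.get(component, default)
```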
Automatic escalation is when a system, rather than a person, passes an incident along based on predefined rules. For example, teams working with a platform like Opsgenie can set up rules that tell the system to automatically escalate an incident if the primary on-call person doesn’t acknowledge or close an alert.
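The logic behind acknowledgement-based rules can be sketched in plain Python. This is a conceptual model only, not the Opsgenie API: the step names and delays are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class EscalationStep:
    responder: str
    notify_after: timedelta  # delay since alert creation, if still unacknowledged


# Hypothetical rules: who is notified, and how long the alert may sit first.
STEPS = [
    EscalationStep("primary-on-call", timedelta(minutes=0)),
    EscalationStep("secondary-on-call", timedelta(minutes=10)),
    EscalationStep("team-lead", timedelta(minutes=30)),
]


def responders_to_notify(created_at: datetime, now: datetime,
                         acknowledged: bool) -> List[str]:
    """List everyone who should have been notified by now for an open alert."""
    if acknowledged:
        return []  # acknowledgement stops further escalation
    elapsed = now - created_at
    return [step.responder for step in STEPS if elapsed >= step.notify_after]
```

In this model, an alert that has gone 12 minutes without acknowledgement has reached both the primary and the secondary on-call person, but not yet the team lead.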
Some teams may favor one escalation method over another, but they aren’t mutually exclusive, and many teams use a mix of hierarchical, functional, and automatic escalation.
The escalation matrix
An escalation matrix is a document or system that defines when escalation should happen and who should handle incidents at each escalation level.
The term is used across a number of industries. Human resources may have an escalation matrix for internal issues. Call centers may have an escalation matrix for customer service issues. And IT and DevOps teams may have one or more matrices that help engineers know how and when to escalate an incident.
The level of detail in a matrix varies greatly from company to company. Some organizations might have a simple hierarchical chart, with each developer escalating to one with a higher skill level as needed. Other organizations might have situation-specific matrices that tell developers which teams to contact for different types of incidents or different severity levels. As with most things in incident management, there is no one-size-fits-all answer for how to develop your organization’s matrix.
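A situation-specific matrix could be represented as a table keyed by incident type and severity, with an ordered list of contacts per escalation level. The entries below are invented for illustration, and clamping to the last contact is one possible design choice among several.

```python
from typing import Dict, List, Tuple

# Hypothetical matrix: (incident type, SEV level) -> contacts by escalation level.
ESCALATION_MATRIX: Dict[Tuple[str, int], List[str]] = {
    ("database", 1): ["dba-on-call", "database-lead", "head-of-infrastructure"],
    ("database", 3): ["dba-on-call"],
    ("payments", 1): ["payments-on-call", "payments-lead"],
}


def contact_for(incident_type: str, severity: int, level: int) -> str:
    """Return the contact at a 1-based escalation level for this situation.

    Levels beyond the end of the list fall back to the last (highest) contact.
    """
    if level < 1:
        raise ValueError("escalation level starts at 1")
    contacts = ESCALATION_MATRIX[(incident_type, severity)]  # KeyError if undefined
    return contacts[min(level, len(contacts)) - 1]
```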
Good practices for developing an escalation policy
Treat your escalation policy as guidelines—not a hard and fast set of rules
Technology isn’t static and neither are your teams. Google suggests that if your SRE thinks a specific case calls for a different escalation strategy, give them the freedom to make that judgement call. The point here isn’t to create inflexible rules, but to create guidelines that apply in most situations.
Audit your on-call schedule regularly
Are there any gaps in the schedule? Do you have the right people on call? Do you have the right people on your second and third tiers of on-call? Your on-call schedules and escalation policy should work together for faster incident management.
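Part of that audit can be automated. The sketch below looks for uncovered time in a schedule, assuming shifts are available as simple (start, end) pairs; how you export them from your scheduling tool is left open.

```python
from datetime import datetime
from typing import List, Tuple

Shift = Tuple[datetime, datetime]  # (start, end) of one on-call shift


def find_gaps(shifts: List[Shift], start: datetime, end: datetime) -> List[Shift]:
    """Return the periods between start and end that no shift covers."""
    gaps: List[Shift] = []
    cursor = start  # everything before the cursor is known to be covered
    for shift_start, shift_end in sorted(shifts):
        if shift_start > cursor:
            gaps.append((cursor, min(shift_start, end)))
        cursor = max(cursor, shift_end)
        if cursor >= end:
            break
    if cursor < end:
        gaps.append((cursor, end))
    return gaps
```

Running the same check separately for your second and third tiers shows whether the backup layers are as well covered as the first.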
Set smart thresholds for escalation
Not every incident is created equal, which means not every incident can or should follow the same escalation policy.
For minor incidents, you may not want to alert the on-call engineer until working hours. For major incidents, you probably need that engineer no matter what time of day it is. In the case of multiple incidents, your engineer will need to know what to tackle first and/or if they should escalate one incident to a second engineer immediately.
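One way to express the first two rules is a function that decides when the on-call engineer should be notified. This sketch assumes working hours of 09:00 to 17:00, Monday to Friday, and treats SEV 1 and SEV 2 as major; all of these are placeholder choices, and time zones and holidays are ignored.

```python
from datetime import datetime, time, timedelta

# Assumed working hours; adjust to your team.
WORK_START = time(9, 0)
WORK_END = time(17, 0)


def notify_at(raised_at: datetime, severity: int) -> datetime:
    """Return when the on-call engineer should be notified about an incident."""
    if severity <= 2:
        return raised_at  # major incidents page immediately, day or night
    in_hours = raised_at.weekday() < 5 and WORK_START <= raised_at.time() < WORK_END
    if in_hours:
        return raised_at
    # Minor incident outside working hours: hold until the next working morning.
    candidate = raised_at
    if raised_at.time() >= WORK_START:
        candidate += timedelta(days=1)  # today's start has already passed
    candidate = candidate.replace(hour=WORK_START.hour, minute=WORK_START.minute,
                                  second=0, microsecond=0)
    while candidate.weekday() >= 5:  # skip Saturday and Sunday
        candidate += timedelta(days=1)
    return candidate
```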
There’s a balancing act here. On one side, you need your systems to have maximum uptime and to meet their SLA promises and SLO goals. On the other, you need to make sure engineers aren’t burnt out, overworked, sleep-deprived, or subject to alert fatigue.
Set clear processes for escalation
Should the escalating developer contact the appropriate team or person directly, or do they need to go through the help desk? Is there a system in place that the developer should use? How will you track escalation? What responsibilities does the first responder have to make sure the incident is picked up by the next person?
These questions should be clearly addressed by your policy and clearly communicated to all on-call developers in order to keep escalations running smoothly and resolve incidents faster.
Learn more about how Jira Service Management can enrich your incident management practice by offering a collaborative solution to incident escalation to reach faster resolutions.