Incident management for high-velocity teams
A manager’s guide to improving on-call
Just as emergency rooms require on-call schedules for doctors to handle health emergencies, DevOps teams need them to efficiently respond to software and system issues that impact performance, deployment, and availability.
But developing an on-call practice is easier said than done. Being on-call can be a daunting and disruptive experience for employees. Finding the right balance of coverage, scalability, and quality of life for the team is an ongoing challenge.
As best practices shift and companies grow, the most agile, high-velocity teams are implementing—and succeeding with—new approaches.
You built it, you maintain it
As recently as ten years ago, responding to IT incidents was the primary job of operations teams. Organizations typically had a tiered team structure (Level 1, Level 2, Level 3), with higher skill levels and higher pay at the higher tiers.
The goal in adopting this structure was to reduce operations cost. Usually, Level 1 would involve entry-level employees. If Level 1 couldn’t resolve an issue, they’d escalate it to Level 2—made up of more senior (and therefore, more expensive) people. And this process went on until the issue was resolved.
But as always-on services became the norm, interdependencies between systems and customer expectations for uptime both increased. These days, a slow response can cost the company more in reputation, customer satisfaction, and lost revenue than bringing senior-level developers into incidents earlier.
The result of this changing technology landscape is that the structure of response teams needed to change. Enter the DevOps movement and the concept of “you built it, you maintain it.”
The idea here is a simple one: the developer who is most familiar with the code is the best person to troubleshoot related issues in the shortest amount of time. This logic is why, with DevOps, it’s now common for developers to be on-call, ensuring the code runs well and lowering the MTTA and MTTR of incidents.
The added benefit of this approach is more rigorous testing before deployment. Now that the developer in charge of the code could be alerted during off-hours, there’s a deeper sense of ownership, an added incentive to double- and triple-check the code. The result, more and more companies are finding, is more reliable and resilient systems.
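For readers who want to track the two metrics named above, here is a minimal Python sketch. It assumes MTTA means mean time to acknowledge and takes MTTR as mean time to resolve (one of several common expansions), both measured from the moment an incident is opened. The `Incident` record and its field names are hypothetical and not tied to any particular tool.

```python
from datetime import datetime, timedelta
from typing import List, NamedTuple


class Incident(NamedTuple):
    # Hypothetical record: three timestamps per incident.
    opened: datetime
    acknowledged: datetime
    resolved: datetime


def mean_delta(deltas: List[timedelta]) -> timedelta:
    """Average a non-empty list of durations."""
    return sum(deltas, timedelta()) / len(deltas)


def mtta(incidents: List[Incident]) -> timedelta:
    """Mean time from opening an incident to its acknowledgement."""
    return mean_delta([i.acknowledged - i.opened for i in incidents])


def mttr(incidents: List[Incident]) -> timedelta:
    """Mean time from opening an incident to its resolution (assumed definition)."""
    return mean_delta([i.resolved - i.opened for i in incidents])
```

Comparing these averages before and after a change to the on-call practice is one rough way to see whether response is actually getting faster.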
Building an on-call practice teams won’t hate
On-call gets a bad rap—and sometimes with good reason. Unbalanced on-call programs can have a negative effect on work-life balance, health, and sleep. Employees with bad on-call experiences or no on-call experience may imagine their social lives and work-life balance evaporating before their eyes.
But the truth is that on-call doesn’t have to be a somber march toward lower quality of life. By balancing on-call duties, taking team preferences into account, and putting robust systems in place to prevent and reduce incidents and on-call alerts wherever possible, you can create a practice that minimizes and shares the burden across your teams.
For management, succeeding at this means being transparent with teams up front, providing plenty of training, setting fair expectations for on-call and development duties, developing robust processes, and constantly checking in and making improvements with the input and buy-in of the teams themselves.
Being transparent with your teams
Transparency is the key to successful communication. Clarifying expectations around availability is a must when rolling out an on-call system or a change to an existing on-call system. Make sure you think through and clearly answer common employee questions such as:
- Will engineers be on-call overnight?
- If on-call overnight, is there flexibility to work from home the next day? Can an on-call employee start later the next day if they need to catch up on sleep?
- Are developers responsible for doing development work during on-call time?
- How many times per month will a developer be on call? What’s the maximum number of times a single person would be on call?
- How will you compensate on-call employees?
Providing proper training
Best practices for training on-call teams include:
- Developing a training program that addresses both process and common issues
- Providing up-to-date runbooks
- Having new employees shadow experienced on-call engineers
- Giving employees access to past incident reports so they can see how past incidents similar to the one they’re dealing with were successfully resolved
It’s also a good idea to have multiple escalation channels. The typical best practice is to put junior engineers on the primary on-call rotation and schedule senior engineers as a backup or secondary rotation. This helps junior engineers develop the required on-call skills while avoiding panic when there’s an issue beyond their expertise.
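The primary and secondary rotation idea can be sketched as a small escalation policy. This is an illustrative Python sketch only: the rotation names, the wait times, and the `rotation_to_page` helper are assumptions made for the example, not the configuration format of any specific incident management product.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EscalationStep:
    rotation: str       # hypothetical rotation name
    wait_minutes: int   # time this rotation has to acknowledge before escalating


# Assumed policy: junior engineers are paged first, senior engineers are the backup.
POLICY = [
    EscalationStep(rotation="primary-junior", wait_minutes=10),
    EscalationStep(rotation="secondary-senior", wait_minutes=15),
]


def rotation_to_page(minutes_unacknowledged: int,
                     policy: List[EscalationStep]) -> Optional[str]:
    """Return the rotation that should currently hold the page.

    Returns None once every step has timed out without an acknowledgement.
    """
    elapsed = 0
    for step in policy:
        elapsed += step.wait_minutes
        if minutes_unacknowledged < elapsed:
            return step.rotation
    return None
```

With these assumed values, an alert that is still unacknowledged after 10 minutes moves from the junior rotation to the senior one, so a junior engineer always has a defined backup.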
Keep on-call and development duties separate
Having development duties during on-call usually means lots of context switching and interruptions, especially for companies with frequent incidents and on-call requirements.
This all usually means less development efficiency and more stress for the on-call engineers and can lead to burnout, alert fatigue, and job dissatisfaction. It can also have a negative effect on development sprints, since it’s difficult to estimate how much an on-call person can and will contribute to any given sprint.
This is why, as a best practice, we recommend keeping on-call duties and development duties separate. When on-call employees have free time, they can work on improving on-call-related documentation and automation to eventually improve the sustainability of systems and services.
Fine-tune your on-call process
A healthy on-call system can only exist if it’s constantly improved by fine-tuning processes and systems. Customize on-call schedules, routing rules, and escalation policies with an incident management solution like Jira Service Management to handle alerts efficiently. Toward this goal, we recommend:
- Evaluating alert priority and urgency and setting systems based on that. Low urgency alerts can wait until morning, letting on-call employees get some much-needed sleep.
- Reducing false positives by classifying alerts based on factors such as root cause, originating system, message, thresholds, etc. This helps differentiate actionable alerts from the rest.
- De-duplicating related alerts to avoid alert fatigue.
- Designing rich alerts that clearly describe an issue and empower the on-call engineers to make effective decisions and apply the knowledge recorded in runbooks.
- Providing alert reports and metrics to on-call teams so that weak areas in systems can be identified and improved. (In other words: don’t let on-call teams get bogged down by the same issues over and over again.)
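Two of the recommendations above, de-duplicating related alerts and letting low-urgency alerts wait until morning, can be illustrated with a short Python sketch. Everything here is an assumption made for the example: the `Alert` fields, the two-level urgency scheme, the 22:00 to 07:00 quiet hours, and the returned action names. Real tools expose their own rules for this.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Alert:
    source: str    # originating system
    message: str
    urgency: str   # "high" or "low" (assumed two-level scheme)


def deduplicate(alerts: List[Alert]) -> List[Tuple[Alert, int]]:
    """Collapse alerts sharing a source and message into one entry with a count."""
    counts: Dict[Tuple[str, str], int] = {}
    first_seen: Dict[Tuple[str, str], Alert] = {}
    for alert in alerts:
        key = (alert.source, alert.message)
        counts[key] = counts.get(key, 0) + 1
        first_seen.setdefault(key, alert)
    # Dicts keep insertion order, so results follow first occurrence.
    return [(first_seen[key], counts[key]) for key in first_seen]


def route(alert: Alert, hour: int) -> str:
    """Page for high urgency; hold low-urgency alerts during assumed quiet hours."""
    overnight = hour >= 22 or hour < 7
    if alert.urgency == "high":
        return "page"
    return "queue-until-morning" if overnight else "notify"
```

The design choice worth noting is that de-duplication keeps a count rather than discarding repeats, so the on-call engineer still sees how noisy the underlying issue is.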
Review on-call reports—and adjust as needed
To keep things fair and avoid employee burnout, managers should review on-call-related reporting to see:
- How often each team member is paged or woken up
- How long each team member has been on call
- The hourly and daily distributions of on-call duty for each person
Then adjust schedules as needed to distribute work fairly.
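As a rough illustration of this kind of report, the Python sketch below counts total and overnight pages per engineer from a page log. The log format (engineer name plus the time the page fired) and the 22:00 to 07:00 overnight window are assumptions for the example. A real report would come from your incident management tool’s own data.

```python
from collections import Counter
from datetime import datetime
from typing import Dict, List, Tuple

# Hypothetical page log entry: (engineer, time the page fired).
Page = Tuple[str, datetime]


def page_report(pages: List[Page],
                night_start: int = 22,
                night_end: int = 7) -> Dict[str, Dict[str, int]]:
    """Count total pages and overnight pages for each engineer."""
    total: Counter = Counter()
    overnight: Counter = Counter()
    for engineer, fired_at in pages:
        total[engineer] += 1
        if fired_at.hour >= night_start or fired_at.hour < night_end:
            overnight[engineer] += 1
    return {name: {"pages": total[name], "overnight": overnight[name]}
            for name in total}
```

A large gap between engineers in either column is a signal that the schedule may need rebalancing.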
Listen to your employees
Management should organize regular all-hands meetings with the on-call engineers to discuss problems, complaints, and areas of weakness—and then take action to resolve the issues.
On-call systems, tools, processes, people, documentation, and training aren’t static things you set and forget. As the company grows, teams learn and change, and incidents shift over time, management should be constantly reevaluating and improving their on-call programs.
The people best equipped to tell you what is and isn’t working are the on-call engineers. Listen to them. Implement changes. And, most importantly, make sure management isn’t the sole decision-maker when it comes to on-call organization and protocol. The more you empower teams to improve their own processes and practices, the more they’ll embrace on-call.
Developing a friendly on-call culture
On-call engineers carry a huge responsibility for the success of companies. So it’s no surprise that stress and tension are common problems, especially during major issues with unknown causes.
The on-call culture set by senior on-call engineers and management teams defines how people deal with that stress and tension and how they feel about being on call.
For both the on-call engineers’ sakes and the on-call culture of the company, management teams should pay attention to developing a friendly on-call culture and make it clear that the goal should always be to find the problems, risks, and weaknesses in systems and solve them.
At Atlassian, this means not only constantly improving our on-call systems, but also conducting blameless postmortems where the focus is on improvement—not finding someone to blame.
Discover Jira Service Management, a solution that supports positive on-call culture, and build a system with enhanced communication capabilities, centralized alerting, flexible automation, and advanced reporting to take incident response to the next level.