The role of on-call teams in an always-on world
Pros and cons of different approaches to on-call management
The world depends on always-on services more than ever before. An outage can affect millions of people, with real impact: they can’t pay their bills, they can’t book their flights, they can’t video call with their friends.
And whether you’re dealing with a major bug, capacity issues, or a complete outage, customers who depend on your services expect an immediate response. (The same is true for internal teams.)
Incidents can have a real impact not only in dollar terms — they cost businesses $700 billion per year in North America alone — but also on the reputation of your company, your product, and your team.
With so much at stake, organizations have turned to putting IT and developer teams on call to make sure the right people are available to address a problem during an incident, no matter when one occurs.
A fair on-call schedule, coupled with an on-call compensation plan, can even foster a culture of shared responsibility and help your teams learn more about what it takes to make resilient software and services, making for a better overall product and fewer outages.
What is on call?
On call is the practice of designating specific people to be available at specific times to respond in the event of an urgent service issue, even though they are not formally on duty.
On call is a critical responsibility inside many IT, developer, support, and operations teams that run services where customers expect 24/7 availability. Team members take turns staffing an on-call rotation, either providing coverage around the clock or only outside of normal business hours. Along with automated monitoring and alerting solutions, the on-call engineer is empowered to respond immediately to any interruptions to service availability.
The rising importance of on call for IT and software teams
Sometimes on-call work gets a bad rap. Some veteran IT workers have horror stories about working on teams that were stretched too thin and didn't get the support they needed to properly respond to incidents.
A lot of that anxiety can be alleviated if on-call support is done right. With an effective on-call plan, you can ensure your team scales to match expanding services, provides consistent coverage for critical IT functions, and responds promptly to incidents.
There are more benefits to a good on-call management plan than just getting through downtime. With each failure, teams get the opportunity to learn new skills, like understanding a critical service a little better, seeing how it responds to failure, and knowing how to design for fewer failures or improve the incident response plan.
And having a good on-call program built on a culture of shared responsibility can also lead to improved camaraderie and less burnout, which in turn can mean higher employee retention.
Pros and cons of being on call
In organizations that practice DevOps, software teams are taking a lot of the responsibility for the reliability and availability of the services they build, a job that used to be the exclusive domain of operations teams. For many of these teams “you build it, you run it” is the new motto. Being most familiar with the code, developers are often the ones who can best troubleshoot issues in the shortest amount of time.
And, through this process, developers build better software that is actually less likely to fail. With this shift in responsibility, they test their code more rigorously since they may in fact be the one brought in during off hours if the service has issues.
The result is more resilient systems and, with more people available and capable to take on incidents, fewer burned out workers.
Without a good on-call program, organizations will fail to realize all the cultural benefits of DevOps or meet the demands of a scaling infrastructure. If one team bears more of the burden of responding to incidents than another, it won’t have the capacity to do its day job well. Developers won’t get to implement the feedback that comes from incidents, and incident responders won’t have the capacity to fortify their systems.
If the responsibilities are lopsided, those people slated for the on-call schedule are never really able to detach from work and can easily succumb to burnout.
But a plan that takes into consideration the org’s true coverage requirements, balances the time burden across the developer and IT ops teams, and captures data for continuous improvement can lead to benefits all around. It will not only lead to a better service for customers, it can also help employees improve their skills and their product and actually look forward to putting in on-call hours.
How to improve on-call developer roles
“I can’t wait to spend my evening overseeing this deployment and responding to potential outages!” —said no engineer, ever.
With more developers taking on the role of maintaining the services they build, it’s important to make sure they are prepared for their on-call responsibilities, and the best time to assess this is during the hiring process.
Now, it’s no secret that there is intense competition for top engineering talent. And not everyone is motivated by money alone, so throwing more pay at devs for after-hours work may not close the deal (more about on-call compensation later on). Software engineers in the interview process will naturally have questions about how often they’ll need to take time out of their personal lives and be on the on-call schedule.
Demonstrating that you have a documented on-call plan that spreads responsibilities out fairly across a competent team of developers and SREs can go a long way in reassuring new recruits that your organization has its on-call management under control. With a documented plan you can be completely transparent in the interview process and make sure candidates are ready for the commitment to on-call work.
Five simple ways to make on call more developer friendly
- Clearly define the on-call responsibilities
Responsibilities during on call should be clearly defined. This helps prevent burnout, confusion, and frustration. We suggest documenting your incident response process and expectations for what it means to be on call.
- Make sure alerts are being assigned to the right person
Getting your alerting tooling dialed in shouldn’t be overlooked. A clear alerting flow with the right notifications and overrides can avoid a lot of headaches.
- Have primary and secondary responders
Life doesn’t stop just because someone is on call. Just like an unexpected personal emergency can take a developer offline during the work day, the same can happen when they’re on call. Putting a backup in place limits the potential damage from this kind of interruption.
- Fine-tune your schedules
Teams are not static, and your on-call schedule shouldn’t be either. We recommend a culture of continuously reviewing, adjusting, and improving your on-call practices.
- Make sure they have access to and familiarity with all the relevant diagnostic tools
Every team varies in the tools it uses to track operational health, application performance, resource utilization, and so on. Make sure your on-call engineers are familiar with the tools used and have proper access to them.
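As a rough illustration of the primary and secondary responder idea from the list above, here is a minimal Python sketch of a weekly rotation. The team names, the start date, the weekly cadence, and the `build_rotation` helper are all hypothetical; real scheduling tools also handle overrides, time zones, and holidays, which this sketch ignores.

```python
from datetime import date, timedelta

def build_rotation(engineers, start, weeks):
    """Return one (week_start, primary, secondary) tuple per week.

    The secondary is simply the next person in the list, so everyone
    serves as backup the week before they become primary.
    """
    if len(engineers) < 2:
        raise ValueError("need at least two engineers for a primary and a backup")
    schedule = []
    for week in range(weeks):
        primary = engineers[week % len(engineers)]
        secondary = engineers[(week + 1) % len(engineers)]
        schedule.append((start + timedelta(weeks=week), primary, secondary))
    return schedule

# Hypothetical team and start date, for illustration only.
team = ["ana", "bo", "chen"]
rotation = build_rotation(team, date(2024, 1, 1), weeks=4)
for week_start, primary, secondary in rotation:
    print(week_start.isoformat(), "primary:", primary, "secondary:", secondary)
```

Because the rotation is derived from a plain list, reviewing and adjusting the schedule as the team changes can be as simple as editing that list and regenerating it.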
How to improve on call for IT support and service roles
It isn’t just developers spending more time on call. Increasingly for IT support and IT service teams, around-the-clock support is critical to helping the business function.
These teams face a lot of the same challenges as developers on call: stress, burnout, unclear roles and responsibilities, access to tooling.
IT teams have the added stress of often being in the same building as their customers, who can slow things down with a flood of interruptions (email, Slack, even in person) about the incident.
Here are a few tactics to help keep IT incidents manageable:
- Prompt and transparent communication: Proactively communicating IT incidents shows you care – and that you're in control.
- Keep track of what counts: Most IT service teams are using some form of service desk software. It’s critical that you aren’t just using free-form data entry fields to capture the details of each ticket.
- Put a monitoring system in place: Historically, many IT Ops teams would personally monitor performance dashboards to keep an eye out for outages. Do the team a favor and let monitoring and alerting tools handle this.
A good on-call compensation plan rewards your employees for their expertise and time spent working after hours. If employees feel well cared for, they will, in turn, care about the business and contribute to its success.
According to the U.S. Fair Labor Standards Act (FLSA), a federal law that sets minimum wage, overtime, and minimum age requirements for employers and employees, an employee who is on call but free to do as they wish with their time is considered “waiting to be engaged” and therefore isn’t working.
If someone has their free time restricted and can’t do as they wish during their off hours, according to the FLSA this on-call time may be considered “hours worked” and be eligible for compensation.
Your local laws may vary, so be sure to consult an expert. From there, aim for an on-call compensation plan that’s competitive and fair and that supports a culture of shared responsibility.
Different types of on-call compensation plans
1. Incentivized on call
Incentivized on-call compensation plans reward employees who raise their hands to work on-call hours in exchange for extra days off, flexible hours, higher base salaries, or some combination of these things.
The advantage to this approach to on-call compensation is an increased sense of ownership over the services, which can lead to more resilient systems.
And giving ample time off and paying competitively also lets employees know their work is valued and appreciated, preventing burnout and reducing turnover.
2. Paid on call for scheduled overtime
Paid on-call compensation means employees are directly compensated for the time they spend on call or scheduled to work, even if no issues arise during their shift.
The obvious advantage of this model of on-call compensation is the tangible incentive. Knowing you are getting paid for carrying a pager (or, more likely, a laptop and a cell phone) makes it easier to justify the burden of being on call and available, even if no issues arise.
3. Paid on call for the time spent on the issues
Another approach to on-call compensation is paying employees only when they work on an incident. Some ways to calculate this are:
- Total amount paid for working on call
- Hourly rate for time spent working on alerts/issues
- Rate for the number of alerts and issues worked
The advantage to this model is that employees are paid for the extra work they put in outside of normal business hours. A potential drawback is that there is a financial disincentive to reducing alerts and issues, which could compromise the overall integrity of the systems.
4. Paid on call for scheduled overtime and time spent on the issues
This is a combination of the two previous models. Some companies pay both for being on the on-call schedule and an additional amount for alerts received and issues worked. The upside to this on-call compensation model is that employees feel well compensated for the extra time and effort that the organization asks of them. Additionally, if someone gets stuck with a particularly difficult issue that eats into their personal time, they’re financially compensated for their sacrifice. But again, consider whether it makes sense in your company culture to create an indirect reward for having bugs in the software.
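To make the difference between the paid models concrete, here is a minimal Python sketch that computes pay for one shift under models 2, 3, and 4. All rates, field names, and function names are hypothetical and chosen only for illustration; model 1 is left out because its rewards (time off, flexible hours, base salary) are not calculated per shift.

```python
from dataclasses import dataclass

@dataclass
class Shift:
    hours_on_call: float       # scheduled hours carrying the pager
    hours_on_incidents: float  # hours actually spent working issues
    alerts_handled: int        # number of alerts and issues worked

def standby_pay(shift, standby_rate):
    """Model 2: paid for scheduled on-call time, even if nothing happens."""
    return shift.hours_on_call * standby_rate

def incident_pay(shift, incident_rate, per_alert=0.0):
    """Model 3: paid only for time on issues, plus an optional per-alert rate."""
    return shift.hours_on_incidents * incident_rate + shift.alerts_handled * per_alert

def combined_pay(shift, standby_rate, incident_rate, per_alert=0.0):
    """Model 4: standby pay plus incident pay."""
    return standby_pay(shift, standby_rate) + incident_pay(shift, incident_rate, per_alert)

# Hypothetical weekend shift and rates, for illustration only.
shift = Shift(hours_on_call=48, hours_on_incidents=3, alerts_handled=5)
print("standby: ", standby_pay(shift, standby_rate=4.0))
print("incident:", incident_pay(shift, incident_rate=60.0, per_alert=10.0))
print("combined:", combined_pay(shift, standby_rate=4.0, incident_rate=60.0, per_alert=10.0))
```

Note that under the incident-based and combined calculations, pay rises with the number of alerts, which is the indirect reward for noisy or buggy systems that the models above caution about.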
Other things to consider
These are the typical models for on-call compensation plans. Some other things to consider, as appropriate, are:
- Number of alerts received on and off-hours
This number is critical to determine if you need on-call schedule coverage after business hours, or a special on-call team during business hours.
- Time spent working on the incidents
The complexity and importance of your organization’s incidents can vary. An on-call engineer may spend a couple of minutes on an issue or could spend the entire night firefighting an incident. The amount of time and effort put in during a typical on-call shift should be taken into consideration. This needs to be measured for fair compensation.
- Mean time to acknowledge or resolve
Enforced by escalation policies, time to acknowledge is critical for fast resolution. Measuring the mean time to acknowledge and resolve over a period of time helps managers decide on additional incentives.
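Mean time to acknowledge and mean time to resolve can be computed from incident timestamps. The following Python sketch assumes each incident record carries `triggered`, `acknowledged`, and `resolved` timestamps, and that both means are measured from the moment the alert triggered; those field names, the sample data, and the `mean_minutes` helper are hypothetical, and your incident tooling may define the metrics differently.

```python
from datetime import datetime, timedelta
from statistics import mean

def mean_minutes(incidents, start_key, end_key):
    """Average gap in minutes between two timestamps across incidents."""
    gaps = [(i[end_key] - i[start_key]).total_seconds() / 60 for i in incidents]
    return mean(gaps)

# Hypothetical incident records, for illustration only.
t0 = datetime(2024, 1, 1, 2, 0)
incidents = [
    {"triggered": t0,
     "acknowledged": t0 + timedelta(minutes=4),
     "resolved": t0 + timedelta(minutes=50)},
    {"triggered": t0,
     "acknowledged": t0 + timedelta(minutes=10),
     "resolved": t0 + timedelta(minutes=30)},
]

mtta = mean_minutes(incidents, "triggered", "acknowledged")
mttr = mean_minutes(incidents, "triggered", "resolved")
print("mean time to acknowledge (min):", mtta)
print("mean time to resolve (min):", mttr)
```

Tracking these figures over a consistent period, such as per rotation or per quarter, makes it easier to compare shifts fairly.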