Incident management for high-velocity teams

What is an error budget—and why does it matter?

Every development, operations, and IT team knows that sometimes incidents happen. 

Even the biggest companies with the brightest talent and a reputation for nearly 100% uptime sometimes watch in frustration as their systems go down. Just look at Apple, Delta, or Facebook: all have lost tens of millions to incidents in the past five years.

This reality means Service Level Agreements (SLAs) should never promise 100% uptime, because that’s a promise no company can keep.

It also means that if your company is very good at avoiding or resolving incidents, you might consistently knock your uptime goals out of the park. Perhaps you promise 99% uptime and actually come closer to 99.5%. Perhaps you promise 99.5% uptime and actually reach 99.99% in a typical month.

When that happens, industry experts recommend that instead of setting user expectations too high by constantly overshooting your promises, you treat that unused downtime allowance as an error budget: time that your team can use to take risks.

What is an error budget?

An error budget is the maximum amount of time that a technical system can fail without contractual consequences. 

For example, if your Service Level Agreement (SLA) specifies that systems will function 99.99% of the time before the business has to compensate customers for the outage, that means your error budget (or the time your systems can go down without consequences) is 52 minutes and 35 seconds per year.

If your SLA promises 99.95% uptime, your error budget is four hours, 22 minutes, and 48 seconds per year. And with an SLA promise of 99.9% uptime, your error budget is eight hours, 45 minutes, and 57 seconds per year.
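The arithmetic behind these figures can be sketched in a few lines of Python. This is only an illustration: the function name `allowed_downtime` and the 365.25-day average year are assumptions, not something this article specifies. Published figures can differ by a few seconds depending on the year length and rounding used.

```python
from datetime import timedelta

# Assumption: an average year of 365.25 days. Using 365 days instead
# shifts the results by a few seconds.
DAYS_PER_YEAR = 365.25


def allowed_downtime(uptime_percent: float,
                     period_days: float = DAYS_PER_YEAR) -> timedelta:
    """Return the downtime an uptime target permits over a period."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    period_seconds = period_days * 24 * 60 * 60
    allowed_seconds = period_seconds * (100 - uptime_percent) / 100
    # Truncate to whole seconds for readability.
    return timedelta(seconds=int(allowed_seconds))


if __name__ == "__main__":
    for target in (99.99, 99.9, 99.0):
        print(f"{target}% uptime allows {allowed_downtime(target)} per year")
```

With these assumptions, a 99.99% target works out to 52 minutes and 35 seconds per year. Passing a different `period_days` (for example, the number of days in a billing month) gives the budget for a shorter window.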

Why do tech teams need error budgets?

At first glance, error budgets don’t seem that important. They’re just another metric IT and DevOps need to track to make sure everything’s running smoothly, right?

The answer, fortunately, is no. Error budgets aren’t just a convenient way to make sure you’re meeting contractual promises. They’re also an opportunity for development teams to innovate and take risks. 

As we explain in our SRE article, 

“The development team can ‘spend’ this error budget in any way they like. If the product is currently running flawlessly, with few or no errors, they can launch whatever they want, whenever they want. Conversely, if they have met or exceeded the error budget and are operating at or below the defined SLA, all launches are frozen until they reduce the number of errors to a level that allows the launch to proceed.”

The benefit of this approach is that it encourages teams to minimize real incidents and maximize innovation by taking risks within acceptable limits. It also bridges the gap between development teams, whose goals are innovation and agility, and operations, who are concerned with stability and security. As long as downtime remains low, developers can remain agile and push changes without friction from operations.

How to use an error budget

First, you’ll need to consult your SLAs and SLOs. What objectives have you already set for uptime or successful system requests? What promises has your company made to clients? Those will dictate your error budget.

Error budgets based on uptime

Most teams monitor uptime on a monthly basis. If availability is above the number promised by the SLA/SLO, the team can release new features and take risks. If it’s below the target, releases halt until the target numbers are back on track. 
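As a rough sketch of that monthly check, the snippet below compares recorded downtime against the budget implied by an uptime target and reports whether releases may continue. The names `BudgetStatus` and `monthly_budget_status`, the 30-day default month, and the "freeze once the budget is spent" rule are illustrative assumptions; your own policy may differ.

```python
from dataclasses import dataclass


@dataclass
class BudgetStatus:
    allowed_minutes: float    # total downtime the target permits this month
    used_minutes: float       # downtime already recorded
    remaining_minutes: float  # negative when the budget is overspent
    releases_allowed: bool    # False means releases are frozen


def monthly_budget_status(target_percent: float,
                          downtime_minutes: float,
                          days_in_month: int = 30) -> BudgetStatus:
    """Compare recorded downtime with the budget implied by an uptime target."""
    total_minutes = days_in_month * 24 * 60
    allowed = total_minutes * (100 - target_percent) / 100
    remaining = allowed - downtime_minutes
    # Hypothetical policy: releases halt once the budget is met or exceeded.
    return BudgetStatus(allowed, downtime_minutes, remaining, remaining > 0)


if __name__ == "__main__":
    status = monthly_budget_status(target_percent=99.9, downtime_minutes=10)
    print(f"Remaining budget: {status.remaining_minutes:.1f} minutes")
    print("Releases allowed" if status.releases_allowed else "Releases frozen")
```

In practice the `downtime_minutes` value would come from your monitoring system rather than being passed in by hand.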

To use this method effectively, you’ll need to translate your SLO target (usually a percentage) into real figures your developers can work within. This means calculating how many hours and minutes your 1%, 0.5%, or 0.1% of allowed downtime actually represents. Common targets include:

| SLA target | Yearly allowed downtime | Monthly allowed downtime |
|---|---|---|
| 99.99% uptime | 52 minutes, 35 seconds | 4 minutes, 23 seconds |
| 99.95% uptime | 4 hours, 22 minutes, 48 seconds | 21 minutes, 54 seconds |
| 99.9% uptime | 8 hours, 45 minutes, 57 seconds | 43 minutes, 50 seconds |
| 99.5% uptime | 43 hours, 49 minutes, 45 seconds | 3 hours, 39 minutes |
| 99% uptime | 87 hours, 39 minutes | 7 hours, 18 minutes |

Error budgets based on successful requests

SLOs get less hate than SLAs, but they can create just as many problems if they’re vague, overly complicated, or impossible to measure. The key to SLOs that don’t make your engineers want to tear their hair out is simplicity and clarity. Only the most important metrics should qualify for SLO status, the objectives should be spelled out in plain language, and, as with SLAs, they should always account for issues such as client-side delays.
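For a budget measured in requests rather than uptime, the same idea applies to counts: the budget is the share of requests that may fail within the measurement window. The sketch below is a minimal illustration under that assumption. The function name `request_error_budget` and the example numbers are hypothetical, and how you define a "failed" request is up to your SLO.

```python
from typing import Optional


def request_error_budget(slo_percent: float,
                         total_requests: int,
                         failed_requests: int) -> dict:
    """Summarize how much of a request-based error budget has been used."""
    # Number of requests allowed to fail while still meeting the SLO.
    allowed_failures = total_requests * (100 - slo_percent) / 100
    remaining = allowed_failures - failed_requests
    # Fraction of the budget consumed; undefined when no failures are allowed.
    consumed: Optional[float] = (
        failed_requests / allowed_failures if allowed_failures > 0 else None
    )
    return {
        "allowed_failures": allowed_failures,
        "remaining_failures": remaining,
        "consumed_fraction": consumed,
        "within_budget": remaining > 0,
    }


if __name__ == "__main__":
    # Hypothetical month: one million requests, 250 failures, 99.9% SLO.
    report = request_error_budget(99.9, 1_000_000, 250)
    print(f"Budget consumed: {report['consumed_fraction']:.0%}")
```

A request-based budget can be easier to reason about for services with uneven traffic, since an outage during a quiet period costs less of the budget than one at peak.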
