Incident management for high-velocity teams

What is an error budget—and why does it matter?

Every development, operations, and IT team knows that sometimes incidents happen.

Even the biggest companies with the brightest talent and a reputation for nearly 100% uptime sometimes watch in frustration as their systems go down. Just look at Apple, Delta, or Facebook: all have lost tens of millions to incidents in the past five years.

This reality means Service Level Agreements (SLAs) should never promise 100% uptime, because that’s a promise no company can keep.

It also means that if your company is very good at avoiding or resolving incidents, you might consistently knock your uptime goals out of the park. Perhaps you promise 99% uptime and actually come closer to 99.5%. Perhaps you promise 99.5% uptime and actually reach 99.99% in a typical month.

When that happens, industry experts recommend that instead of setting user expectations too high by constantly overshooting your promises, you treat the gap between what you promised and what you delivered (0.49 percentage points in the second example) as an error budget: time that your team can use to take risks.

What is an error budget?

An error budget is the maximum amount of time that a technical system can fail without contractual consequences.

For example, if your Service Level Agreement (SLA) specifies that systems will function 99.99% of the time before the business has to compensate customers for the outage, that means your error budget (or the time your systems can go down without consequences) is about 52 minutes and 36 seconds per year, counting a year as 365.25 days.

If your SLA promises 99.95% uptime, your error budget is about four hours and 23 minutes per year. And with an SLA promise of 99.9% uptime, your error budget is about eight hours and 46 minutes per year.
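The arithmetic behind these figures can be sketched in a few lines of Python. This is a minimal illustration, not a standard tool: the function name `error_budget` is hypothetical, and the 365.25-day year is an assumption (a 365-day year gives slightly smaller numbers).

```python
from datetime import timedelta

# Assumption: a year is 365.25 days, which averages in leap years.
YEAR = timedelta(days=365.25)


def error_budget(sla_percent: float, period: timedelta = YEAR) -> timedelta:
    """Return the downtime an uptime target allows within `period`.

    The result is rounded to the nearest whole second.
    """
    if not 0 <= sla_percent <= 100:
        raise ValueError("sla_percent must be between 0 and 100")
    allowed_fraction = (100 - sla_percent) / 100
    return timedelta(seconds=round(period.total_seconds() * allowed_fraction))


if __name__ == "__main__":
    for target in (99.99, 99.95, 99.9):
        print(f"{target}% uptime -> {error_budget(target)} of downtime per year")
```

Passing a different `period`, such as `timedelta(days=30)`, gives the budget for a month instead of a year.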

Why do tech teams need error budgets?

At first glance, error budgets don’t seem that important. They’re just another metric IT and DevOps need to track to make sure everything’s running smoothly, right?

The answer, fortunately, is no. Error budgets aren’t just a convenient way to make sure you’re meeting contractual promises. They’re also an opportunity for development teams to innovate and take risks.

As we explain in our SRE article,

“The development team can ‘spend’ this error budget in any way they like. If the product is currently running flawlessly, with few or no errors, they can launch whatever they want, whenever they want. Conversely, if they have met or exceeded the error budget and are operating at or below the defined SLA, all launches are frozen until they reduce the number of errors to a level that allows the launch to proceed.”

The benefit of this approach is that it encourages teams to minimize real incidents and maximize innovation by taking risks within acceptable limits. It also bridges the gap between development teams, whose goals are innovation and agility, and operations, who are concerned with stability and security. As long as downtime remains low, developers can remain agile and push changes without friction from operations.

How to use an error budget

First, you’ll need to consult your SLAs and SLOs. What objectives have you already set for uptime or successful system requests? What promises has your company made to clients? Those will dictate your error budget.

Error budgets based on uptime

Most teams monitor uptime on a monthly basis. If availability is above the number promised by the SLA/SLO, the team can release new features and take risks. If it’s below the target, releases halt until the target numbers are back on track.

To use this method effectively, you’ll need to translate your SLO target (usually a percentage) into real figures your developers can work within. This means calculating how many hours and minutes your 1%, 0.5%, or 0.1% of allowed downtime actually translates to. In a 30-day month, for example, 1% of allowed downtime is 7 hours and 12 minutes, 0.5% is 3 hours and 36 minutes, and 0.1% is 43 minutes and 12 seconds.
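The monthly check described in this section, where you release while budget remains and halt once it is spent, might look something like the sketch below. The function names `remaining_budget` and `can_release` are hypothetical, and the fixed 30-day window is an assumption. Real teams often use calendar months or rolling windows and pull downtime from their monitoring system.

```python
from datetime import timedelta

# Assumption: the SLO is evaluated over a fixed 30-day window.
WINDOW = timedelta(days=30)


def remaining_budget(slo_percent: float, downtime: timedelta,
                     window: timedelta = WINDOW) -> timedelta:
    """Return the downtime still available in the window.

    A negative result means the budget has been overspent.
    """
    allowed = window * ((100 - slo_percent) / 100)
    return allowed - downtime


def can_release(slo_percent: float, downtime: timedelta,
                window: timedelta = WINDOW) -> bool:
    """Allow releases only while some error budget remains."""
    return remaining_budget(slo_percent, downtime, window) > timedelta(0)


if __name__ == "__main__":
    # Hypothetical month: 13 minutes of downtime against a 99.9% target.
    spent = timedelta(minutes=13)
    print("Remaining:", remaining_budget(99.9, spent))
    print("Releases allowed:", can_release(99.9, spent))
```

Keeping the gate as a separate function makes the policy easy to change, for example to freeze releases before the budget reaches zero.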

Error budgets based on successful requests

SLOs get less hate than SLAs, but they can create just as many problems if they’re vague, overly complicated, or impossible to measure. The key to SLOs that don’t make your engineers want to tear their hair out is simplicity and clarity. Only the most important metrics should qualify for SLO status, the objectives should be spelled out in plain language, and, as with SLAs, they should always account for issues such as client-side delays.
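For a budget based on successful requests rather than uptime, one simple formulation is to count how many requests are allowed to fail in a period. The sketch below assumes the SLO is a percentage of requests that must succeed. The function names `allowed_failures` and `budget_consumed` are hypothetical, and how you define a "failed" request (for example, whether client-side delays count) is left to your SLO.

```python
from fractions import Fraction
from math import floor


def allowed_failures(slo_percent: float, total_requests: int) -> int:
    """Return how many requests may fail without breaching the SLO."""
    # Parse the target as an exact decimal to avoid float rounding errors.
    slo = Fraction(str(slo_percent))
    return floor(total_requests * (100 - slo) / 100)


def budget_consumed(slo_percent: float, total_requests: int,
                    failed_requests: int) -> float:
    """Return the fraction of the error budget used (1.0 means fully spent)."""
    allowed = allowed_failures(slo_percent, total_requests)
    if allowed == 0:
        # No failures are permitted, so any failure overspends the budget.
        return float("inf") if failed_requests else 0.0
    return failed_requests / allowed


if __name__ == "__main__":
    # Hypothetical month: 1,000,000 requests, 250 failures, 99.9% target.
    print("Allowed failures:", allowed_failures(99.9, 1_000_000))
    print("Budget consumed:", budget_consumed(99.9, 1_000_000, 250))
```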

With Jira Service Management, you can stay on top of SLAs to resolve requests based on priorities, and use automated escalation rules to notify the right team members and prevent SLA breaches.

Reliability vs. availability: Frequently asked questions

What is an example of reliability vs. availability?

Consider new technology like driverless cars. Service reliability standards are near or at 100% because a single failure can result in injury or death. 

Conversely, the availability of driverless cars affects the user experience. The higher the availability, or operational time, the better the experience. Low availability may cause the business to lose market share, but it is unlikely to result in injury or death.

Why are reliability and availability important?

Both reliability and availability impact a business’s bottom line because they affect customer satisfaction. In addition, systems that are not available or reliable cost companies money in lost revenue, spoilage, unplanned maintenance costs, and lost productivity.

Focusing on service reliability and availability can result in a greater competitive advantage, an increased market share, better revenue, and an improved budgeting plan for maintenance costs.

What are the trade-offs between reliability and availability?

Businesses sometimes have to prioritize reliability over availability or vice versa. Real trade-offs may be necessary when timelines are short or investment funds are limited.

In the case of driverless cars, businesses are likely to invest more time and effort in increased reliability, even if it negatively impacts availability. However, in less critical situations, such as online retail, a business may focus on increasing availability because being “always open” is one of the key differentiators between e-commerce and brick-and-mortar competitors.
