Incident management for high-velocity teams

Love DevOps? Wait until you meet SRE

You may have heard of a little company called Google. They invent cool stuff like driverless cars and elevators into outer space. Oh, and they develop massively successful applications like Gmail, Google Docs, and Google Maps. It’s safe to say they know a thing or two about successful application development, right?

They’re also the pioneers behind a growing movement called Site Reliability Engineering (SRE). SRE effectively ends the age-old battles between Development and Operations. It encourages product reliability, accountability, and innovation – minus the hallway drama you’ve come to expect in what can feel like Software Development High School.

How? Let’s look at the basics.


What in the world is SRE?

Google’s mastermind behind SRE, Ben Treynor, still hasn’t published a single-sentence definition, but describes site reliability as “what happens when a software engineer is tasked with what used to be called operations.”

The underlying problem goes like this: Dev teams want to release awesome new features to the masses, and see them take off in a big way. Ops teams want to make sure those features don’t break things. Historically, that’s caused a big power struggle, with Ops trying to put the brakes on as many releases as possible, and Dev looking for clever new ways to sneak around the processes that hold them back. (Sounds familiar, I'd wager.)

SRE removes the conjecture and debate over what can be launched and when. It introduces a mathematical formula for green- or red-lighting launches, and dedicates a team of people with Ops skills (appropriately called Site Reliability Engineers, or SREs) to continuously oversee the reliability of the product. As Google’s own SRE Andrew Widdowson describes it, “Our work is like being a part of the world’s most intense pit crew. We change the tires of a race car as it’s going 100mph.”

Doesn’t sound revolutionary yet? Much of the magic is in how it works. Here are some of the core principles – which also happen to be some of the biggest departures from traditional IT operations.

First, new launches are green-lighted based on current product performance.

Most applications don’t achieve 100% uptime. So for each service, the SRE team sets a service-level agreement (SLA) that defines how reliable the system needs to be to end-users. If the team agrees on a 99.9% SLA, that gives them an error budget of 0.1%. An error budget is exactly as it’s named: it’s the maximum allowable threshold for errors and outages.

ProTip: You can easily convert SLAs into "minutes of downtime" with this cool uptime cheat sheet.
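If you'd rather compute it than look it up, here is a minimal sketch of the SLA-to-downtime conversion. The function names and the 30-day window are assumptions made for illustration; they don't come from Google's tooling or any particular cheat sheet.

```python
MINUTES_PER_DAY = 24 * 60


def error_budget(sla_percent: float) -> float:
    """Return the error budget as a fraction (a 99.9% SLA leaves about 0.001)."""
    if not 0 < sla_percent <= 100:
        raise ValueError("SLA must be greater than 0 and at most 100")
    return (100.0 - sla_percent) / 100.0


def allowed_downtime_minutes(sla_percent: float, period_days: int = 30) -> float:
    """Minutes of downtime the error budget permits over the period.

    The 30-day default is an assumption; use whatever window your SLA covers.
    """
    return error_budget(sla_percent) * period_days * MINUTES_PER_DAY


if __name__ == "__main__":
    for sla in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(sla)
        print(f"{sla}% SLA allows {minutes:.1f} minutes of downtime per 30 days")
```

Under these assumptions, a 99.9% SLA works out to roughly 43.2 minutes of downtime in a 30-day window.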

Here’s the clincher: The development team can “spend” this error budget in any way they like. If the product is currently running flawlessly, with few or no errors, they can launch whatever they want, whenever they want. Conversely, if they have met or exceeded the error budget and are operating at or below the defined SLA, all launches are frozen until they reduce the number of errors to a level that allows the launch to proceed.
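The launch rule above can be sketched in a few lines. This is an illustration under stated assumptions, not Google's actual formula: it measures availability as the share of successful requests, and the `ServiceWindow` class and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ServiceWindow:
    """Request counts for one measurement window (hypothetical structure)."""
    sla_percent: float
    total_requests: int
    failed_requests: int

    def availability_percent(self) -> float:
        """Share of requests that succeeded, as a percentage."""
        if self.total_requests == 0:
            return 100.0  # no traffic means nothing has failed
        succeeded = self.total_requests - self.failed_requests
        return 100.0 * succeeded / self.total_requests

    def launches_allowed(self) -> bool:
        """Launches proceed only while availability is above the SLA.

        At or below the SLA the error budget is spent, so launches freeze.
        """
        return self.availability_percent() > self.sla_percent


if __name__ == "__main__":
    healthy = ServiceWindow(sla_percent=99.9, total_requests=1_000_000, failed_requests=500)
    overspent = ServiceWindow(sla_percent=99.9, total_requests=1_000_000, failed_requests=2_000)
    print("healthy service can launch:", healthy.launches_allowed())
    print("overspent service can launch:", overspent.launches_allowed())
```

Counting requests is only one way to measure reliability; a team that tracks minutes of downtime instead would apply the same comparison to that number.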

The genius? Both the SREs and developers have a strong incentive to work together to minimize the number of errors.

SREs can code, too

In the old model, you throw people at a reliability problem and keep pushing (sometimes for a year or more) until the problem either goes away or blows up in your face.

Not so in SRE. Both the development and SRE teams share a single staffing pool, so for every SRE that is hired, one fewer developer headcount is available (and vice versa). This ends the never-ending headcount battle between Dev and Ops, and creates a self-policing system where developers get rewarded with more teammates for writing better-performing code (i.e., code that needs less support from fewer SREs).


SRE teams are staffed entirely with rock-star developer/sys-admin hybrids who know not only how to find problems, but how to fix them, too. They interface easily with the development team, and as code quality improves, they are often moved to the development team if fewer SREs are needed on a project.

In fact, one of the core principles mandates that SREs spend no more than 50% of their time on Ops work. As much of their time as possible should be spent writing code and building systems to improve performance and operational efficiency.

Developers get their hands dirty, too

At Google, Ben Treynor had to fight for the clause that puts developers on the hook for a share of operations work, and he’s glad he did. In his keynote on SRE at SREcon14, he emphasizes that getting this commitment from your executives before you launch SRE should be mandatory.

Basically, the development team handles 5% of all operations workload (handling tickets, providing on-call support, etc.). This allows them to stay closely connected to their product, see how it is performing, and make better coding and release decisions.

In addition, any time the operations load exceeds the capacity of the SRE team, the overflow always gets assigned to the developers. When the system is working well, the developers begin to self-regulate here as well, writing strong code and launching carefully to prevent future issues.
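To make the split concrete, here is a rough sketch of how the 5% baseline and the overflow rule might combine. Measuring workload in hours, the function name, and the returned keys are all assumptions made for illustration; the article doesn't say how Google measures operations load.

```python
def split_ops_workload(
    total_hours: float,
    sre_capacity_hours: float,
    dev_share: float = 0.05,
) -> dict:
    """Split operations work between the SRE team and the developers.

    Developers always take a baseline share (5% in the article). SREs take
    the rest up to their capacity, and anything beyond that overflows to
    the developers.
    """
    if total_hours < 0 or sre_capacity_hours < 0:
        raise ValueError("hours must be non-negative")
    dev_baseline = total_hours * dev_share
    remaining = total_hours - dev_baseline
    sre_load = min(remaining, sre_capacity_hours)
    overflow = remaining - sre_load
    return {"sre": sre_load, "dev": dev_baseline + overflow}


if __name__ == "__main__":
    # Quiet week: the SRE team has spare capacity, so developers keep only the baseline.
    print(split_ops_workload(total_hours=100, sre_capacity_hours=200))
    # Noisy week: the SRE team is saturated, so the excess lands on the developers.
    print(split_ops_workload(total_hours=100, sre_capacity_hours=60))
```

In the noisy-week case, the developers end up with far more than their baseline, which is the pressure that nudges them toward stronger code and more careful launches.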

SREs are free agents (and can be pulled, if needed)

To make sure teams stay healthy and happy, Treynor recommends allowing SREs to move to other projects as they desire, or even move to a different organization. SRE encourages highly motivated, dedicated, and effective teamwork – so no team member should be held back from pursuing their own objectives.

If an entire team of SREs and developers simply can’t get along and are creating more trouble than reliable code, there’s a final drastic measure you can take: Pull the entire SRE team off the project, and assign all of the operations workload directly to the development team. Treynor has only done this a couple of times in his entire career, and the threat is usually enough to bring both teams around to a more positive working relationship.

There’s quite a bit more to SRE than I can cover in one article – like how SRE prevents production incidents, how on-call support teams are staffed, and the rules they follow on each shift.

Our take

IT is full of buzzwords and trends, to be sure. One minute it’s cloud, the next it’s DevOps or customer experience or gamification. SRE is in a strong position to become much more than that, particularly since it is far more about the people and process than the technology that underlies them. While technology certainly can (and likely will) adapt to the concept as it matures and more teams adopt it, you don’t need new tools to align your development and operations organizations around the principles of Site Reliability Engineering.

In future articles, we’ll look at just that: practical steps toward SRE, and the role technology can play.

About the author

Patrick Hill

I've been with Atlassian a while now, and recently transferred from Sydney to our Austin office. (G'day, y'all!) In my free time, I enjoy taking my beard from "distinguished professor" to "lumberjack" and back again. Find me on Twitter! @topofthehill

