Incident management for high-velocity teams

Atlassian Incident Handbook

Teams running tech services today are expected to maintain 24/7 availability.

When something goes wrong, whether it's an outage or a broken feature, team members need to respond immediately and restore service. This process is called incident management, and it’s an ongoing, complex challenge for companies big and small.

We want to help teams everywhere improve their incident management. Inspired by teams like Google, we've created this handbook as a summary of Atlassian's incident management process. These are the lessons we've learned responding to incidents for more than a decade. While it’s based on our unique experiences, we hope it can be adapted to suit the needs of your own team.

Get the handbook in print or PDF

We have a limited supply of print versions of the Incident Management Handbook that we're shipping out for free. You can also download a PDF version.

What is an incident?

We define an incident as an event that disrupts a service, or reduces its quality, and that requires an emergency response. Teams that follow ITIL or ITSM practices may use the term major incident instead.

An incident is resolved when the affected service resumes functioning in its usual way. This includes only those tasks required to restore full functionality.

The incident postmortem is held after the incident to determine the root cause and to assign actions that address that cause before it can lead to a repeat incident.

Our incident values

A process for managing incidents can't cover all possible situations, so we empower our teams with general guidance in the form of values. Similar to Atlassian's company values, our incident values are designed to:

Guide autonomous decision-making by people and teams in incidents and postmortems.
Build a consistent culture across teams for how we identify, manage, and learn from incidents.
Align teams on the attitude they should bring to each part of incident identification, resolution, and reflection.

Stage	Incident Value	Related Atlassian Value	Rationale
1. Detect	Atlassian knows before our customers do	Build with Heart and Balance	A balanced service includes enough monitoring and alerting to detect incidents before our customers do. The best monitoring alerts us to problems before they even become incidents.
2. Respond	Escalate, escalate, escalate	Play, as a team	Nobody likes being woken up, and we don’t take the responsibility lightly. But people understand that occasionally they will be woken for an incident where it turns out they aren't needed. What’s usually harder is waking up to a major incident and playing catch-up when you should have been alerted earlier. We won't always have all the answers, so "don't hesitate to escalate."
3. Recover	Shit happens, clean it up quickly	Don't !@#$ the Customer	Our customers don't care why their service is down, only that we restore service as quickly as possible. Never hesitate in getting an incident resolved quickly so that we can minimise impact to our customers.
4. Learn	Always Blameless	Open Company, No Bullshit	Incidents are part of running services. We improve services by holding teams accountable, not by apportioning blame.
5. Improve	Never have the same incident twice	Be the change you seek	Identify the root cause and the changes that will prevent the whole class of incident from occurring again. Commit to delivering specific changes by specific dates.

Tooling requirements

The incident management process described here uses several tools that are specific to Atlassian and can be substituted as needed:

Incident tracking - every incident is tracked as a Jira issue, with a follow-up issue created to track the completion of postmortems (Atlassian uses a heavily customized version of Jira Software for this).
Chat room - a real-time text communication channel is fundamental to diagnosing and resolving the incident as a team.
Video chat - for many incidents, team video chat like Blue Jeans can help you discuss and agree on approaches.
Alerting system - a tool such as Opsgenie manages on-call rotations and escalations.
Documentation tool - we use Confluence for our incident state documents and for sharing postmortems via blogs.
Statuspage - communicating status with both internal stakeholders and customers through Statuspage helps keep everyone in the loop.
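
To make the "on-call rotations and escalations" idea concrete, here is a minimal sketch of an escalation policy in Python. It is not how Opsgenie or any other alerting tool is implemented or configured. The `EscalationStep` type, the `responders_to_page` function, the responder labels, and the minute thresholds are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    responder: str      # hypothetical person or rotation label
    after_minutes: int  # minutes without acknowledgement before paging


def responders_to_page(steps, minutes_unacknowledged):
    """Return everyone who should have been paged by now, earliest step first."""
    ordered = sorted(steps, key=lambda s: s.after_minutes)
    return [s.responder for s in ordered if s.after_minutes <= minutes_unacknowledged]


# Illustrative policy: thresholds are assumptions, not values from the handbook.
policy = [
    EscalationStep("primary on-call", 0),
    EscalationStep("secondary on-call", 10),
    EscalationStep("engineering manager", 20),
]

# Twelve minutes without an acknowledgement: the first two steps have fired.
paged = responders_to_page(policy, 12)
```

Short thresholds fit the "escalate, escalate, escalate" value: the sketch pages the next responder as soon as a step's threshold passes, rather than waiting for certainty that they are needed.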

Incident tracking

Every incident is tracked as a Jira issue, with a follow-up issue created to track the completion of postmortems. The process in this handbook references our heavily customized version of Jira Software.

Incident issues are typically created by a support engineer in response to a customer ticket or by a developer recognizing a monitoring alert as being an incident. We urge people to create an issue if they're worried about something, rather than wait to escalate it.

In Jira, we have a simple workflow to track incidents through the resolution stage and to record all important actions taken during the incident response.
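
As a rough illustration of what such a workflow tracks, the sketch below models an incident record with a small set of statuses and a timestamped action log. It does not use the Jira API and does not reproduce Atlassian's actual workflow. The status names, the `Incident` class, and its methods are assumptions made for this example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class Status(Enum):
    # Hypothetical status names; the handbook section above does not list them.
    DETECTED = "detected"
    RESPONDING = "responding"
    RESOLVED = "resolved"


# Allowed transitions for this illustrative workflow.
TRANSITIONS = {
    Status.DETECTED: {Status.RESPONDING},
    Status.RESPONDING: {Status.RESOLVED},
    Status.RESOLVED: set(),
}


@dataclass
class Incident:
    summary: str
    assignee: str  # whoever the issue is assigned to
    status: Status = Status.DETECTED
    log: list = field(default_factory=list)

    def record(self, action: str) -> None:
        """Append a timestamped entry describing an action taken."""
        self.log.append((datetime.now(timezone.utc), action))

    def transition(self, new_status: Status) -> None:
        """Move to new_status if the workflow allows it, and log the change."""
        if new_status not in TRANSITIONS[self.status]:
            raise ValueError(
                f"cannot move from {self.status.value} to {new_status.value}"
            )
        self.record(f"status: {self.status.value} -> {new_status.value}")
        self.status = new_status
```

A short usage example: `inc = Incident("Login page returns errors", assignee="alice")`, then `inc.record("paged database on-call")` and `inc.transition(Status.RESPONDING)`. Logging each transition alongside manual entries keeps a single timeline that a later postmortem can draw on.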

Incident manager

Each incident is driven by the incident manager (IM), who has overall responsibility and authority for the incident. This person is indicated by the assignee on the incident issue. The incident manager is empowered to take any action necessary to resolve the incident, which includes paging anyone in the organization and keeping those involved in an incident focused on restoring service as quickly as possible.

The incident manager is a role, rather than an individual on the incident. The advantage of defining roles during an incident is that it allows people to become interchangeable. As long as a given person knows how to perform a certain role, they can take that role for any incident.
