
Responding to an incident

The following sections describe Atlassian's process for responding to incidents. The incident manager (IM) goes through this series of steps to drive the incident from detection to resolution.

Incident response workflow

Detect

People at your company can become aware of incidents in many ways: they can be alerted by monitoring, hear about them through customer reports, or observe problems themselves. However an incident is detected, the first step the team takes is logging an incident ticket (in our case, a Jira issue).


We use an easy-to-remember short URL that redirects Atlassians to an internal Jira Service Management portal. Atlassians can check if there's an incident already in progress by looking at a Jira dashboard or a Jira macro in Confluence. Teams such as our customer support teams have dashboards set up at well-known locations to monitor incidents in progress.

We fill in the following fields for every incident:

Jira field | Type | Help text
Summary | Text | What's the emergency?
Description | Text | What's the impact on customers? Include your contact details so responders can reach you.
Severity | Single-select | (Hyperlink to a Confluence page with our severity scale.) Choosing Sev 2 or 1 means you believe this must be resolved right now - people will be paged.
Faulty service | Single-select | The service that has the fault that's causing the incident. Take your best guess if unsure. Select "Unknown" if you have no idea.
Affected products | Checkboxes | Which products are affected by the incident? Select any that apply.

Once the incident is created, its issue key is used in all internal communications about the incident.
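As an illustration of how these fields come together, here is a minimal sketch of raising an incident through the Jira Cloud REST API. The site URL, credentials, project key, issue type, and custom field IDs are all placeholders; your instance's values will differ.

```python
import requests

JIRA_URL = "https://your-company.atlassian.net"  # placeholder site
AUTH = ("bot@example.com", "api-token")          # placeholder credentials

# The custom field IDs for Severity, Faulty service, and Affected products
# below are hypothetical; look up the IDs used in your own Jira instance.
payload = {
    "fields": {
        "project": {"key": "HOT"},                       # assumed incident project
        "issuetype": {"name": "Incident"},               # assumed issue type
        "summary": "Checkout requests failing with HTTP 500",
        "description": "Customers cannot complete checkout. Contact: jane@example.com",
        "customfield_10001": {"value": "2"},             # Severity (single-select)
        "customfield_10002": {"value": "Payments"},      # Faulty service (single-select)
        "customfield_10003": [{"value": "Jira Cloud"}],  # Affected products (checkboxes)
    }
}

resp = requests.post(f"{JIRA_URL}/rest/api/2/issue", json=payload, auth=AUTH)
resp.raise_for_status()
print("Raised incident", resp.json()["key"])  # e.g. HOT-1234, used in all comms
```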

Customers will often open support cases about an incident that affects them. Once our customer support teams determine that these cases all relate to an incident, they label those cases with the incident's issue key in order to track the customer impact and to more easily follow up with affected customers when the incident is resolved.
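As a sketch of that labeling step, assuming the support cases also live in Jira, with placeholder keys and credentials:

```python
import requests

JIRA_URL = "https://your-company.atlassian.net"  # placeholder site
AUTH = ("bot@example.com", "api-token")          # placeholder credentials

case_key = "SUPPORT-4321"   # hypothetical support case
incident_key = "HOT-1234"   # the related incident

# Add the incident's issue key as a label on the support case so the
# customer impact can be tracked and followed up after resolution.
resp = requests.put(
    f"{JIRA_URL}/rest/api/2/issue/{case_key}",
    json={"update": {"labels": [{"add": incident_key}]}},
    auth=AUTH,
)
resp.raise_for_status()
```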


Raise a new incident

When the incident issue has been created but hasn't yet been assigned to an incident manager (IM), the incident is in a new state. This is the initial status in our Jira incident workflow.

We have a service that uses Jira webhooks to trigger a page alert when a new major incident is created. This alerts an on-call IM based on the service that was selected. For example, an incident with Bitbucket will page a Bitbucket incident manager. We also have a global catch-all roster of major incident managers known as "incident manager on call" or IMOC.

The page alert includes the incident's issue key, severity, and summary, which tells the incident manager where to go to start managing the incident (the Jira issue), what's generally wrong, and how severe it is.
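A minimal sketch of such a webhook receiver is below. It assumes a small Flask service registered as a Jira webhook for newly created incidents, hypothetical custom field IDs for severity and faulty service, illustrative roster names, and the Opsgenie Alert API.

```python
from flask import Flask, request
import requests

app = Flask(__name__)
OPSGENIE_KEY = "your-opsgenie-api-key"  # placeholder

# Hypothetical mapping from the faulty service to its on-call IM roster,
# plus the global catch-all roster (IMOC).
ROSTER_BY_SERVICE = {"Bitbucket": "Bitbucket IM", "Jira Cloud": "Jira IM"}
CATCH_ALL = "IMOC"

@app.route("/jira-webhook", methods=["POST"])
def page_incident_manager():
    issue = request.get_json()["issue"]
    fields = issue["fields"]
    severity = (fields.get("customfield_10001") or {}).get("value")  # hypothetical field ID
    service = (fields.get("customfield_10002") or {}).get("value")   # hypothetical field ID

    # Only page for major incidents (Sev 1 and 2).
    if severity not in ("1", "2"):
        return "", 204

    roster = ROSTER_BY_SERVICE.get(service, CATCH_ALL)
    alert = {
        # The page carries the issue key, severity, and summary.
        "message": f"{issue['key']} Sev {severity}: {fields['summary']}",
        "alias": issue["key"],
        "responders": [{"name": roster, "type": "team"}],
    }
    resp = requests.post(
        "https://api.opsgenie.com/v2/alerts",
        json=alert,
        headers={"Authorization": f"GenieKey {OPSGENIE_KEY}"},
    )
    resp.raise_for_status()
    return "", 202

if __name__ == "__main__":
    app.run(port=8080)
```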


Open comms

The first thing the IM does when they come online is assign the incident issue to themselves and progress it to the fixing state. The Jira issue's assignee field also shows who the current IM is. In an emergency response, it's very important to be clear who's in charge, so we're pretty strict about making sure this field is accurate.

Next, the IM sets up the incident team's communication channels. The goal at this point is to establish and focus all incident team communications in well-known places. For every incident, we normally use three team communication methods, each of which is represented by a field on the Jira issue:

  • Chat room in Slack or another messaging service. This allows the team to communicate, share observations, links, and screenshots in a way that is timestamped and preserved. Giving the chat channel the same name as the issue key (e.g. HOT-1234) makes it easier for people who need to be involved to join the conversation. 
  • Video chat in a conferencing app like Skype, Blue Jeans or similar; or if you're all in the same place, gather the team in a physical room. We find that face-to-face communication helps teams work through things faster and with less back-and-forth.
  • Confluence page called the "incident state document". When people simultaneously edit the same page, they can see what info is being gathered in real time. This is a great way to keep track of changes (for example, a table of who changed what, when, how, why, how to revert, etc), multiple streams of work, or an extended timeline. An incident state document is extremely useful as the source of truth during complex or extended incidents.

We've found that using both a video chat and a chat room works best during an incident, as each is optimized for different things. Video chat excels at quickly creating a shared mental picture of the incident through group discussion, while text chat is great for keeping a timestamped record of the incident and for sharing links to dashboards, screenshots, and other URLs.

These methods can also be used to record important observations, changes, and decisions that happen in unrecorded conversations. The IM or anyone on the incident team does this by simply noting observations, changes, and decisions in the dedicated chat room as they happen. It's okay if it looks like people are talking to themselves! These notes are incredibly valuable during the postmortem, when teams need to reconstruct the incident timeline and figure out what caused the incident.

Most chat systems have a room topic feature. The IM updates the room topic with information about the incident and useful links, including:

  • The incident summary and severity.
  • Who is in what role, starting with the IM.
  • Links to the incident issue, the video chat room, and the incident state document.

This allows anyone with the incident's issue key to join the chat and come up to speed on the incident (remember that we named the chat channel based on the incident's issue key, e.g. HOT-1234).
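As a sketch of those chat-room conventions, assuming Slack and its Python SDK, with a placeholder bot token and a hypothetical issue key:

```python
from slack_sdk import WebClient

client = WebClient(token="xoxb-your-bot-token")  # placeholder token

issue_key = "HOT-1234"  # hypothetical incident issue key

# Name the channel after the issue key so responders can find it easily.
channel_id = client.conversations_create(name=issue_key.lower())["channel"]["id"]

# Put the summary, severity, roles, and key links in the channel topic.
client.conversations_setTopic(
    channel=channel_id,
    topic=(
        "Sev 2: Checkout requests failing | IM: @jane | Tech Lead: @raj | "
        "Issue: https://your-company.atlassian.net/browse/HOT-1234 | "
        "Video: <bridge link> | State doc: <Confluence link>"
    ),
)
```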

Finally, the IM sets their own personal chat status to the issue key of the incident they are managing. This lets their colleagues know that they're busy managing an incident. 


Assess

After the incident team's communication channels are set up, it's time to assess the incident so the team can decide what to tell people about it and who needs to fix it. 

We have the following set of questions that IMs ask their teams: 

  • What is the impact to customers (internal or external)?
  • What are customers seeing?
  • How many customers are affected (some, all)?
  • When did it start?
  • How many support cases have customers opened?
  • Are there other factors, e.g. Twitter, security, or data loss?

Now is a good time to start adding to the incident's timeline. Record the team's observations so that people joining can come up to speed. This is also important later on in the postmortem process. Make sure to note whether the incident's start time corresponds with a change (for example, a Bamboo deployment) so that change can be rolled back to potentially resolve the incident.

Based on the impact of the incident and the amount of work our teams think it will take to resolve, we assign incidents one of the following severity levels:

Severity | Description | Examples
1 | A critical incident with very high impact | A customer-facing service, like Jira Cloud, is down for all customers; confidentiality or privacy is breached; customer data loss.
2 | A major incident with significant impact | A customer-facing service is unavailable for a subset of customers; core functionality (e.g. git push, issue create) is significantly impacted.
3 | A minor incident with low impact | A minor inconvenience to customers with a workaround available; performance degradation, but the product is still usable.

Once you establish the impact of the incident, adjust or confirm the severity of the incident issue and communicate that severity to the team. We've found numbering the level to be very beneficial in clearly communicating severity.

At Atlassian, severity 3 incidents are passed to the delivery teams for resolution during business hours, whereas severities 1 and 2 require paging team members for an immediate fix. The difference in response between severity 1 and 2 is more nuanced and depends on the affected service.

Your severity matrix should be documented and agreed on by all of your teams so that your response to incidents is consistent and based on customer impact.
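One way to keep that agreement explicit is to encode the matrix somewhere both automation and responders can read it. The sketch below is illustrative, not Atlassian's exact policy.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ResponsePolicy:
    description: str
    page_immediately: bool   # page on-call responders right now?
    target: str              # who handles it and when

# Illustrative severity matrix; agree on your own version with all teams.
SEVERITY_MATRIX = {
    1: ResponsePolicy("Critical incident, very high impact", True,
                      "Page on-call responders and the IMOC immediately"),
    2: ResponsePolicy("Major incident, significant impact", True,
                      "Page on-call responders immediately"),
    3: ResponsePolicy("Minor incident, low impact", False,
                      "Route to the delivery team during business hours"),
}

policy = SEVERITY_MATRIX[2]
if policy.page_immediately:
    print(policy.target)
```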


Send initial comms

When you're reasonably confident that the incident is real, you want to communicate it internally and externally as soon as you can. The goal of initial internal communication is to focus the incident response on one place and reduce confusion. The goal of external communication is to tell customers that you know something's broken and you're looking into it as a matter of urgency. Communicating quickly and accurately about incidents helps to build trust with your staff and customers.

We use Statuspage for incident communications both internally and externally. We have separate status pages for internal company staff and external customers. We'll talk more about how to use each one later on, but for now, the goal is to get communications up as quickly as possible. In order to do that, we follow these templates:

 | Internal Statuspage | External Statuspage
Incident name | … - … - … | Investigating issues with …
Message | We are investigating an incident affecting …, and …. We will provide updates via email and Statuspage shortly. | We are investigating issues with … and will provide updates here soon.

In addition to creating a Statuspage incident, we send an email to an incident communications distribution list that includes our engineering leadership, major incident managers, and other interested staff. This email has the same content as the internal Statuspage incident. Email allows staff to reply and ask questions, whereas Statuspage is more like one-way broadcast communication.

Note that we always include the incident's Jira issue key in all internal communications about the incident, so staff know which chat room to pop into if they have more questions.
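A sketch of opening the external Statuspage incident through the Statuspage REST API is below; the page ID, API key, and product name are placeholders, and the internal page would get a similar call using the internal template (including the Jira issue key).

```python
import requests

STATUSPAGE_KEY = "your-statuspage-api-key"   # placeholder API key
EXTERNAL_PAGE_ID = "abc123"                  # placeholder page ID

incident = {
    "incident": {
        "name": "Investigating issues with Jira Cloud",   # external template
        "status": "investigating",
        "body": (
            "We are investigating issues with Jira Cloud and will "
            "provide updates here soon."
        ),
    }
}

resp = requests.post(
    f"https://api.statuspage.io/v1/pages/{EXTERNAL_PAGE_ID}/incidents",
    json=incident,
    headers={"Authorization": f"OAuth {STATUSPAGE_KEY}"},
)
resp.raise_for_status()

# The internal Statuspage incident is created the same way against the
# internal page, with the Jira issue key in the name and body.
```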


Escalate

You've taken command of the incident, established team communications, assessed the situation, and told staff and customers that an incident is in progress. What's next?

Your first responders might be all the people you need in order to resolve the incident, but more often than not, you need to bring other teams into the incident by paging them. We call this escalation.

The key system in this step is a page rostering and alerting tool like OpsGenie. OpsGenie and similar systems allow you to define on-call rosters so that any given team has a rotation of staff who are expected to be contactable to respond in an emergency. This is superior to needing a specific individual all the time ("get Bob again") because individuals won't always be available (they tend to go on vacation from time to time, change jobs, or burn out when you call them too much). It is also superior to "best efforts" on-call because it's clear which individuals are responsible for responding.

Always include the incident's Jira issue key on a page alert about the incident. This is the key that the person receiving the alert uses to join the incident's chat room.


Delegate

When someone you've escalated to comes online, the IM delegates a role to them. As long as they understand what's required of their role, they can work quickly and effectively as part of the incident team.

The roles we use at Atlassian are:

  • Incident Manager, described in the Overview page.
  • Tech Lead, a senior technical responder. Responsible for developing theories about what's broken & why, deciding on changes, and running the technical team. Works closely with the IM.
  • Communications Manager, a person familiar with public communications, possibly from the customer support team or public relations. Responsible for writing and sending internal and external communications about the incident.

We use the chat room's topic to show who is currently in which role, and this is kept up-to-date if roles change during an incident.

The IM can also devise and delegate roles as required by the incident, for example, multiple tech leads if more than one stream of work is underway, or separate internal and external communications managers.

In complicated or large incidents it's advisable to bring on another qualified incident manager as a backup "sanity check" for the IM. They can focus on specific tasks that free up the IM, such as keeping the timeline.


Send followup comms

You already sent out initial communications, but once the incident team is rolling you have to update staff and customers on the incident.

Updating internal staff is important because it creates a consistent, shared truth about the incident. When something goes wrong, information about it is scarce, especially in the early stages, and if you don't establish a reliable source of truth about what's happened and how you're responding, people tend to jump to their own conclusions.

For internal communications, we follow this pattern:

  • We communicate via our internal Statuspage and via email, as described under "Initial Communications" above.
  • Use the same convention for incident name & email subject formatting ( - - )
  • Open with a 1-2 sentence summary of the current state and impact.
  • A "Current Status" section with 2-4 bullet points.
  • A "Next Steps" section with 2-4 bullet points.
  • State when & where the next round of communications will be sent out.

We use this checklist to review the communications for completeness: 

  • What is the actual impact on customers?
  • How many internal and external customers are affected?
  • If the root cause is known, what is it?
  • If there is an ETA for restoration, what is it?
  • When & where will the next update be?

We encourage our incident managers to be explicit about unknowns in their internal communications. This reduces uncertainty. For example, if you don't know what the root cause is yet, it's far better to say "the root cause is currently unknown" than to simply omit any mention of it.

Updating external customers is important because it helps to build trust. Even though they might be impacted they'll be able to get on with other things as long as they know you'll keep them up-to-date.

For external communications we simply update the incident that we opened on the external Statuspage earlier, transitioning its status as appropriate. We try to keep updates "short and sweet" because external customers aren't interested in the technical details of the incident - they just want to know if it's fixed yet and if not when it will be. Generally, 1-2 sentences will suffice.

Incident communications is an art, and the more practice you have, the better you'll be. In our incident manager training, we role-play a hypothetical incident, draft communications for it, and read them to the rest of the class. This is a good way to build this skill before doing it for real. It's also always a good idea to get someone else to review your communications as a "second opinion" before you send them.


Review

There's no single prescriptive process that will resolve all incidents - if there were, we'd simply automate that and be done with it. Instead, we iterate on the following process to quickly adapt to a variety of incident response scenarios: 

  • Observe what's going on. Share and confirm observations.
  • Develop theories about why it's happening. 
  • Develop experiments that prove or disprove those theories. Carry those out.
  • Repeat.

For example, you might observe a high error rate in a service corresponding with a fault that your regional infrastructure provider has posted on their Statuspage. You might theorize that the fault is isolated to this region, decide to fail over to another region, and observe the results.

The biggest challenges for the IM at this point are around maintaining the team's discipline:

  • Is the team communicating effectively?
  • What are the current observations, theories, and streams of work?
  • Are we making decisions effectively?
  • Are we making changes intentionally and carefully? Do we know what changes we're making?  
  • Are roles clear? Are people doing their jobs? Do we need to escalate to more teams?

In any case, don't panic - it doesn't help. Stay calm and the rest of the team will take that cue.

The IM has to keep an eye on team fatigue and plan team handovers. A dedicated team can risk burning themselves out when resolving complex incidents, so IMs should keep track of how long team members have been awake and how long they've been working on the incident, and decide who will fill their roles next.


Resolve

An incident is resolved when the current or imminent business impact has ended. At that point, the emergency response ends and the team transitions onto any cleanup tasks and the postmortem.

Cleanup tasks can be easily linked and tracked as issue links from the incident's Jira issue.
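A sketch of linking a cleanup task to the incident via the Jira issue-link API, with a placeholder link type and issue keys:

```python
import requests

JIRA_URL = "https://your-company.atlassian.net"  # placeholder site
AUTH = ("bot@example.com", "api-token")          # placeholder credentials

link = {
    "type": {"name": "Relates"},           # assumed link type
    "inwardIssue": {"key": "HOT-1234"},    # the incident
    "outwardIssue": {"key": "CLEAN-42"},   # hypothetical cleanup task
}

resp = requests.post(f"{JIRA_URL}/rest/api/2/issueLink", json=link, auth=AUTH)
resp.raise_for_status()
```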

At Atlassian, we use Jira custom fields to track every incident's start-of-impact time, detection time, and end-of-impact time. We use these fields to calculate time to recovery (TTR), the interval between start and end of impact, and time to detect (TTD), the interval between start of impact and detection. The distribution of your incidents' TTD and TTR is often an important business metric.
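For example, computing TTD and TTR from those three timestamps (the values below are illustrative):

```python
from datetime import datetime

# Illustrative timestamps pulled from the incident's custom fields.
start_of_impact = datetime.fromisoformat("2024-03-01T10:00:00")
detection_time = datetime.fromisoformat("2024-03-01T10:12:00")
end_of_impact = datetime.fromisoformat("2024-03-01T11:30:00")

ttd = detection_time - start_of_impact   # time to detect
ttr = end_of_impact - start_of_impact    # time to recover

print(f"TTD: {ttd}, TTR: {ttr}")  # TTD: 0:12:00, TTR: 1:30:00
```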

We send final internal and external communications when the incident is resolved. The internal communications have a recap of the incident's impact and duration, including how many support cases were raised and other important incident dimensions, and clearly state that the incident is resolved and there will be no further communications about it. The external communications are usually brief, telling customers that service has been restored and we will follow up with a postmortem.

Jira Service Management helps teams execute each step of this response process, from centralizing alerting and organizing incident communication to unifying teams for better collaboration and running incident postmortems for root-cause analysis.