
The 7 stages of effective incident response

Tips for responding during the heat of an incident

When one of your services is on fire, there’s no time to waste, especially if that fire is blocking your users or business from getting stuff done. Rapid and efficient incident response involves a series of steps to drive the incident from detection to resolution.

The following sections describe an incident response process: what to do between realizing a service is down and getting it up and running again. The process is based on the material in our own Incident Handbook.

In this article we’ll cover the seven key stages of incident response:

  1. Detect the incident
  2. Set up team communication channels
  3. Assess the impact and apply a severity level
  4. Communicate with customers
  5. Escalate to the right responders
  6. Delegate incident response roles
  7. Resolve the incident

[Figure: Incident response workflow]

Detect the incident

Ideally, monitoring and alerting tools will detect and inform your team about an incident before your customers even notice. Sometimes, though, you'll first learn about an incident from Twitter or customer support tickets.

No matter how the incident is detected, your first step should be to record that a new incident is open in a tool for tracking incidents. This could be an ops-specific tool like Opsgenie Enterprise, or a broader tracking tool like Jira.

Set up team communication channels

One of the first things the incident manager (IM) does when they come online is set up the incident team's communication channels. The goal at this point is to establish and focus all incident team communications in well-known places, such as: 

  • Chat room in Slack or another messaging service.
  • Video chat in a conferencing app like Zoom (or if you're all in the same place, gather the team in a physical room).

We prefer using both video chat and a text chat tool during incidents, since both excel at different things. Video chat is great for creating a shared mental picture of the incident quickly through group discussion. And Slack helps generate a timestamped record of the incident, along with collected links to screenshots, URLs, and dashboards.

Slack and most other chat tools allow users to set a room topic. The incident manager should use this field for information about the incident and useful links.

Finally, the IM sets their own personal chat status to the issue key of the incident they are managing. This lets their colleagues know that they're busy managing an incident.

Assess the impact and apply a severity level

After the incident team's communication channels are set up, it's time to assess the incident so the team can decide what to tell people about it and who needs to fix it.

We have the following set of questions that IMs ask their teams:

  • What is the impact to customers (internal or external)?
  • What are customers seeing?
  • How many customers are affected (some, all)?
  • When did it start?
  • How many support cases have customers opened?
  • Are there other factors, e.g. Twitter, security, or data loss?

The next step typically is to assign a severity level.

Incident response severity levels

Severity 1
Description: A critical incident with very high impact
Examples:

  • A customer-facing service is down for all users
  • Confidentiality or privacy is breached
  • Customer data loss

Severity 2
Description: A major incident with significant impact
Examples:

  • A customer-facing service is unavailable for some, but not all, customers
  • Core functionality is significantly impacted

Severity 3
Description: A minor incident with low impact
Examples:

  • A minor inconvenience to customers, with a workaround available
  • Usable performance degradation
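
As a rough illustration, the severity definitions above could be encoded so that the answers from the assessment map to a level consistently. This is a minimal sketch, not a tool described in the handbook: the `Assessment` fields and the `classify` function are hypothetical names, and the rules are only one reading of the examples listed for each level.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1  # critical incident with very high impact
    SEV2 = 2  # major incident with significant impact
    SEV3 = 3  # minor incident with low impact


@dataclass
class Assessment:
    # Hypothetical fields that loosely mirror the assessment questions.
    service_down: bool = False
    all_customers_affected: bool = False
    privacy_breach: bool = False
    data_loss: bool = False
    core_functionality_impacted: bool = False


def classify(a: Assessment) -> Severity:
    """Map an assessment to a severity level (assumed rules, most severe first)."""
    if a.privacy_breach or a.data_loss:
        return Severity.SEV1
    if a.service_down and a.all_customers_affected:
        return Severity.SEV1
    if a.service_down or a.core_functionality_impacted:
        return Severity.SEV2
    # Anything else is treated as a minor incident in this sketch.
    return Severity.SEV3
```

A team would replace these rules with its own documented definitions; the point is only that a written rule set gives every responder the same answer for the same facts.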

Using a numbering system for severity levels helps quickly define and communicate the incident. All someone has to say is “we might have a sev 1 happening,” and the right people can immediately understand the seriousness of the matter even before getting additional information.

Severity levels can also help build guidelines for response expectations.

At some companies, for example, severity 3 incidents can be addressed during business hours, while severity 1 and 2 require paging team members for an immediate fix.
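
The response expectation in that example can be sketched as a small policy function. The business hours (09:00 to 17:00, Monday to Friday), the function name `response_action`, and the returned action labels are all assumptions made for illustration; each organization would substitute its own.

```python
from datetime import datetime, time

# Assumed business hours; adjust to your organization.
BUSINESS_START = time(9, 0)
BUSINESS_END = time(17, 0)


def response_action(severity: int, now: datetime) -> str:
    """Return the expected response for an incident of the given severity."""
    if severity not in (1, 2, 3):
        raise ValueError(f"unknown severity: {severity}")
    if severity <= 2:
        # Severity 1 and 2 require paging team members for an immediate fix.
        return "page"
    # Severity 3 can be addressed during business hours.
    is_weekday = now.weekday() < 5  # Monday is 0, Sunday is 6
    in_hours = is_weekday and BUSINESS_START <= now.time() < BUSINESS_END
    return "handle_now" if in_hours else "queue_for_business_hours"
```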

Incident severity definitions should be documented and consistent throughout the organization.

Communicate with customers

Once a team establishes that the incident is real, it’s best to communicate to stakeholders internally and externally as soon as possible.

The goal of internal communication is to focus the incident response in one place and reduce confusion.

The goal of external communication is to tell customers the team is aware something's broken and you're looking into it. Communicating quickly and accurately helps build trust with customers and the rest of the organization.

Many teams use Statuspage for incident communications both internally and externally. Here are two simple templates for updating an internal or external Statuspage:

Internal Statuspage
<Incident issue key> - <Severity> - <Incident summary>

We are investigating an incident affecting <product x>, <product y> and <product z>. We will provide updates via email and Statuspage shortly.

External Statuspage
Investigating issues with <product>

We are investigating issues with <product> and will provide updates here soon.
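
Templates like these are easy to fill in programmatically so that the first update goes out quickly and consistently. The sketch below only builds the text; it does not call any Statuspage API, and the function names and the sample values in the usage note (such as the issue key `INC-1`) are hypothetical.

```python
from typing import Sequence, Tuple

INTERNAL_TITLE = "{issue_key} - {severity} - {summary}"
INTERNAL_BODY = (
    "We are investigating an incident affecting {products}. "
    "We will provide updates via email and Statuspage shortly."
)
EXTERNAL_TITLE = "Investigating issues with {product}"
EXTERNAL_BODY = (
    "We are investigating issues with {product} "
    "and will provide updates here soon."
)


def join_products(products: Sequence[str]) -> str:
    """Join names as 'x, y and z', matching the internal template's wording."""
    names = list(products)
    if not names:
        raise ValueError("at least one product is required")
    if len(names) == 1:
        return names[0]
    return ", ".join(names[:-1]) + " and " + names[-1]


def internal_update(issue_key: str, severity: str, summary: str,
                    products: Sequence[str]) -> Tuple[str, str]:
    """Return (title, body) for the internal update."""
    title = INTERNAL_TITLE.format(
        issue_key=issue_key, severity=severity, summary=summary
    )
    body = INTERNAL_BODY.format(products=join_products(products))
    return title, body


def external_update(product: str) -> Tuple[str, str]:
    """Return (title, body) for the external update."""
    return (
        EXTERNAL_TITLE.format(product=product),
        EXTERNAL_BODY.format(product=product),
    )
```

For example, `internal_update("INC-1", "Sev 2", "Login errors", ["Jira", "Confluence"])` produces the title `INC-1 - Sev 2 - Login errors` and a body that names both products.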

Escalate to the right responders

Sometimes the initial responders are the ones who resolve the incident. More often than not, those responders need to bring other teams into the incident by paging them using an alerting tool like Opsgenie.

Alerting tools allow teams to define on-call rosters to create a rotation of staff who are expected to be reachable during an incident. This is better than relying on a specific person every time there’s an incident. That same person won't always be available (they take vacations, change jobs, or burn out when you call them too much).
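
To make the idea of a rotation concrete, here is a minimal sketch of a round-robin on-call roster. It is not how Opsgenie or any particular alerting tool implements schedules; the function name, the weekly shift length, and the sample names in the usage note are assumptions for illustration.

```python
from datetime import date
from typing import Sequence


def on_call_for(roster: Sequence[str], rotation_start: date, day: date,
                shift_days: int = 7) -> str:
    """Return who is on call on `day`, rotating through `roster` in order."""
    if not roster:
        raise ValueError("roster must not be empty")
    if shift_days <= 0:
        raise ValueError("shift_days must be positive")
    if day < rotation_start:
        raise ValueError("day is before the rotation started")
    # Count whole shifts since the rotation began, then wrap around the roster.
    elapsed_shifts = (day - rotation_start).days // shift_days
    return roster[elapsed_shifts % len(roster)]
```

With a roster of `["alice", "bob", "carol"]` starting on a Monday, each person covers one week in turn and the cycle repeats after three weeks, so no single person is the default contact for every incident.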

Delegate incident response roles

After a new incident responder is paged and comes online, the incident manager delegates a role to them. It’s important that they understand what's required of their role, and how to contribute to the incident team quickly and effectively.

Another advantage of defining roles is that it allows more adaptability and flexibility. As long as a person knows how to perform a certain role, they can take that role for any incident.

Three key incident response roles

Incident manager

Each incident is driven by the incident manager, who has overall responsibility and authority for the incident.

The incident manager has authority to take any action necessary to resolve the incident, including paging anyone in the organization and keeping those involved in an incident focused on restoring service as quickly as possible.

Tech lead

A senior technical responder. The tech lead develops theories about what's broken and why, decides on changes, and runs the technical team. This person works closely with the incident manager.

Communications manager

The person familiar with public communications, possibly from the customer support team or public relations. They are responsible for writing and sending both internal and external communications about the incident.

Resolve the incident

There's no one-size-fits-all process that can resolve every incident. If there were, we'd simply automate that and be done with it. Instead, take inspiration from the scientific method. Iterate on the following process to quickly adapt to a variety of incident response scenarios:

  • Observe what's going on. Share and confirm observations.
  • Develop theories about why it's happening.
  • Develop and execute experiments to prove or disprove your theories.
  • Repeat until the incident is resolved.

An incident is resolved when the current or imminent business impact has ended. At that point, the emergency response process ends and the team transitions to any cleanup tasks and the postmortem.

We send final internal and external communications when the incident is resolved. The internal communications include a recap of the incident's impact and duration, such as how many support cases were raised and other important incident dimensions. They should also clearly state that the incident is resolved and there will be no further communications about it. The external communications are usually brief, telling customers that service has been restored and the team will follow up with a postmortem.