Incident management for high-velocity teams
The 7 stages of effective incident response
Incident response is an organization’s process of reacting to IT threats such as cyberattacks, security breaches, and server downtime.
Other IT Ops and DevOps teams may refer to the practice as major incident management or simply incident management.
The following sections describe an incident response process, covering what to do from the moment you realize a service is down until it's up and running again. They are based on the material in our own Incident Handbook.
In this article we’ll cover the seven key stages of incident response:
- Detect the incident
- Set up team communication channels
- Assess the impact and apply a severity level
- Communicate with customers
- Escalate to the right responders
- Delegate incident response roles
- Resolve the incident
Detect the incident
Ideally, monitoring and alerting tools will detect an incident and inform your team before your customers even notice. Sometimes, though, you'll first learn about an incident from Twitter or customer support tickets.
No matter how the incident is detected, your first step should be to record that a new incident is open in a tool for tracking incidents. In an incident management solution such as Jira Service Management, alerting and communication are integrated with your tracking tool.
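As a rough illustration of "record the incident first," the sketch below models a minimal incident record with a timestamped timeline. The `Incident` class, its field names, and the `INC-101` key are hypothetical and are not taken from Jira Service Management or any other tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class Incident:
    """A minimal incident record; all field names are illustrative assumptions."""
    key: str            # hypothetical issue key, e.g. "INC-101"
    summary: str
    detected_via: str   # e.g. "monitoring", "support ticket", "social media"
    opened_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    resolved_at: Optional[datetime] = None
    timeline: List[str] = field(default_factory=list)

    def log(self, note: str) -> None:
        """Append a timestamped note to the incident timeline."""
        stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
        self.timeline.append(f"{stamp} {note}")

    @property
    def is_open(self) -> bool:
        return self.resolved_at is None


def open_incident(key: str, summary: str, detected_via: str) -> Incident:
    """Create the record first, whatever the detection source was."""
    incident = Incident(key=key, summary=summary, detected_via=detected_via)
    incident.log(f"Incident opened (detected via {detected_via})")
    return incident
```

Calling `open_incident("INC-101", "Checkout errors", "monitoring")` returns an open record whose timeline already contains the opening entry, so later notes have somewhere to go.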
Set up team communication channels
One of the first things the incident manager (IM) does when they come online is set up the incident team's communication channels. The goal at this point is to establish and focus all incident team communications in well-known places, such as:
- Chat room in Slack or another messaging service.
- Video chat in a conferencing app like Zoom (or if you're all in the same place, gather the team in a physical room).
We prefer using both video chat and a text chat tool during incidents, since both excel at different things. Video chat is great for creating a shared mental picture of the incident quickly through group discussion. And Slack helps generate a timestamped record of the incident, along with collected links to screenshots, URLs, and dashboards.
Slack and most other chat tools allow users to set a room topic. The incident manager should use this field for information about the incident and useful links.
Finally, the IM sets their own personal chat status to the issue key of the incident they are managing. This lets their colleagues know that they're busy managing an incident.
Assess the impact and apply a severity level
After the incident team's communication channels are set up, it's time to assess the incident so the team can decide what to tell people about it and who needs to fix it.
We have the following set of questions that IMs ask their teams:
- What is the impact to customers (internal or external)?
- What are customers seeing?
- How many customers are affected (some, all)?
- When did it start?
- How many support cases have customers opened?
- Are there other factors, e.g. Twitter, security, or data loss?
The next step typically is to assign a severity level.
Incident response severity levels
Severity 1: A critical incident with very high impact
- A customer-facing service is unavailable for all users
- Confidentiality or privacy is breached
- Customer data loss
Severity 2: A major incident with significant impact
- A customer-facing service is unavailable for some, but not all, customers
- Core functionality is significantly impacted
Severity 3: A minor incident with low impact
- A minor inconvenience to customers, workaround available
- Usable performance degradation
Using a numbering system for severity levels helps quickly define and communicate the incident. All someone has to say is “we might have a sev 1 happening,” and the right people can immediately understand the seriousness of the matter even before getting additional information.
Severity levels can also help build guidelines for response expectations.
At some companies, for example, severity 3 incidents can be addressed during business hours, while severity 1 and 2 require paging team members for an immediate fix.
Incident severity definitions should be documented and consistent throughout the organization.
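One way to keep severity definitions documented and consistent is to encode them once and reuse them. The sketch below is a simplified reading of the three levels described above; the function names, the `outage_scope` values, and the exact decision order are assumptions made for illustration, and the paging rule mirrors the example policy mentioned in the text rather than a universal standard.

```python
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1  # critical incident, very high impact
    SEV2 = 2  # major incident, significant impact
    SEV3 = 3  # minor incident, low impact


def assess_severity(
    outage_scope: str,                        # "all", "some", or "none"
    data_breach: bool = False,                # confidentiality or privacy breached
    data_loss: bool = False,                  # customer data lost
    core_functionality_impacted: bool = False,
) -> Severity:
    """Map assessment answers to a severity level (illustrative rules only)."""
    if outage_scope not in {"all", "some", "none"}:
        raise ValueError(f"unknown outage scope: {outage_scope!r}")
    if outage_scope == "all" or data_breach or data_loss:
        return Severity.SEV1
    if outage_scope == "some" or core_functionality_impacted:
        return Severity.SEV2
    return Severity.SEV3


def requires_paging(severity: Severity) -> bool:
    """Example policy: sev 1 and 2 page responders, sev 3 waits for business hours."""
    return severity <= Severity.SEV2
```

Using an `IntEnum` keeps the "lower number is more serious" convention explicit, so comparisons such as `severity <= Severity.SEV2` read the same way people talk about a "sev 1."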
Communicate with customers
Once a team establishes that the incident is real, it’s best to communicate to stakeholders internally and externally as soon as possible.
The goal of internal communication is to focus the incident response in one place and reduce confusion.
The goal of external communication is to tell customers that you know something's broken and that you're looking into it. Communicating quickly and accurately helps build trust with customers and the rest of the organization.
Many teams use Statuspage for incident communications both internally and externally. Here are two simple templates for updating an internal or external status page:
We are investigating an incident affecting <product>.
Investigating issues with <product>
We are investigating issues with <product>.
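Status updates like the ones above are easy to fill in consistently with a small templating helper. This is a minimal sketch using Python's standard `string.Template`; the `$products` placeholder, the constant names, and the product names in the usage note are assumptions for illustration, and nothing here calls the Statuspage API.

```python
from string import Template
from typing import Iterable

# Wording taken from the templates above; the $products placeholder is an assumption.
INCIDENT_NOTICE = Template("We are investigating an incident affecting $products.")
UPDATE_TITLE = Template("Investigating issues with $products")
UPDATE_MESSAGE = Template("We are investigating issues with $products.")


def join_names(names: Iterable[str]) -> str:
    """Join product names in plain English: 'A', 'A and B', or 'A, B, and C'."""
    items = list(names)
    if not items:
        raise ValueError("at least one product name is required")
    if len(items) == 1:
        return items[0]
    if len(items) == 2:
        return f"{items[0]} and {items[1]}"
    return ", ".join(items[:-1]) + f", and {items[-1]}"


def render_update(template: Template, products: Iterable[str]) -> str:
    """Fill a status template with the affected product names."""
    return template.substitute(products=join_names(products))
```

For example, `render_update(INCIDENT_NOTICE, ["Search", "Billing", "Login"])` produces "We are investigating an incident affecting Search, Billing, and Login."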
Escalate to the right responders
Sometimes the initial responders are the ones who resolve the incident. More often than not, those responders need to bring other teams into the incident by paging them using an alerting tool. With Jira Service Management, responders can take their pick as to what alerting method they use, or even use them all in one central location.
Alerting tools allow teams to define on-call rosters to create a rotation of staff who are expected to be reachable during an incident. This is better than relying on a specific person every time there’s an incident. That same person won't always be available (they take vacations, change jobs, or burn out when you call them too much).
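To make the idea of a rotation concrete, here is a minimal sketch of how an on-call roster might resolve to a person for a given day. It assumes fixed-length shifts and a simple "skip to the next person" fallback when someone is unavailable; real alerting tools offer richer schedules and overrides, and the function and parameter names here are hypothetical.

```python
from datetime import date
from typing import AbstractSet, Sequence


def on_call_for(
    day: date,
    roster: Sequence[str],
    rotation_start: date,
    shift_days: int = 7,
    unavailable: AbstractSet[str] = frozenset(),
) -> str:
    """Return who is on call on `day`, skipping anyone marked unavailable."""
    if not roster:
        raise ValueError("roster must not be empty")
    if shift_days < 1:
        raise ValueError("shift_days must be at least 1")
    if day < rotation_start:
        raise ValueError("day is before the rotation started")

    # Number of whole shifts elapsed since the rotation began.
    shift_index = (day - rotation_start).days // shift_days

    # Start with the scheduled person, then fall through the roster in order.
    for offset in range(len(roster)):
        candidate = roster[(shift_index + offset) % len(roster)]
        if candidate not in unavailable:
            return candidate
    raise LookupError("nobody on the roster is available")
```

With a three-person roster and weekly shifts, the duty returns to the first person in the fourth week, and passing `unavailable={"ben"}` during Ben's week hands the shift to whoever follows him.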
Delegate incident response roles
After a new incident responder is paged and comes online, the incident manager delegates a role to them. It’s important they understand what's required of their role, and how to contribute to the incident team quickly and effectively.
Another advantage to defining roles is it allows more adaptability and flexibility. As long as a person knows how to perform a certain role, they can take that role for any incident.
Three key incident response roles
Incident manager: Each incident is driven by the incident manager, who has overall responsibility and authority for the incident. The incident manager has authority to take any action necessary to resolve the incident, including paging anyone in the organization and keeping those involved in an incident focused on restoring service as quickly as possible.
Tech lead: A senior technical responder. The tech lead develops theories about what's broken and why, decides on changes, and runs the technical team. This person works closely with the incident manager.
Communications manager: The person familiar with public communications, possibly from the customer support team or public relations. They are responsible for writing and sending both internal and external communications about the incident.
Resolve the incident
There's no one-size-fits-all process that can resolve every incident. If there were, we'd simply automate that and be done with it. Instead, take inspiration from the scientific method. Iterate on the following process to quickly adapt to a variety of incident response scenarios:
- Observe what's going on. Share and confirm observations.
- Develop theories about why it's happening.
- Develop and execute experiments to prove or disprove your theories.
- Repeat until the incident is resolved.
An incident is resolved when the current or imminent business impact has ended. At that point, the emergency response process ends and the team moves on to any cleanup tasks and the postmortem.
We send final internal and external communications when the incident is resolved. The internal communication recaps the incident's impact and duration, including how many support cases were raised and other important incident dimensions. It should also clearly state that the incident is resolved and that there will be no further communications about it. The external communication is usually brief, telling customers that service has been restored and that the team will follow up with a postmortem.
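The internal recap described above is mostly arithmetic plus a fixed closing statement, so it can be drafted automatically. The sketch below is one possible shape for that message; the function names, the `INC-101` key in the usage note, and the exact wording are assumptions for illustration.

```python
from datetime import datetime, timedelta


def format_duration(delta: timedelta) -> str:
    """Render a duration as hours and minutes, e.g. '2h 35m'."""
    total_minutes = int(delta.total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}h {minutes:02d}m"


def internal_recap(
    key: str,
    summary: str,
    started_at: datetime,
    resolved_at: datetime,
    support_cases: int,
) -> str:
    """Draft the final internal message: impact, duration, and a clear close."""
    if resolved_at < started_at:
        raise ValueError("resolved_at must not be earlier than started_at")
    duration = format_duration(resolved_at - started_at)
    return (
        f"{key} is resolved: {summary}. "
        f"Impact lasted {duration}; {support_cases} support cases were raised. "
        "There will be no further communications about this incident."
    )
```

For an incident that started at 10:00 and was resolved at 12:35 the same day, `format_duration` yields "2h 35m", and the recap ends with the explicit statement that no further updates will follow.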
There are many moving parts to the incident response process. Keeping track of each step with seamless communication is easy with an incident management tool like Jira Service Management. Centralize alerts and unify teams with flexibility to resolve incidents quickly.