How to run a major incident management process
Managing and resolving high-impact incidents
Major incident management (often known here at Atlassian simply as incident management) is the process used by DevOps and IT Operations teams to respond to an unplanned event or service interruption and restore the service to its operational state.
What is a major incident?
So, what constitutes a major incident? A major incident is an emergency-level outage or loss of service.
The definition of emergency-level varies across organizations. At Atlassian, we have three severity levels and the top two (SEV 1 and SEV 2) are both considered major incidents.
If a customer-facing service is down for all Atlassian customers, that’s a SEV 1 incident. If the same service is down for a subset of customers, that’s a SEV 2. Both fall under the heading of major incident and require an immediate response from our incident management teams.
Any issue that does not interfere with essential tasks is considered a SEV 3 and is not a major incident.
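To make the severity rules above concrete, here is a minimal sketch in Python. The names `Severity`, `classify`, and `is_major` are hypothetical, and the two boolean inputs are a simplification of the judgment a real responder would apply; this is not Atlassian's actual tooling.

```python
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1  # customer-facing service down for all customers
    SEV2 = 2  # customer-facing service down for a subset of customers
    SEV3 = 3  # issue does not interfere with essential tasks


def classify(service_down: bool, all_customers_affected: bool) -> Severity:
    """Map the two questions from the text onto a severity level (assumed inputs)."""
    if not service_down:
        return Severity.SEV3
    return Severity.SEV1 if all_customers_affected else Severity.SEV2


def is_major(severity: Severity) -> bool:
    """SEV 1 and SEV 2 are major incidents; SEV 3 is not."""
    return severity in (Severity.SEV1, Severity.SEV2)


if __name__ == "__main__":
    sev = classify(service_down=True, all_customers_affected=False)
    print(sev.name, is_major(sev))
```

Your own organization may define more levels or use different criteria; the point is only that the rules are written down before an incident happens.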
Defining your major incident management process
The incident lifecycle (also sometimes known as the incident management process) is the path we take to identify, resolve, understand, and avoid repeating incidents.
Incident management processes vary from company to company, but the key to success for any team is clearly defining and communicating severity levels, priorities, roles, and processes up front — before a major incident arises.
To gain a shared understanding of priorities, roles, and processes, any team that’s starting or revisiting their major incident management process should start by getting clear on the answers to questions like:
- What constitutes a major incident for our company/product?
- How will we define severity and priority levels of incidents? If more than one major incident happens at one time, how will we know what to tackle first?
- Who is responsible for handling major incidents? What roles will team members have? How will those roles be defined and communicated?
- What process will teams follow in the event of a major incident? Is there more than one process, depending on the type of incident?
- How often will we communicate with stakeholders, both internal and external? What is our communication plan?
- What will our on-call schedule look like for major incidents? Who is responsible for an incident at 2 a.m.? On a weekend? Over the holidays?
- When and how should we alert our on-call incident manager so that we prioritize quick resolution for major incidents while also avoiding alert fatigue?
Atlassian’s major incident management process
At Atlassian, our incident management process includes detection, raising a new incident, opening comms, assessing, sending initial comms, escalation, delegation, sending follow-up comms, review, and resolution.
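The sequence of steps listed above can be sketched as an ordered enumeration. This is an illustrative model only: the `Stage` and `next_stage` names are hypothetical, and a real incident may skip a step (for example, escalation when the on-call team resolves the issue) or revisit one.

```python
from enum import Enum
from typing import Optional


class Stage(Enum):
    # Declared in the order the text lists them.
    DETECTION = "detection"
    RAISING = "raising a new incident"
    OPENING_COMMS = "opening comms"
    ASSESSING = "assessing"
    INITIAL_COMMS = "sending initial comms"
    ESCALATION = "escalation"
    DELEGATION = "delegation"
    FOLLOW_UP_COMMS = "sending follow-up comms"
    REVIEW = "review"
    RESOLUTION = "resolution"


LIFECYCLE = list(Stage)  # Enum iteration preserves declaration order


def next_stage(stage: Stage) -> Optional[Stage]:
    """Return the stage that follows, or None once the incident is resolved."""
    i = LIFECYCLE.index(stage)
    return LIFECYCLE[i + 1] if i + 1 < len(LIFECYCLE) else None


if __name__ == "__main__":
    for s in LIFECYCLE:
        print(s.value)
```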
Detection
First, an incident is detected by our technology, customer reports, or personnel. Whoever detects the incident (be it a technician who notices the issue or a customer service rep who gets a call from a frustrated client) is responsible for logging the incident in our system and identifying a severity level.
By the time an incident reaches our teams, it’s already got a SEV 1, 2, or 3 attached. We consider SEV levels 1 and 2 to be major incidents, while a SEV 3 indicates a lower-impact incident.
Raising a new incident
Once an incident ticket is created, a notification goes out to the on-call professional responsible for that service.
The page alert we send out at Atlassian includes information on the severity and priority of the incident, as well as a summary, making it clear — at a glance — whether this is the top priority or can wait if another incident is in progress.
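As a rough illustration of an alert that can be read at a glance, the sketch below models a page with the three pieces of information mentioned above: severity, priority, and a summary. The `PageAlert` class, the headline format, and the numeric priority scale (1 is most urgent) are assumptions made for this example, not the format Atlassian actually sends.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class PageAlert:
    severity: int  # 1, 2, or 3, as described in the text
    priority: int  # hypothetical scale: 1 is the most urgent
    summary: str

    def headline(self) -> str:
        """One-line form an on-call responder can scan quickly."""
        return f"[SEV {self.severity} | P{self.priority}] {self.summary}"


def triage_order(alerts: Iterable[PageAlert]) -> List[PageAlert]:
    """Sort so the most severe, highest-priority alert comes first."""
    return sorted(alerts, key=lambda a: (a.severity, a.priority))


if __name__ == "__main__":
    queue = [
        PageAlert(2, 1, "Search degraded for some customers"),
        PageAlert(1, 1, "Login unavailable for all customers"),
    ]
    for alert in triage_order(queue):
        print(alert.headline())
```

Sorting on severity first and priority second is one possible policy for deciding which incident to tackle when two are in progress; your team may weigh them differently.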
Opening comms
Once the incident manager gets an alert, their first order of business is to communicate that the incident fix is in progress. They change the status of the incident to fixing and set up the team’s communication channels.
Assessing
The incident manager has been alerted and the communication channels are open. Next step: assessing the incident itself.
For our teams, this process starts with a series of questions the team has to answer:
- What’s the impact on Atlassian’s customers and employees?
- What are customers seeing?
- How many customers are affected? (Some? All?)
- When did the incident start?
- How many support cases have been opened about this incident?
- Are there other factors at play that impact the severity level or priority or change the way we approach the incident? (For example, security concerns or social media PR crises.)
Once we’ve answered those questions, we can confidently move forward with diagnostics and proposed fixes or change the SEV level and priority level of an incident as needed.
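The assessment questions above can be captured as a simple record so the team can see which ones are still open. This is a sketch under assumptions: the `Assessment` class and its field names are hypothetical and simply mirror the questions in the list.

```python
from dataclasses import dataclass, fields
from datetime import datetime
from typing import List, Optional


@dataclass
class Assessment:
    # Each field mirrors one assessment question; None means "not yet answered".
    impact: Optional[str] = None
    customer_symptoms: Optional[str] = None
    customers_affected: Optional[str] = None  # for example "some" or "all"
    started_at: Optional[datetime] = None
    support_cases: Optional[int] = None
    other_factors: Optional[str] = None

    def unanswered(self) -> List[str]:
        """Names of questions still open (a value of 0 counts as answered)."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]

    def is_complete(self) -> bool:
        return not self.unanswered()


if __name__ == "__main__":
    a = Assessment(impact="Customers cannot log in", support_cases=0)
    print(a.unanswered())
```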
Sending initial comms
Once we’ve confirmed that the incident is real, communication with our customers and employees becomes top priority. As we say in our handbook:
“The goal of initial internal communication is to focus the incident response on one place and reduce confusion. The goal of external communication is to tell customers that you know something’s broken and you’re looking into it as a matter of urgency.”
Speedy, accurate communication helps build and keep customer trust.
We have a strategic incident communication plan, use Statuspage to communicate our incidents, and provide regular status updates that follow a simple format. We also send an email to a set list of stakeholders that includes our engineering leadership, major incident managers, and other key internal staff.
Escalation
Sometimes, an incident is resolved quickly by the on-call team. But in cases where that doesn’t happen, the next step is to escalate the issue to another expert or team of experts better suited to resolve this specific incident.
Delegation
Once the issue has been escalated to someone new, the incident manager delegates a role to them. At Atlassian, these roles are pre-set, so team members can quickly understand what’s expected of them.
Sometimes major incidents require a single incident manager and a small team. Other times, a situation may call for multiple tech leads or even multiple incident managers. The original incident manager is tasked with figuring out when that’s the case and bringing on the appropriate people.
Sending follow-up comms
As the incident continues to progress, another round of communication outside the tech team will help keep customers and employees calm, trusting, and in the loop.
Review
Unfortunately, when it comes to incident resolution, there’s no one-size-fits-all approach, which is why at this stage of the process we take the time to:
- Observe what’s going on, sharing and confirming observations with the team
- Develop theories about why it’s happening (and how we can fix it)
- Develop and execute experiments that prove or disprove our theories
Throughout this process, the incident manager keeps a close eye on how things are going. Are particular team members overtasked? Does someone need a break? Do we need to bring in a fresh set of eyes? More delegation happens as needed.
Resolution
Our incident handbook defines resolution as “when the current or imminent business impact has ended.”
At this point, the emergency has passed and the team transitions into clean-ups and postmortems.
Our incident lifecycle ends when the incident is resolved, but that isn’t the end of our process at Atlassian. We also want to do everything in our power to ensure an incident doesn’t repeat, which is why the next step is a blameless postmortem designed to identify the cause of an incident and help us mitigate our risk in the future.
Roles and responsibilities
Roles and responsibilities will vary based on your organization’s culture, team size, on-call schedules, and more. Some common major incident roles include:
Incident manager: The person responsible for overseeing the resolution of the incident.
Tech lead: A senior-level tech pro tasked with figuring out what’s broken and why, determining the best course of action, and running the tech team.
Communications manager: A communications pro (often from the PR or customer support teams) responsible for communicating with internal and external customers impacted by the incident.
Customer support lead: The person in charge of making sure incoming tickets, phone calls, and tweets about the incident get a timely, appropriate response.
Social media lead: A social media pro in charge of communicating about the incident on social channels.
Other common roles include:
Root cause analyst or problem manager: The person responsible for going beyond the incident’s resolution to identify the root cause and any changes that need to be made to avoid the issue in the future.
Major incident investigation board: A group responsible for investigation and change management.