Incident management

Learn modern incident management with tutorials, tips, and best practices

What is incident management?

Incident management is the process used by DevOps and IT Operations teams to respond to an unplanned event or service interruption and restore the service to its operational state.

At Atlassian, we define an incident as an event that causes disruption to or a reduction in the quality of a service which requires an emergency response. Teams who follow ITIL or ITSM practices may use the term major incident for this instead.

An incident is resolved when the affected service resumes functioning in its intended state. Resolution covers only the tasks required to mitigate impact and restore functionality.

These types of incidents can vary widely in severity, ranging from an entire global web service crashing to a small number of users having intermittent errors.

The importance of incident management

Incident management is one of the most critical processes an organization needs to get right. Service outages can be costly to the business and teams need an efficient way to respond to and resolve these issues quickly.

Many organizations report downtime costing more than $300,000 per hour, according to Gartner. For some web-based services, that number can be dramatically higher.

Teams need a reliable method to prioritize incidents, get to resolution faster, and offer better service for users.

When teams are facing an incident they need a plan that helps them:

  • Respond effectively so they can recover fast.
  • Communicate clearly to customers, stakeholders, service owners, and others in the organization.
  • Collaborate effectively to solve the issue faster as a team and remove barriers that prevent them from resolving the issue.
  • Continuously improve by learning from these outages and applying the lessons to improve the service and refine their process.

Want to see how Atlassian handles major incidents? We’ve published our internal incident management handbook. Anyone is welcome to learn from it, adapt it, and use it however they see fit.

Types of incident management processes

Different types of companies tend to gravitate toward different types of incident management processes. No single process is best for all companies, so you’re likely to see various approaches across different companies.

Many teams rely on a more traditional IT-style incident management process, such as those outlined in ITIL certifications. Other teams lean toward a Site Reliability Engineering (SRE) or DevOps style of incident management.

IT incident management process

An incident management process helps IT teams investigate, record, and resolve service interruptions or outages. The ITIL incident management workflow aims to reduce downtime and minimize impact on employee productivity from incidents. Using templates designed to manage incidents, you can create a repeatable incident management workflow, which ensures teams log, diagnose, and resolve incidents—and have a record of their activities.

The ITIL framework is chiefly used by IT teams running services inside businesses. Typically teams take what they need from ITIL, which covers almost every type of incident, issue, and process IT teams might face, and leave the rest. ITIL is great when teams need to focus on cultivating a culture of active troubleshooting. The prescribed processes help teams track incidents and actions in a consistent manner, which improves reporting and analysis, and can lead to a healthier service and a more successful team.

Steps in the IT incident management process

Identify an incident and log it

An incident can come from anywhere: an employee, a customer, a vendor, or a monitoring system. No matter the source, the first two steps are simple: someone identifies an incident, then someone logs it. These incident logs (i.e., tickets) typically include:

  • The name of the person reporting the incident
  • The date and time the incident is reported
  • A description of the incident (what is down or not working properly)
  • A unique identification number assigned to the incident, for tracking
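The four fields above map naturally onto a small record type. The following is a minimal sketch in Python, not the schema of any particular ticketing tool; the class name `IncidentTicket`, the `INC-` ID format, and the sample values are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from itertools import count

# Hypothetical ID generator; a real tracker would assign IDs itself.
_next_id = count(1)


@dataclass
class IncidentTicket:
    """Minimal incident log entry holding the four fields listed above."""
    reporter: str        # name of the person reporting the incident
    description: str     # what is down or not working properly
    # Date and time the incident is reported (UTC, set at creation).
    reported_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )
    # Unique identification number, used for tracking.
    incident_id: str = field(
        default_factory=lambda: f"INC-{next(_next_id):04d}"
    )


ticket = IncidentTicket(
    reporter="Sam Example",
    description="Checkout page returns HTTP 500 for some users",
)
print(ticket.incident_id, ticket.reporter, ticket.reported_at.isoformat())
```

Generating the timestamp and ID at creation time keeps the reporter from having to supply them, which is one way to make sure no ticket is logged without them.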

Categorize

Assign a logical, intuitive category (and subcategory, as needed) to every incident. This helps you analyze your data for trends and patterns, which is a critical part of effective problem management and preventing future incidents.

Prioritize

Every incident must be prioritized. Start by assessing its impact on the business, the number of people who will be impacted, any applicable SLAs, and the potential financial, security, and compliance implications of the incident. Then compare this incident to all other open incidents to determine its relative priority.
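One way to make that comparison repeatable is to turn the assessment into a score and sort open incidents by it. The sketch below is illustrative only: the `Assessment` fields, the 1 to 3 impact scale, the user-count thresholds, and the weights are assumptions, not values prescribed by ITIL or by this guide.

```python
from dataclasses import dataclass


@dataclass
class Assessment:
    incident_id: str
    business_impact: int           # 1 (low) to 3 (high); hypothetical scale
    users_affected: int
    sla_at_risk: bool
    security_or_compliance: bool   # any security or compliance implication


def priority_score(a: Assessment) -> int:
    """Higher score means higher priority. Weights are illustrative only."""
    score = a.business_impact * 10
    if a.users_affected >= 1000:
        score += 10
    elif a.users_affected >= 100:
        score += 5
    if a.sla_at_risk:
        score += 10
    if a.security_or_compliance:
        score += 15
    return score


open_incidents = [
    Assessment("INC-0001", 1, 12, sla_at_risk=False, security_or_compliance=False),
    Assessment("INC-0002", 3, 5000, sla_at_risk=True, security_or_compliance=False),
    Assessment("INC-0003", 2, 150, sla_at_risk=False, security_or_compliance=True),
]

# Rank every open incident against the others, highest priority first.
ranked = sorted(open_incidents, key=priority_score, reverse=True)
print([a.incident_id for a in ranked])
```

With these sample values the scores are 10, 50, and 40, so INC-0002 is handled first, then INC-0003, then INC-0001.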

Respond

  • Initial diagnosis: Ideally, your front-line support team can see an incident through from diagnosis to closure, but if they can’t, the next step is to log all the pertinent information and escalate to the next-tier team.
  • Escalate: The next team takes the logged data and continues with the diagnosis process, and, if this next team can’t diagnose the incident, it escalates to the next team.
  • Communicate: The team regularly shares updates with impacted internal and external stakeholders.
  • Investigation and diagnosis: This continues until the nature of the incident is identified. Sometimes teams bring in outside resources or members of other departments to consult and assist with the resolution.
  • Resolution and recovery: In this step, the team arrives at a diagnosis and performs the necessary steps to resolve the incident. Recovery refers to the time it may take for operations to be fully restored, since some fixes (such as bug patches) may require testing and deployment even after the proper resolution has been identified.
  • Closure: If the incident was escalated, it is finally passed back to the service desk to be closed. To maintain quality and ensure a smooth process, only service desk employees are allowed to close incidents, and the incident owner should check with the person who reported the incident to confirm that the resolution is satisfactory and the incident can, in fact, be closed.
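The escalation and closure rules above can be sketched as a small state machine. This is a simplified illustration under stated assumptions: the tier names, the `Incident` class, and its method names are hypothetical, and the communication step is left out.

```python
from dataclasses import dataclass, field

TIERS = ["service desk", "tier 2", "tier 3"]  # hypothetical tier names


@dataclass
class Incident:
    incident_id: str
    tier: int = 0            # index into TIERS; starts at the front line
    status: str = "open"     # open -> resolved -> closed
    log: list = field(default_factory=list)

    def escalate(self, notes: str) -> None:
        """Record the diagnosis so far, then hand off to the next tier."""
        if self.tier + 1 >= len(TIERS):
            raise ValueError("already at the highest tier")
        self.log.append(f"{TIERS[self.tier]}: {notes}")
        self.tier += 1

    def resolve(self, notes: str) -> None:
        """Record the fix applied by whichever tier currently owns the incident."""
        self.log.append(f"{TIERS[self.tier]}: resolved: {notes}")
        self.status = "resolved"

    def close(self, closed_by: str, reporter_confirmed: bool) -> None:
        """Only the service desk closes, and only after the reporter confirms."""
        if self.status != "resolved":
            raise ValueError("incident is not resolved yet")
        if closed_by != TIERS[0]:
            raise PermissionError("only the service desk may close incidents")
        if not reporter_confirmed:
            raise ValueError("reporter has not confirmed the resolution")
        self.tier = 0          # passed back to the service desk
        self.status = "closed"


incident = Incident("INC-0002")
incident.escalate("restart did not help; logs attached")
incident.resolve("rolled back bad configuration")
incident.close(closed_by="service desk", reporter_confirmed=True)
print(incident.status, incident.log)
```

Raising an error on an out-of-order transition, rather than silently ignoring it, is a design choice that keeps the record of who did what consistent with the process.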

Incidents, problems, and changes: What’s the difference?

There are different types of issues IT teams typically encounter, and we classify them so we can apply the appropriate management techniques to them.

  • Service request – A formal request from a customer for something to be provided, e.g. provisioning a new laptop.
  • Incident – An unplanned interruption to an IT service or reduction in the service quality, e.g. the website goes down.
  • Problem – The underlying root cause of an incident, e.g. a bad configuration of a server. These are the things you want to stay on top of so you don’t have full-on incidents.
  • Change – An action that you take, which can be standard, normal, or emergency. A standard change has an established procedure. A normal change is often non-trivial and has to go through an approval process. An emergency change is enacted with immediacy, and is, ideally, tested before it’s rolled out.

DevOps and SRE incident management process

With a DevOps or SRE approach to incident management, the team that builds the service also runs it—and fixes it if it breaks. This approach has exploded in popularity alongside the growth of always-on cloud services, globally-accessed web applications, microservices, and software as a service.

Increasingly the software you rely on for life and work is not being hosted on a server in the same physical location as you. It’s likely a web-accessed application deployed in a data center for thousands or millions of users around the globe. For teams tasked with running these services, agility and speed are paramount. And any downtime has the potential to affect thousands of organizations, not just one.

An advantage of the “you build it, you run it” approach is that it offers the flexibility agile teams need, but it can also leave it unclear who is responsible for what and when. DevOps teams can be comfortable, and successful, with less structured development processes. But it’s best to standardize on a core set of processes for incident management so there is no question how to respond in the heat of an incident, and so you can track issues and report how they’re resolved.

Three beliefs of DevOps incident management teams

  • Take turns being on call: Rather than certain team members specializing in being on call, DevOps teams typically rotate through an on-call schedule where all members share the burden of possibly being woken at night to respond to an incident.
  • The engineer who built it is the best person to fix it: The central idea of the “you build it, you run it” ethos is that the people most familiar with the service (the builders) are the ones best equipped to fix an outage.
  • Build with speed, but practice accountability: When engineers know that they and their teammates are on the hook during outages, there’s added incentive to make sure you’re deploying quality code.

This approach helps ensure fast response times and faster feedback to the teams who need to know how to build a reliable service.
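The first belief, taking turns being on call, amounts to a round-robin schedule. Here is a minimal sketch assuming weekly shifts; the function name `on_call_schedule`, the team names, and the start date are hypothetical, and alerting tools typically manage this for you.

```python
from datetime import date, timedelta


def on_call_schedule(members, start, weeks):
    """Yield (week_start, member) pairs, rotating through every member in turn."""
    for week in range(weeks):
        yield start + timedelta(weeks=week), members[week % len(members)]


team = ["Ana", "Ben", "Chen"]  # hypothetical team
schedule = list(on_call_schedule(team, date(2024, 1, 1), weeks=6))
for week_start, engineer in schedule:
    print(week_start.isoformat(), engineer)
```

Over six weeks each of the three engineers is on call exactly twice, so the burden is shared evenly.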

We outline a very DevOps-friendly approach to incident management in our Atlassian Incident Handbook.

Incident management tools

Incident management isn’t done with a tool alone, but with the right blend of tools, practices, and people. Here are several of the most common tool categories for effective incident management:

  • Incident tracking: Every incident should be tracked and documented so you can identify trends and make comparisons over time.
  • Chat room: Real-time text communication is key for diagnosing and resolving the incident as a team. And it provides a rich set of data for response analysis later on.
  • Video chat: Video chat complements text chat. For many incidents, a team video call can help the team discuss the findings and map out a response strategy.
  • Alerting system: A tool such as OpsGenie integrates with your monitoring system and manages on-call rotations and escalations.
  • Documentation tool: A tool such as Confluence can capture incident state documents and postmortems.
  • Statuspage: Communicating status with both internal stakeholders and customers through Statuspage helps keep everyone in the loop.