Incident management for high-velocity teams
What is incident management?
Incident management is the process used by development and IT Operations teams to respond to an unplanned event or service interruption and restore the service to its operational state.
At Atlassian, we define an incident as an event that disrupts or reduces the quality of a service and requires an emergency response. Teams who follow ITIL or ITSM practices may use the term major incident for this instead.
Incidents are events of any kind that disrupt or reduce the quality of service (or threaten to do so). A business application going down is an incident. A crawling-but-not-yet-dead web server can be an incident, too: it runs slowly, it interferes with productivity, and it poses the even greater risk of complete failure. Incidents vary widely in severity, from an entire global web service crashing to a small number of users seeing intermittent errors.
An incident is resolved when the affected service resumes functioning in its intended state. Resolution covers only the tasks required to mitigate impact and restore functionality.
The importance of incident management
Incident management is one of the most critical processes an organization needs to get right. Service outages can be costly to the business and teams need an efficient way to respond to and resolve these issues quickly. Teams need a reliable method to prioritize incidents, get to resolution faster, and offer better service for users.
When teams are facing an incident they need a plan that helps them:
- Respond effectively so they can recover fast.
- Communicate clearly to customers, stakeholders, service owners, and others in the organization.
- Collaborate effectively to solve the issue faster as a team and remove barriers that prevent them from resolving the issue.
- Continuously improve to learn from these outages and apply lessons to improve a service and refine their process for the future.
Want to see how Atlassian handles major incidents? We’ve published our internal incident management handbook. Anyone is welcome to learn from it, adapt it, and use it however they see fit.
Types of incident management processes
Different types of companies tend to gravitate toward different incident management processes. No single process is best for every company, so approaches vary widely.
Many teams rely on a more traditional IT-style incident management process, such as those outlined in ITIL certifications. Other teams lean toward a Site Reliability Engineering (SRE) or DevOps-style incident management process.
IT incident management process
An incident management process helps IT teams investigate, record, and resolve service interruptions or outages. The ITIL incident management workflow aims to reduce downtime and minimize impact on employee productivity from incidents. Using templates designed to manage incidents, you can create a repeatable incident management workflow, which ensures teams log, diagnose, and resolve incidents—and have a record of their activities.
The ITIL framework is chiefly used by IT teams running services inside businesses. Typically teams take what they need from ITIL—which covers almost every type of incident and issue and process IT teams might face—and leave the rest. ITIL is great when teams need to focus on cultivating a culture of active troubleshooting. The prescribed processes help teams track incidents and actions in a consistent manner, which improves reporting and analysis, and can lead to a healthier service and a more successful team.
Steps in the IT incident management process
Identify an incident and log it
An incident can come from anywhere: an employee, a customer, a vendor, monitoring systems. No matter the source, the first two steps are simple: someone identifies an incident, then someone logs it. These incident logs (i.e., tickets) typically include:
- The name of the person reporting the incident
- The date and time the incident is reported
- A description of the incident (what is down or not working properly)
- A unique identification number assigned to the incident, for tracking
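The fields listed above map naturally onto a small record type. The following is a minimal sketch, not any particular ticketing tool's schema: the class name `IncidentLog`, the `INC-0001` identifier format, and the in-process counter are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from itertools import count

# Hypothetical ID scheme: a simple in-process counter. A real ticketing
# system would assign identifiers itself.
_next_id = count(1)


@dataclass
class IncidentLog:
    reporter: str      # name of the person reporting the incident
    description: str   # what is down or not working properly
    # Date and time the incident is reported (UTC, set at creation).
    reported_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )
    # Unique identification number, used for tracking.
    incident_id: str = field(
        default_factory=lambda: f"INC-{next(_next_id):04d}"
    )


ticket = IncidentLog(
    reporter="Sam Lee",
    description="Checkout page returns HTTP 500 for some users",
)
print(ticket.incident_id, ticket.reporter, ticket.reported_at.isoformat())
```

Generating the timestamp and identifier at creation time means nobody has to remember to fill them in while the incident is unfolding.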
Assign a logical, intuitive category (and subcategory, as needed) to every incident. This helps you analyze your data for trends and patterns, which is a critical part of effective problem management and preventing future incidents.
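As a rough illustration of how categories support trend analysis, the sketch below counts closed incidents by category and by category/subcategory pair. The category names and the sample data are invented for the example.

```python
from collections import Counter

# Hypothetical (category, subcategory) pairs taken from closed tickets.
closed_incidents = [
    ("network", "vpn"),
    ("email", "delivery"),
    ("network", "dns"),
    ("network", "vpn"),
    ("hardware", "laptop"),
]

# Count incidents per top-level category and per exact pair.
by_category = Counter(category for category, _ in closed_incidents)
by_subcategory = Counter(closed_incidents)

print(by_category.most_common(1))     # [('network', 3)]
print(by_subcategory.most_common(1))  # [(('network', 'vpn'), 2)]
```

A recurring pair such as network/vpn is the kind of pattern that would feed into problem management.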
Every incident must be prioritized. Start by assessing its impact on the business, the number of people who will be impacted, any applicable SLAs, as well as the potential financial, security, and compliance implications of the incident. Compare this incident to all other open incidents to determine its relative priority. As a best practice, define your severity and priority levels before an incident happens, making it simpler for incident managers to gauge priority quickly.
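One way to define priority levels ahead of time is to write the rules down as a small function. This is only a sketch: the `Assessment` fields, the user-count thresholds, and the P1 to P3 labels are assumptions chosen for illustration, and each organization would set its own.

```python
from dataclasses import dataclass


@dataclass
class Assessment:
    users_affected: int
    sla_at_risk: bool
    security_or_compliance_risk: bool


def priority(a: Assessment) -> str:
    """Map an assessment to a priority label (illustrative thresholds)."""
    if a.security_or_compliance_risk or a.users_affected >= 1000:
        return "P1"
    if a.sla_at_risk or a.users_affected >= 50:
        return "P2"
    return "P3"


# Hypothetical open incidents, keyed by ticket ID.
open_incidents = {
    "INC-0001": Assessment(20, sla_at_risk=False, security_or_compliance_risk=False),
    "INC-0002": Assessment(5000, sla_at_risk=False, security_or_compliance_risk=False),
    "INC-0003": Assessment(10, sla_at_risk=True, security_or_compliance_risk=False),
}

# Compare all open incidents: "P1" sorts before "P2", which sorts before "P3".
ranked = sorted(open_incidents, key=lambda k: priority(open_incidents[k]))
print(ranked)  # ['INC-0002', 'INC-0003', 'INC-0001']
```

Because the rules are fixed before anything breaks, an incident manager only has to supply the assessment, not debate the thresholds.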
- Initial diagnosis: Ideally, your front-line support team can see an incident through from diagnosis to closure. If they can’t, the next step is to log all the pertinent information and escalate to the next-tier team.
- Escalate: The next team takes the logged data and continues with the diagnosis process, and, if this next team can’t diagnose the incident, it escalates to the next team.
- Communicate: The team regularly shares updates with impacted internal and external stakeholders.
- Investigation and diagnosis: This continues until the nature of the incident is identified. Sometimes teams bring in outside resources or members of other departments to consult and assist with the resolution.
- Resolution and recovery: The team arrives at a diagnosis and performs the steps needed to resolve the incident. Recovery refers to the time it may take for operations to be fully restored, since some fixes (such as bug patches) may require testing and deployment even after the proper resolution has been identified.
- Closure: If the incident was escalated, it is passed back to the service desk to be closed. To maintain quality and ensure a smooth process, only service desk employees may close incidents. Before closing, the incident owner should confirm with the person who reported the incident that the resolution is satisfactory.
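The diagnose, escalate, and close flow described above can be sketched as a loop over support tiers. The tier names, the `diagnose` callables, and the log messages are hypothetical, and the sketch leaves out stakeholder communication.

```python
from typing import Callable, Optional

# Each tier is a (name, diagnose) pair. diagnose returns a diagnosis
# string, or None when that tier cannot identify the cause.
Tier = tuple[str, Callable[[str], Optional[str]]]


def handle(description: str, tiers: list[Tier]) -> list[str]:
    """Walk the tiers in order and return a log of what happened."""
    log = []
    for name, diagnose in tiers:
        diagnosis = diagnose(description)
        if diagnosis is not None:
            log.append(f"{name}: diagnosed '{diagnosis}', resolving")
            # Closure goes back to the service desk, whichever tier fixed it.
            log.append("service desk: confirmed with reporter, closed")
            return log
        log.append(f"{name}: could not diagnose, escalating")
    log.append("unresolved: bring in outside help")
    return log


tiers = [
    ("tier 1", lambda d: None),
    ("tier 2", lambda d: "expired TLS certificate" if "certificate" in d else None),
]
for entry in handle("Browser shows certificate warning", tiers):
    print(entry)
```

Returning the log as data keeps a record of each hand-off, which matches the goal of tracking incidents and actions consistently.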
DevOps and SRE incident management process
With a DevOps or SRE approach to incident management, the team that builds the service also runs it and fixes it if it breaks. This approach has exploded in popularity alongside the growth of always-on cloud services, globally accessed web applications, microservices, and software as a service.
Increasingly the software you rely on for life and work is not being hosted on a server in the same physical location as you. It’s likely a web-accessed application deployed in a data center for thousands or millions of users around the globe. For teams tasked with running these services, agility and speed are paramount. Any downtime has the potential to affect thousands of organizations, not just one.
An advantage of the “you build it, you run it” approach is that it offers the flexibility agile teams need, but it can also obscure who is responsible for what and when. DevOps teams can be comfortable—and successful—with less structured development processes. But it’s best to standardize on a core set of processes for incident management so there is no question how to respond in the heat of an incident, and so you can track issues and report how they’re resolved.
Three beliefs of DevOps incident management teams
- Take turns being on call: Rather than certain team members specializing in being on call, DevOps teams typically rotate through an on-call schedule where all members share the burden of possibly being woken at night to respond to an incident.
- The engineer who built it is the best person to fix it: The central idea of the “you build it, you run it” ethos is that the people most familiar with the service (the builders) are the ones best equipped to fix an outage.
- Build with speed, but practice accountability: When engineers know that they and their teammates are on the hook during outages, there’s added incentive to make sure you’re deploying quality code.
This approach promotes fast response times and gives faster feedback to the teams who need to know how to build a reliable service.
We outline a very DevOps-friendly approach to incident management in our Atlassian Incident Handbook.
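To make the idea of taking turns on call concrete, here is a minimal round-robin sketch. The function name `on_call`, the weekly shift length, and the team members are assumptions for illustration; real schedules also handle overrides, time zones, and escalation.

```python
from datetime import date, timedelta


def on_call(team: list[str], rotation_start: date, day: date,
            shift_days: int = 7) -> str:
    """Return who is on call on `day` in a simple round-robin rotation."""
    if day < rotation_start:
        raise ValueError("day is before the rotation started")
    shifts_elapsed = (day - rotation_start).days // shift_days
    return team[shifts_elapsed % len(team)]


team = ["Ana", "Ben", "Chen"]   # hypothetical team members
start = date(2024, 1, 1)        # a Monday
for week in range(4):
    day = start + timedelta(weeks=week)
    print(day.isoformat(), on_call(team, start, day))
```

With three people and weekly shifts, the fourth week wraps back to the first person, so everyone carries the same share of the load.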
Incident management tools
Incident management isn’t done with a tool alone, but with the right blend of tools, practices, and people. Here are several of the most common tool categories for effective incident management:
- Incident tracking: Every incident should be tracked and documented so you can identify trends and make comparisons over time.
- Chat room: Real-time text communication is key for diagnosing and resolving the incident as a team. It also provides a rich set of data for response analysis later on.
- Video chat: Video chat complements text chat for many incidents. A team video call can help responders discuss findings and map out a response strategy.
- Alerting system: A tool such as Jira Service Management integrates with your monitoring system and manages on-call rotations and escalations.
- Documentation tool: A tool such as Confluence can capture incident state documents and postmortems.
- Statuspage: Communicating status with both internal stakeholders and customers through Statuspage helps keep everyone in the loop.