Incident management for high-velocity teams
What is incident management?
Incident management is the process used by development and IT Operations teams to respond to an unplanned event or service interruption and restore the service to its operational state.
At Atlassian, we define an incident as an event that disrupts a service or reduces its quality, and that requires an emergency response. Teams that follow ITIL or ITSM practices may use the term "major incident" for this instead.
Incidents are events of any kind that disrupt or reduce the quality of service (or threaten to do so). A business application going down is an incident. A crawling-but-not-yet-dead web server can be an incident, too: it runs slowly, interferes with productivity, and, worse yet, poses the even greater risk of complete failure. Incidents vary widely in severity, from an entire global web service crashing to a small number of users experiencing intermittent errors.
An incident is resolved when the affected service resumes functioning in its intended state. Resolution covers only the tasks required to mitigate impact and restore functionality.
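To make the definition concrete, here is a minimal sketch of how a team might model an incident in code. The `Incident` and `Severity` names, the two severity levels, and the `service_restored` field are all hypothetical and chosen only for illustration. They are not part of any Atlassian product or API.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    # Hypothetical levels; the text says only that severity varies widely.
    MAJOR = 1  # e.g. an entire global web service has crashed
    MINOR = 2  # e.g. a small number of users see intermittent errors


@dataclass
class Incident:
    summary: str
    severity: Severity
    # Set to True once the service is functioning in its intended state.
    service_restored: bool = False

    def is_resolved(self) -> bool:
        # Resolution depends only on restoring the service,
        # not on any follow-up work that may come later.
        return self.service_restored


incident = Incident("Web server responding slowly", Severity.MINOR)
print(incident.is_resolved())  # False

incident.service_restored = True
print(incident.is_resolved())  # True
```

The sketch ties resolution to a single condition, restored service, which mirrors the definition above: an incident counts as resolved as soon as impact is mitigated and functionality is back.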
Incident management topics
Want to see how Atlassian handles major incidents? We’ve published our internal incident management handbook. Anyone is welcome to learn from it, adapt it, and use it however they see fit.
Setting up an on-call schedule with Opsgenie
In this tutorial, you’ll learn how to set up an on-call schedule, apply override rules, configure on-call notifications, and more, all within Opsgenie.
Pros and cons of different approaches to on-call management
On-call teams are rapidly evolving. Explore the pros and cons of different approaches to on-call management.