Update: Check out the new Atlassian Incident Management Handbook to see our full process for responding to incidents.
If you’ve been following along with the ITSM Bootcamp, you’ll know that we recently covered best practices for Service Request Management. Today, we’re diving into incident management with your IT service desk.
Incident management is one of the most critical IT support processes that an IT organization needs to get right. Service outages can be costly to the business, and IT teams need an efficient way to respond to and resolve these issues quickly. According to a 2015 HDI study, incident management remains a top priority for 65% of IT teams around the world.
In the coming weeks, we’ll also cover the rest of the core IT processes. Stay updated on all things ITSM by subscribing to the ITSM Bootcamp.
Incident management 101
Here’s how the IT Infrastructure Library (ITIL) defines an incident: an unplanned interruption to an IT service, or a reduction in the quality of an IT service. Incident management is an IT service management (ITSM) process that aims to restore normal service operation as quickly as possible, minimizing the impact on the business and end users. A business application going down is an incident. So is a printer that isn’t working.
“A crawling-but-not-yet-dead web server can be an incident, too. It’s running slowly, and interfering with productivity. Worse yet, it poses the even greater risk of complete failure.” – Nick Wright, Service Operations Manager at Atlassian
Incident Management vs. Problem Management: A problem is the not-yet-known root cause behind one or more incidents. Suppose the printer is down and the network is crawling: a misconfigured router could be the underlying problem behind both. Incident management focuses on short-term solutions, doing whatever is necessary to restore the service rather than completing a root cause analysis to find out why the incident occurred. We’ll talk about managing recurring incidents (underlying problems) in the problem management blog.
Mean Time to Resolution (MTTR)
Mean Time to Resolution (MTTR) measures the average time service teams take to resolve an incident, counted from when it was first reported. MTTR is one of the key drivers of customer satisfaction, because users may be completely blocked or forced to rely on workarounds until their incidents are resolved. Normally, this metric counts only time that falls within regular business hours, so service teams can accurately track the average number of working hours they take to satisfactorily resolve an incident.
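To make the business-hours rule concrete, here’s a minimal Python sketch that computes MTTR from a list of (reported, resolved) timestamps. It assumes business hours of 09:00 to 17:00, Monday to Friday, no public holidays, and timestamps in a single timezone; the function names are hypothetical and not part of any Atlassian product.

```python
from datetime import datetime, time, timedelta

# Assumption: business hours are 09:00 to 17:00, Monday to Friday, no holidays.
BUSINESS_START = time(9)
BUSINESS_END = time(17)


def business_hours_between(start: datetime, end: datetime) -> float:
    """Count the working hours between two naive timestamps in the same timezone."""
    if end <= start:
        return 0.0
    total = timedelta()
    day = start.date()
    while day <= end.date():
        if day.weekday() < 5:  # Monday=0 ... Friday=4; weekends are skipped
            open_at = datetime.combine(day, BUSINESS_START)
            close_at = datetime.combine(day, BUSINESS_END)
            # Overlap between the incident window and this day's business window.
            overlap_start = max(start, open_at)
            overlap_end = min(end, close_at)
            if overlap_end > overlap_start:
                total += overlap_end - overlap_start
        day += timedelta(days=1)
    return total.total_seconds() / 3600


def mttr_business_hours(incidents: list[tuple[datetime, datetime]]) -> float:
    """Average business-hour duration from report to resolution."""
    if not incidents:
        raise ValueError("MTTR is undefined for an empty list of incidents")
    durations = [business_hours_between(reported, resolved) for reported, resolved in incidents]
    return sum(durations) / len(durations)


if __name__ == "__main__":
    incidents = [
        (datetime(2024, 3, 4, 10), datetime(2024, 3, 4, 14)),   # Monday: 4 business hours
        (datetime(2024, 3, 8, 16), datetime(2024, 3, 11, 10)),  # Friday to Monday: 2 business hours
    ]
    print(mttr_business_hours(incidents))  # 3.0
```

The second incident spans 66 hours of wall-clock time but only 2 business hours, which is why teams agree on the business-hours convention before comparing MTTR across incidents.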
Consequently, improving major incident response is one of the top goals for IT teams, specifically finding ways to lower MTTR and streamline the search for the root cause to prevent future outages. The diagram below outlines what’s included in MTTR. A Forrester study found that most of that time is spent in the Investigation and Diagnosis phase. In fact, this phase takes 70% of the time, because IT teams find it difficult to collaborate and share the insights that would lead them to a resolution quickly.
Incident Management Priorities
So what are the key areas and priorities for incident management for IT teams?
- Respond effectively: define who is accountable for the incident so the team can recover fast.
- Communicate clearly with stakeholders: service owners, others within the organization, and ultimately customers.
- Collaborate effectively: remove the barriers that keep people from sharing information, so the team can solve the issue faster.
- Continuously improve: learn from each outage and apply the lessons to improve the service or refine the process itself.
Statuspage: While every team uses different solutions for communication, we recommend a dedicated tool like Statuspage for incident communication. A dedicated tool provides a central source of truth for the current status of an incident, as well as a record of past incident communication. Stakeholders can customize how they want to receive Statuspage updates, whether over email, text message, or a ChatOps tool like Hipchat.
Incident Management Process
An incident management process helps service desks investigate, record, and resolve service interruptions or outages. An ITIL incident management workflow aims to reduce downtime and its negative impact on the business. The IT Service Desk template in Jira Service Desk comes with an incident management workflow, which ensures that you log, diagnose, and resolve incidents. We recommend you start with this workflow and adapt it to your business needs. When managed well, incident records can reveal missing service requirements, potential improvements, and training needs for team members.
The ITIL incident management process, in brief:
1. Service end users, monitoring systems, or internal IT members report interruptions.
2. The service desk describes and logs the incident, linking together all reports related to the same service interruption.
3. The service desk records the date and time, reporter name, and a unique ID for the incident. Jira Service Desk does this automatically.
4. A service desk agent labels the incident with an appropriate category. The team uses these categories during post-incident reviews and for reporting.
5. A service desk agent prioritizes the incident based on impact and urgency.
6. The team diagnoses the incident, the services affected, and possible solutions.
7. Agents communicate with incident reporters to help complete this diagnosis.
8. If needed, the service desk team escalates the incident to second-line support representatives: the people who work regularly on the affected systems.
9. The service desk resolves the service interruption and verifies that the fix is successful. The resolution is fully documented for future reference.
10. The service desk closes the incident.
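Steps 2 through 5 can be sketched as a small data model. This Python example is hypothetical: the `Incident` class, the `INC-` ID format, and the 3x3 impact/urgency matrix (1 = high, 3 = low) are illustrative assumptions, not Jira Service Desk’s data model or an official ITIL table. It shows logging and linking related reports (step 2), an automatic ID and timestamp (step 3), a category (step 4), and a priority derived from impact and urgency (step 5).

```python
import itertools
from dataclasses import dataclass, field
from datetime import datetime

# Hypothetical matrix: impact and urgency each run from 1 (high) to 3 (low).
# P1 is the most severe priority; this mapping is illustrative, not an ITIL default.
PRIORITY_MATRIX = {
    (1, 1): "P1", (1, 2): "P2", (1, 3): "P3",
    (2, 1): "P2", (2, 2): "P3", (2, 3): "P4",
    (3, 1): "P3", (3, 2): "P4", (3, 3): "P5",
}

# In-memory counter standing in for the ID a service desk tool would assign.
_next_id = itertools.count(1)


@dataclass
class Incident:
    reporter: str
    description: str
    category: str  # step 4: used later for post-incident reviews and reporting
    impact: int
    urgency: int
    # Step 3: unique ID and timestamp are filled in automatically.
    incident_id: str = field(default_factory=lambda: f"INC-{next(_next_id)}")
    reported_at: datetime = field(default_factory=datetime.now)
    # Step 2: other reports about the same interruption are linked here.
    related_reports: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        if (self.impact, self.urgency) not in PRIORITY_MATRIX:
            raise ValueError("impact and urgency must each be 1, 2, or 3")

    @property
    def priority(self) -> str:
        # Step 5: priority derived from impact and urgency.
        return PRIORITY_MATRIX[(self.impact, self.urgency)]

    def link_report(self, report: str) -> None:
        self.related_reports.append(report)


if __name__ == "__main__":
    outage = Incident("alice", "Payroll app returns errors", "application",
                      impact=1, urgency=1)
    outage.link_report("bob: payroll app is down for me too")
    print(outage.incident_id, outage.priority, len(outage.related_reports))
```

Keeping the matrix as data rather than logic makes it easy to tune the mapping to your own service levels without touching the rest of the record.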
Pro tip: Want to learn more about nailing the incident management process? Nick Wright, Service Operations Manager at Atlassian, walks through a detailed example here.
Incident Management with Jira Service Desk
Using request types, you can associate incident reports with an issue type called Incident. This puts the incident record into the recommended incident workflow, which follows the basic process above. You can customize it to fit the needs of your business.
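A workflow’s job is to keep each incident moving through the process in order. A minimal Python state machine can illustrate the idea; the status names and allowed transitions below are hypothetical, chosen to mirror the ten ITIL steps rather than the exact statuses in Jira Service Desk’s incident workflow.

```python
from enum import Enum


class Status(Enum):
    OPEN = "Open"                    # steps 1-5: reported, logged, categorized, prioritized
    INVESTIGATING = "Investigating"  # steps 6-7: diagnosis with the reporter
    ESCALATED = "Escalated"          # step 8: handed to second-line support
    RESOLVED = "Resolved"            # step 9: fix applied, awaiting verification
    CLOSED = "Closed"                # step 10


# Hypothetical transitions; not Jira Service Desk's actual workflow definition.
ALLOWED_TRANSITIONS = {
    Status.OPEN: {Status.INVESTIGATING},
    Status.INVESTIGATING: {Status.ESCALATED, Status.RESOLVED},
    Status.ESCALATED: {Status.RESOLVED},
    Status.RESOLVED: {Status.CLOSED, Status.INVESTIGATING},  # reopen if the fix fails
    Status.CLOSED: set(),
}


def transition(current: Status, target: Status) -> Status:
    """Return the new status, or raise if the workflow doesn't allow the move."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"cannot move incident from {current.value} to {target.value}")
    return target


if __name__ == "__main__":
    status = Status.OPEN
    for step in (Status.INVESTIGATING, Status.ESCALATED, Status.RESOLVED, Status.CLOSED):
        status = transition(status, step)
        print(status.value)
```

Allowing Resolved to move back to Investigating models the verification in step 9: if the fix doesn’t hold, the incident returns to diagnosis instead of being closed.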
Want more expert tips for better incident management? Check out this blog. Want updates on these blogs and more ITSM content? Subscribe to the ITSM Bootcamp and get best practices, tips, and resources delivered directly to your inbox.