Incident management for high-velocity teams
Nailing the incident management process like an IT Operations pro
By Nick Wright, Atlassian Service Operations Manager
First, let me get something off my chest: front-line support people are the unsung heroes of every business.
I truly believe that tech support should be considered a service industry, and customers should be able to leave tips for agents who deliver excellent service. I would happily leave a tip for every killer support person who resolved my issues quickly and with a smile, if only I could.
But I digress. If you’re reading this, you probably manage or serve on a help desk team. Your hair is probably also on fire right now. It really burns. The smell is awful, too. So let’s do something about that—and get your IT incident management process under control.
Before we dive into incident management, though, let’s get on the same page about some common terminology.
ITSM and incident management
If you work in IT, you’re likely familiar with ITIL, ITSM, incidents, and problems. But for the sake of getting everyone on the same page, here are some quick definitions as we use them at Atlassian:
ITIL (Information Technology Infrastructure Library) is a set of best practices for ITSM (think of it as a playbook).
ITSM (IT service management) is a common approach to creating, supporting, and managing IT services. The core concept of ITSM is the belief that IT should be delivered as a service. And one of the core practices of ITSM is incident management.
Incidents are unplanned events of any kind that disrupt or reduce the quality of service (or threaten to do so). A business application going down is an incident. A crawling-but-not-yet-dead web server can be an incident, too. It’s running slowly and interfering with productivity. Worse yet, it poses the even-greater risk of complete failure.
A problem is the not-yet-known root cause behind one or more incidents. In the examples above, where a web server is crawling and a business application is down, a misconfigured router could be the underlying problem behind both.
The importance of incident management as an ITSM practice
So, why incident management? Why is this even a part of the ITSM universe?
The answer is in the impact. Research says major incidents cost companies an average of anywhere from $100,000 to $300,000 for every hour a system is down.
Having a well-defined incident management process can help reduce those costs dramatically. Benefits of a well-defined process include:
- Faster incident resolution
- Reduced costs or revenue losses for the organization
- Better communication—both internal and external—during incidents
- Continuous learning and improvement
The incident management process
I’ll be using the ITIL framework to walk you through a high-level overview of proper ticket handling, but most other popular frameworks spell out roughly similar concepts through slightly different lingo.
The key to incident management is having a process–a good one–and sticking to it.
Even that can seem daunting, I know. But the good news is that you can learn from thousands of other IT service teams' experiences.
One of the top mistakes of busy, growing IT organizations is to try to reinvent the wheel and create processes from scratch (without drawing on best practices) or build their own homegrown tools for fielding tickets.
Identify an incident and log it
An incident can come from anywhere. An employee can call you to report it, or it can literally fall through the ceiling tile and land in your lap, in the case of an ill-placed network hub and a leaky roof. (Not that I'm speaking from experience... *ahem*)
No matter the source, the first two steps are simple: someone identifies an incident, then someone logs it.
If you receive the incident already logged via your service desk, these first two steps are already done for you. If you get a phone call or the incident is reported via email or text or carrier pigeon, it’s the service desk team’s job to properly log it in your service desk.
These incident logs (i.e., tickets) typically include:
- The name of the person reporting the incident
- The date and time the incident is reported
- A description of the incident (what is down or not working properly)
- A unique identification number assigned to the incident, for tracking
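As a rough illustration, the four fields above could be captured in a record like the one below. This is a minimal sketch, not how any particular service desk stores tickets: the `Incident` class, the `log_incident` helper, and the `INC-` numbering scheme are all hypothetical names made up for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from itertools import count

# Hypothetical ID scheme: a simple in-process counter.
# A real service desk tool would assign the tracking number itself.
_next_id = count(1)


@dataclass
class Incident:
    reporter: str      # name of the person reporting the incident
    description: str   # what is down or not working properly
    # date and time the incident is reported (UTC)
    reported_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )
    # unique identification number, used for tracking
    incident_id: str = field(
        default_factory=lambda: f"INC-{next(_next_id):05d}"
    )


def log_incident(reporter: str, description: str) -> Incident:
    """Create a ticket, refusing entries that lack the basic details."""
    if not reporter.strip() or not description.strip():
        raise ValueError("reporter and description are both required")
    return Incident(reporter=reporter.strip(), description=description.strip())


first = log_incident("Dana", "Web server is responding very slowly")
second = log_incident("Lee", "Business application is down")
print(first.incident_id, second.incident_id)
```

The point of the validation step is simply that a ticket with no reporter or no description is not much use to whoever picks it up next.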
Categorize your incident
These next two steps–categorize and prioritize–are both critical and commonly overlooked. They also separate the more “sane” service desks I’ve worked with from the...well, not so much.
First, you must assign a logical, intuitive category (and subcategory, as needed) to every incident. If you don’t, you’re cutting off your ability to later analyze your data and look for trends and patterns, which is a critical part of effective problem management and preventing future incidents.
So basically, just don’t forget. And don’t settle for an IT service desk solution that doesn’t allow you to easily customize incident categories.
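One way to keep categories consistent is to check each ticket against a fixed list at logging time. The sketch below assumes a made-up taxonomy (`CATEGORIES`) and a hypothetical `categorize` helper; your own categories will look different.

```python
from typing import Optional, Tuple

# Hypothetical taxonomy: category -> allowed subcategories.
CATEGORIES = {
    "network": {"router", "switch", "wifi"},
    "application": {"outage", "performance", "login"},
    "hardware": {"laptop", "printer"},
}


def categorize(
    category: str, subcategory: Optional[str] = None
) -> Tuple[str, Optional[str]]:
    """Normalize and validate a category (and subcategory, if given)."""
    cat = category.strip().lower()
    if cat not in CATEGORIES:
        raise ValueError(f"unknown category: {category!r}")
    if subcategory is None:
        return cat, None
    sub = subcategory.strip().lower()
    if sub not in CATEGORIES[cat]:
        raise ValueError(f"unknown subcategory for {cat}: {subcategory!r}")
    return cat, sub


print(categorize("Network", "Router"))
```

Rejecting free-text categories up front is what makes the data usable later: "Router", "router ", and "ROUTER" all end up counted as the same thing.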
Prioritize your incident
Second, every incident must be prioritized.
To prioritize an incident, start by assessing its impact on the business. Consider both the number of people affected and the potential financial, security, and compliance implications to determine how much pain the incident is causing and how urgently the business needs a resolution.
The best practice here is to define your severity and priority levels before an incident happens, making it simpler for incident managers to gauge priority quickly.
And when in doubt about priority? Go with the higher priority level. Better to err on the side of caution than to let something severe fall through the cracks.
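A common way to predefine priority levels is an impact-by-urgency matrix. The sketch below is one assumed version of that idea: the three-step `Level` scale, the four priority levels (1 is most severe), and the scoring thresholds are illustrative choices, not a standard.

```python
from enum import IntEnum


class Level(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


def priority(impact: Level, urgency: Level) -> int:
    """Map impact and urgency to a priority from 1 (most severe) to 4."""
    score = impact + urgency  # ranges from 2 to 6
    if score >= 6:
        return 1
    if score == 5:
        return 2
    if score == 4:
        return 3
    return 4


def priority_when_unsure(candidates: list) -> int:
    """When in doubt, go with the more severe (numerically lowest) level."""
    return min(candidates)


print(priority(Level.HIGH, Level.MEDIUM))
print(priority_when_unsure([3, 2]))
```

Because the mapping is decided ahead of time, the person handling the ticket only has to judge impact and urgency, not argue about what "P2" means in the middle of an outage.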
Once you’ve set those priorities, address all open incidents in order of prioritization. Most organizations set clear service agreements around each level of priority, so customers know how quickly to expect a response and resolution. I highly recommend that practice.
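To make "every priority level gets a service agreement" concrete, here is a small sketch that attaches response and resolution targets to each level and orders the open queue accordingly. The target durations in `SLA_TARGETS` are placeholder values I picked for the example, and `OpenIncident`, `work_queue`, and `is_breached` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Assumed targets per priority: (time to first response, time to resolution).
SLA_TARGETS = {
    1: (timedelta(minutes=15), timedelta(hours=4)),
    2: (timedelta(hours=1), timedelta(hours=8)),
    3: (timedelta(hours=4), timedelta(days=2)),
    4: (timedelta(days=1), timedelta(days=5)),
}


@dataclass
class OpenIncident:
    incident_id: str
    priority: int
    reported_at: datetime

    @property
    def resolve_by(self) -> datetime:
        """Deadline implied by the resolution target for this priority."""
        return self.reported_at + SLA_TARGETS[self.priority][1]


def work_queue(incidents):
    """Most severe priority first; within a priority, oldest first."""
    return sorted(incidents, key=lambda i: (i.priority, i.reported_at))


def is_breached(incident: OpenIncident, now: datetime) -> bool:
    return now > incident.resolve_by


now = datetime(2024, 1, 1, 12, 0)
incidents = [
    OpenIncident("INC-3", priority=3, reported_at=datetime(2024, 1, 1, 8, 0)),
    OpenIncident("INC-1", priority=1, reported_at=datetime(2024, 1, 1, 11, 0)),
    OpenIncident("INC-2", priority=1, reported_at=datetime(2024, 1, 1, 7, 0)),
]
for inc in work_queue(incidents):
    status = "breached" if is_breached(inc, now) else "within SLA"
    print(inc.incident_id, status)
```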
Incident response is a pretty broad term, so let’s break it down a bit further into the most likely steps you'll perform once you’ve identified, categorized, and prioritized an incident.
Initial diagnosis works like the triage a hospital performs on new patients. The service desk employee forms a quick hypothesis about what is likely wrong, so they can either set about fixing it or follow the appropriate procedures and gather the right resources to get it resolved.
Knowledge bases and diagnostic manuals are helpful tools at this step, too.
If the first-level service desk agent can resolve the incident based on their own initial diagnosis and the available knowledge and tools, the incident is resolved. Otherwise, it’s time to escalate.
Escalation sounds like a bad word, but it’s not.
Your front-line support team should be able to resolve a large number of the most frequent incidents without escalating. But for those they can’t, the goal is to gather and log the right information to help second and third-level (more technical) support get up to speed quickly, so they can resolve the incident promptly.
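The escalation rule described here (hand the ticket up a level, but only with findings attached) might look something like the sketch below. The three-tier limit, the `Ticket` class, and the `escalate` function are assumptions made for illustration.

```python
from dataclasses import dataclass, field

# Assumed structure: tier 1 is the front line, tiers 2 and 3 are more technical.
MAX_TIER = 3


@dataclass
class Ticket:
    incident_id: str
    tier: int = 1
    notes: list = field(default_factory=list)


def escalate(ticket: Ticket, findings: str) -> Ticket:
    """Hand the ticket up one tier, recording what was already found."""
    if not findings.strip():
        raise ValueError("log your findings before escalating")
    if ticket.tier >= MAX_TIER:
        raise ValueError("already at the highest support tier")
    ticket.notes.append(f"tier {ticket.tier}: {findings.strip()}")
    ticket.tier += 1
    return ticket


ticket = Ticket("INC-00042")
escalate(ticket, "Restarted the service; error persists in application log")
print(ticket.tier, ticket.notes)
```

Refusing an escalation with empty findings is a deliberate choice in this sketch: it nudges the front line to pass along what they learned so the next tier does not start from zero.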
Investigation and diagnosis
ITIL calls this out as its own single step. In reality, it happens throughout the incident lifecycle.
Your front-line support person is already investigating, to an extent, when they collect information, and may even diagnose and resolve the incident without any escalation.
In that case, you’ve skipped straight to the final steps: resolution and recovery, then incident closure.
Otherwise, investigation and diagnosis will happen at every step of the way as you escalate to level 2 and 3 support or bring outside resources or other department members in to consult and assist with the resolution.
Resolution and recovery
Eventually, and ideally within your established service level agreements (SLAs), you will arrive at a diagnosis and perform the necessary steps to resolve the incident. Recovery refers to the time it may take for operations to be fully restored, since some fixes (like bug patches) may require testing and deployment even after the proper resolution has been identified.
The incident is then passed back to the service desk (if it was escalated) to be closed. To maintain quality and ensure a smooth process, only service desk employees are allowed to close incidents, and the incident owner should check with the person who reported the incident to confirm that the resolution is satisfactory and the incident can, in fact, be closed.
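The two closure rules above (only the service desk closes, and only after the reporter confirms) can be expressed as simple guard checks. This is a hedged sketch: the `Resolution` record, the `"service_desk"` role string, and `close_incident` are hypothetical and not tied to any specific tool.

```python
from dataclasses import dataclass


@dataclass
class Resolution:
    incident_id: str
    resolved: bool = False            # a fix has been applied
    reporter_confirmed: bool = False  # the reporter agrees it is fixed
    status: str = "open"


def close_incident(res: Resolution, closed_by_role: str) -> Resolution:
    """Close an incident only when the closure rules are satisfied."""
    if closed_by_role != "service_desk":
        raise PermissionError("only the service desk may close incidents")
    if not res.resolved:
        raise ValueError("incident has not been resolved yet")
    if not res.reporter_confirmed:
        raise ValueError("reporter has not confirmed the resolution")
    res.status = "closed"
    return res


res = Resolution("INC-00042", resolved=True, reporter_confirmed=True)
print(close_incident(res, "service_desk").status)
```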
Conclusion: don’t skip steps
The process may seem unnecessarily formal, particularly if you only have a few service desk analysts. Regardless of your team structure, though, the incident lifecycle is still the same.
Let’s say you only have one service desk analyst, so there is no formal level two or level three support. But incidents that surpass the knowledge of your service desk analyst have to go somewhere, whether it’s to your chief engineer or an outside consultant or even you, right?
Voila! You do indeed have level two or level three support—it’s just you or your engineer.
My point? Even though ITIL can seem to be all about semantics, don’t get caught up in them. Look for simple ways to adapt your organizational hierarchy and process workflows to fit a straightforward IT service management framework like the one I outlined above.
By doing so, you will deliver far better customer service, and much more value back to the business. (Plus, your hair will stop burning much faster - bonus points!)
Finally, a few reminders:
- Log every incident. Give it a unique number. And capture important details (like date, time, and description) in a central help desk system.
- If you have a large internal or external audience to communicate incident updates to, consider a status page for incident communication.
- Assign every incident a category (and subcategory, as needed).
- Give every incident a priority level and every priority level an SLA.
- Have clearly defined roles for incident responders, like incident commander, major incident manager, and communications lead.
- Whenever possible, enable your front-line support team with knowledge base articles and incident diagnostic scripts to help them resolve incidents quickly.
- Make sure the service desk always retains control of incident progress, routing, and status.
- And don’t just capture incident data. Analyze it! Look for trends, patterns, and potential underlying problems that can reduce incident volume and mitigate risk.
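As a small example of that last reminder, counting incidents by category is often enough to surface a candidate underlying problem. The sketch below uses made-up sample data and a hypothetical `recurring_categories` helper; the threshold of three is an arbitrary choice.

```python
from collections import Counter


def recurring_categories(incidents, threshold=3):
    """Return (category, subcategory) pairs seen at least `threshold` times."""
    counts = Counter((i["category"], i["subcategory"]) for i in incidents)
    return [(pair, n) for pair, n in counts.most_common() if n >= threshold]


# Made-up sample data for illustration only.
incidents = [
    {"id": "INC-1", "category": "network", "subcategory": "router"},
    {"id": "INC-2", "category": "application", "subcategory": "performance"},
    {"id": "INC-3", "category": "network", "subcategory": "router"},
    {"id": "INC-4", "category": "network", "subcategory": "router"},
    {"id": "INC-5", "category": "hardware", "subcategory": "printer"},
]

for (category, subcategory), n in recurring_categories(incidents):
    print(f"{n} incidents in {category}/{subcategory}: possible underlying problem")
```

A repeated category is only a hint, not a diagnosis, but it tells you where to start looking.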
About the author
Service Operations Manager, Atlassian
My team and I make sure Atlassian's cloud applications and infrastructure are performing top notch, and I'm keen to share how we do it while scaling fast. I'm a Kiwi, but despite that linguistic handicap, I can still pronounce Fish and Chips. Outside of work, I'm either cycling, gaming, or hanging out with my wife and lovely little girl.