Update: Check out the new Atlassian Incident Management Handbook to see our full process for responding to incidents.
First, let me get something off my chest: front-line support people are the unsung heroes of every business. Every. Business. I truly believe that tech support should be considered a service industry, and customers should be able to leave tips for agents that deliver excellent service. I would happily leave a ten spot to every killer support person who resolved my issues quickly, and with a smile–if only I could.
But I digress. If you’re reading this, you probably manage or serve on a small to mid-sized help desk team. Your hair is probably also literally on fire right now. It really burns. The smell is awful, too. So let’s do something about that!
In this series, I'll tackle incident management in two parts, covering basic principles, advanced tips, and technology how-to’s, all in the name of helping you adopt a better incident management process quickly and painlessly. We’ll break it down this way:
- Part one: Nail the process
- Part two: Five expert tips to manage incidents
Before we dive into process, though, let’s get some basic terminology out of the way.
Incidents are just unplanned events of any kind that disrupt or reduce the quality of service (or threaten to do so). A business application going down is an incident. A crawling-but-not-yet-dead web server can be an incident, too. It’s running slowly, and interfering with productivity. Worse yet, it poses the even greater risk of complete failure.
A problem is just the not-yet-known root cause behind one or more incidents. In the examples above, where the business application is down and the web server is crawling, a misconfigured router could be the underlying problem behind both.
How to handle incidents (and how not to)
I’m using ITIL® to walk you through a high-level overview of proper ticket handling, but most other popular frameworks spell out roughly similar concepts through slightly different lingo.
If you are new to ITIL®, no worries. It’s just a flexible set of guidelines you can follow as you build your own process, based on a ton of real world experience. COBIT, TOGAF, or any other major IT framework can get you to a similar place. I recommend familiarizing yourself with the basic concepts in the beginning to give you an early advantage. ("The more you knowwww...")
The key to incident management is having a process–a good one–and sticking to it.
Even that can seem daunting, I know. But the good news is that you can learn from thousands of other service desk teams' experiences. One of the top mistakes of busy, growing IT organizations is to try to reinvent the wheel and create processes from scratch (without drawing on best practices), or build their own homegrown tools for fielding tickets. The very act of reading this drastically reduces your risk of falling victim to Not Invented Here Syndrome. Nice one.
Anyway, let's get down to it.
Identify an incident and log it
An incident can come from anywhere. An employee can call you to report it, or it can literally fall through the ceiling tile and land in your lap, in the case of an ill-placed network hub and a leaky roof. (Not that I'm speaking from experience... *ahem*) No matter the source, the first two steps are simple: someone identifies an incident, then someone logs it. If you receive the incident already logged via your self-service help desk, these first two steps are already done for you. If you get a phone call or the incident is reported via email or text or courier pigeon, it’s the service desk team’s job to properly log it in your help desk system. These incident logs (i.e., tickets) typically include:
- The name of the person reporting the incident
- The date and time the incident is reported
- A description of the incident (what is down or not working properly)
- A unique identification number assigned to the incident, for tracking
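To make those four fields concrete, here is a minimal sketch of what a logged incident might look like as a data record. The `IncidentTicket` class, the `log_incident` helper, and the `INC-0001` numbering scheme are all hypothetical names for illustration; a real help desk tool assigns its own IDs and stores far more than this.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from itertools import count

# Hypothetical in-memory counter; a real service desk tool issues its own IDs.
_next_number = count(1)


@dataclass
class IncidentTicket:
    reporter: str      # name of the person reporting the incident
    description: str   # what is down or not working properly
    # Date and time the incident is reported (UTC, set at logging time).
    reported_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )
    # Unique identification number, used for tracking.
    ticket_id: str = field(
        default_factory=lambda: f"INC-{next(_next_number):04d}"
    )


def log_incident(reporter: str, description: str) -> IncidentTicket:
    """Create a ticket for an incident that arrived by phone, email, etc."""
    return IncidentTicket(reporter=reporter, description=description)
```

Calling `log_incident("Dana", "Web server is responding slowly")` returns a ticket with the timestamp and tracking number filled in automatically, which is the point: nobody should have to remember to add them by hand.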
Not logging incidents yet? Stop reading here, sign up for any basic service desk solution online (may I suggest Jira Service Desk?), and then resume reading. Go ahead. I'll wait.
These next two steps–categorize and prioritize–are both critical and commonly overlooked. They also separate the more “sane” service desks I’ve worked with from the... well, not so much.
First, you must assign a logical, intuitive category (and subcategory, as needed) to every incident. If you don’t, you are cutting off your ability to later analyze your data and look for trends and patterns, which is a critical part of effective problem management and preventing future incidents. So basically, just don’t forget. And don’t settle for an IT service desk solution that doesn’t allow you to easily customize incident categories.
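As a rough illustration of why categories pay off later, here is a small sketch that counts incidents by category and subcategory to surface the most common ones. The sample data, field names, and function names are made up for the example; in practice you would pull these records from your help desk system's reports or export.

```python
from collections import Counter

# Made-up sample records; real ones would come from your help desk system.
sample_incidents = [
    {"id": "INC-0001", "category": "Network", "subcategory": "Router"},
    {"id": "INC-0002", "category": "Software", "subcategory": "Email"},
    {"id": "INC-0003", "category": "Network", "subcategory": "Router"},
    {"id": "INC-0004", "category": "Hardware", "subcategory": "Laptop"},
    {"id": "INC-0005", "category": "Network", "subcategory": "Router"},
    {"id": "INC-0006", "category": "Software", "subcategory": "Email"},
]


def count_by_category(incidents):
    """Count incidents per (category, subcategory) pair."""
    return Counter(
        (incident["category"], incident.get("subcategory"))
        for incident in incidents
    )


def top_categories(incidents, n=3):
    """Return the n most frequent (category, subcategory) pairs with counts."""
    return count_by_category(incidents).most_common(n)
```

A cluster like three router incidents in a short window is exactly the kind of pattern that hints at an underlying problem. Without a category on every ticket, there is nothing to count.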
Second, every incident must be prioritized. To prioritize an incident, start by assessing its impact on the business. Consider both the number of people who will be impacted and the potential financial, security, and compliance implications of the incident to determine how much pain the incident is causing and how urgent a resolution is to the business.
Then, address all open incidents in order of prioritization. Most organizations set clear service agreements around each level of priority, so customers know how quickly to expect a response and resolution. I highly recommend that practice.
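One common way to turn impact and urgency into a priority is a simple matrix. The sketch below assumes three levels for each (high, medium, low) and a five-step priority scale, and the SLA targets are placeholder numbers I made up for illustration. Your own levels and targets should come from your service agreements.

```python
from datetime import datetime, timedelta

# Assumed three-step scales; use whatever levels your organization defines.
LEVELS = {"high": 1, "medium": 2, "low": 3}

# Hypothetical resolution targets per priority level (placeholders only).
SLA_TARGETS = {
    1: timedelta(hours=1),
    2: timedelta(hours=4),
    3: timedelta(hours=8),
    4: timedelta(days=1),
    5: timedelta(days=3),
}


def priority(impact, urgency):
    """Combine impact and urgency into a priority from 1 (most critical) to 5."""
    return LEVELS[impact] + LEVELS[urgency] - 1


def resolution_due(incident):
    """Return the time by which the incident should be resolved under its SLA."""
    level = priority(incident["impact"], incident["urgency"])
    return incident["reported_at"] + SLA_TARGETS[level]


def work_queue(incidents):
    """Order open incidents by priority, oldest first within the same priority."""
    return sorted(
        incidents,
        key=lambda i: (priority(i["impact"], i["urgency"]), i["reported_at"]),
    )
```

Sorting by priority first and report time second means a high-impact, high-urgency outage jumps the queue, while two incidents at the same level are handled in the order they came in.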
Incident response is a pretty broad term, so let’s break it down a bit further into the most likely steps you'll perform once you have identified, categorized, and prioritized an incident.
Think of initial diagnosis as the triage function that a hospital performs on new patients. The service desk employee is formulating a quick hypothesis around what is likely wrong, so they can either set about fixing it or follow the appropriate procedures and compile the right resources to get it resolved. Knowledge bases and diagnostic manuals are helpful tools at this step, too. If the first-level service desk agent is able to resolve the incident based on his or her own initial diagnosis and available knowledge and tools, the incident is resolved. Otherwise, it’s time to escalate.
Escalation sounds like a bad word, but it’s not. Your front-line support team should be able to resolve a large number of the most frequent incidents without escalating. But for those they can’t, the goal is to gather and log the right information to help second and third-level (more technical) support get up to speed quickly, so they can resolve the incident promptly.
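The resolve-or-escalate decision described above can be sketched in a few lines. Everything here is an assumption for illustration: the `KNOWLEDGE_BASE` entries, the keyword matching, and the three-tier limit are stand-ins for whatever knowledge base and support levels you actually have.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical knowledge base mapping a known symptom to a documented fix.
KNOWLEDGE_BASE = {
    "password reset": "Walk the user through the self-service reset page.",
    "vpn disconnect": "Have the user reinstall the VPN client profile.",
}


@dataclass
class Incident:
    ticket_id: str
    description: str
    tier: int = 1            # support level currently handling the incident
    status: str = "open"
    notes: List[str] = field(default_factory=list)


def triage(incident: Incident, max_tier: int = 3) -> Incident:
    """Resolve from the knowledge base if possible, otherwise escalate a tier."""
    text = incident.description.lower()
    for symptom, fix in KNOWLEDGE_BASE.items():
        if symptom in text:
            incident.notes.append(f"Tier {incident.tier}: applied known fix: {fix}")
            incident.status = "resolved"
            return incident
    if incident.tier < max_tier:
        # Log what was checked so the next tier does not start from scratch.
        incident.notes.append(
            f"Tier {incident.tier}: no known fix found; escalating with details so far."
        )
        incident.tier += 1
        incident.status = "escalated"
    return incident
```

The part worth copying is the note written before the hand-off. The escalation itself is trivial; the logged context is what lets the next level get up to speed quickly.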
Investigation and diagnosis
ITIL® calls this out as its own single step. In reality, it happens throughout the incident lifecycle. Your front line support person is already investigating, to an extent, when he or she collects information, and may successfully diagnose and even resolve the incident without any escalation required. In that case, you’ve skipped directly to the next few steps: resolution and recovery, and incident closure. Otherwise, investigation and diagnosis will happen at every step of the way as you escalate to level 2 and 3 support, or bring outside resources or other department members in to consult and assist with the resolution.
Resolution and recovery
Eventually–and, ideally, within your established service level agreements (SLAs)–you will arrive at a diagnosis and perform the necessary steps to resolve the incident. Recovery simply refers to the time it may take for operations to be fully restored, since some fixes (like bug patches) may require testing and deployment even after the proper resolution has been identified.
The incident is then passed back to the service desk (if it was escalated) to be closed. To maintain quality and ensure a smooth process, only service desk employees are allowed to close incidents, and the incident owner should check with the person who reported the incident to confirm that the resolution is satisfactory and the incident can, in fact, be closed.
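Those two closure rules (only the service desk closes, and only after the reporter confirms) are easy to enforce as a guard. This is a minimal sketch with hypothetical names such as `close_incident` and the `"service_desk"` role string; most help desk tools let you express the same thing through workflow permissions instead of code.

```python
from dataclasses import dataclass


class ClosureError(Exception):
    """Raised when an incident does not meet the conditions for closure."""


@dataclass
class Incident:
    ticket_id: str
    status: str = "resolved"
    reporter_confirmed: bool = False  # has the reporter accepted the fix?


def close_incident(incident: Incident, closed_by_role: str) -> Incident:
    """Close a resolved incident, enforcing the service desk's closure rules."""
    if closed_by_role != "service_desk":
        raise ClosureError("Only service desk employees may close incidents.")
    if incident.status != "resolved":
        raise ClosureError("Only resolved incidents can be closed.")
    if not incident.reporter_confirmed:
        raise ClosureError("The reporter has not confirmed the resolution.")
    incident.status = "closed"
    return incident
```

If a second-level engineer tries to close the ticket directly, or the reporter has not signed off yet, the call fails and the incident stays with the service desk.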
Conclusion: don’t skip steps
The process may seem unnecessarily formal, particularly if you only have a few service desk analysts. Regardless of your team structure, though, the incident lifecycle is still the same. Let’s say you only have one service desk analyst, so there is no formal level two or level three support. But incidents that surpass the knowledge of your service desk analyst have to go somewhere, whether it’s to your chief engineer or an outside consultant or even you, right? Voila! You do indeed have level two or level three support–it’s just you, or your engineer.
My point? Even though ITIL® can seem to be all about semantics, don’t get caught up in them. Look for easy ways to adapt your organizational hierarchy and process workflows to fit a simple IT service management framework like the one I outlined above. By doing so, you will deliver far better customer service, and deliver much more value back to the business. Plus, your hair will stop burning much faster (bonus points!).
And finally, a few reminders:
- Log every incident. Give it a unique number. And capture important details (like date, time, and description) in a central help desk system like Jira Service Desk.
- If you have a large internal or external audience to communicate incident updates to, consider a status page for incident communication.
- Assign every incident a category (and subcategory, as needed).
- Give every incident a priority level, and every priority level an SLA.
- Whenever possible, enable your front line support team with knowledge base articles and incident diagnostic scripts to help them resolve incidents quickly.
- Make sure the service desk always retains control of incident progress, routing, and status.
- Don’t just capture incident data. Analyze it! Look for trends, patterns, and potential underlying problems that can reduce incident volume and mitigate risk.
Coming up in part two, I'll walk through some expert tips for even better incident handling. See you on the next page!