My daughter’s teacher called me on the phone a few days ago, which is never good. “We had a little incident at school today,” she said. I braced myself. In teacher lingo, “we” always means your kid, and incidents always involve tetanus shots.
It turned out that no other children required medical treatment. But it resonated with me. In IT, we face incidents daily, and our goal is to minimize them — but what exactly does that mean? We hear a lot about incident resolution, but there’s more to it than that. Finding and fixing the incident (and underlying problems) is just part of minimizing the negative impact.
So what else should we be doing? I asked a ton of experts at Atlassian, and here are just a few of our resulting recommendations for better incident management.
Divide and conquer to resolve incidents faster
This tip came straight from Atlassian’s own IT team, and it doesn’t get nearly enough attention from most of the popular service management frameworks. The moment an incident is reported, everyone forms their own hypothesis or idea for a fix. But how do you pick the best one?
We recommend splitting the investigation into multiple independent streams of work at the earliest opportunity, so you can prove or disprove theories and decide on a course of action quickly.
Jim, one of our IT leads, points out that the idea of a single “root cause” is often a myth, since incidents can result from a combination of many distinct failures. Having several people explore several paths can shorten the time to resolution and help you see the full picture.
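To make the idea of independent work streams a little more concrete, here is a minimal Python sketch. Treat it as an illustration only: the three check functions and their canned results are hypothetical stand-ins for real investigations, and running them on a thread pool is just one way to picture several people testing theories at the same time.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical work streams. Each one tests a single theory and
# returns (confirmed, evidence). Real checks would query logs, metrics, etc.
def check_disk_full():
    return False, "disk usage at 41%"

def check_bad_deploy():
    return True, "error rate rose right after the last release"

def check_dns():
    return False, "all lookups resolve normally"

def investigate(streams):
    """Run every work stream in parallel and collect its findings."""
    findings = {}
    with ThreadPoolExecutor(max_workers=len(streams)) as pool:
        futures = {pool.submit(check): name for name, check in streams.items()}
        for future in as_completed(futures):
            findings[futures[future]] = future.result()
    return findings

streams = {
    "disk full": check_disk_full,
    "bad deploy": check_bad_deploy,
    "DNS failure": check_dns,
}

findings = investigate(streams)

# More than one theory may be confirmed: there is not always a single cause.
confirmed = sorted(name for name, (ok, _) in findings.items() if ok)

for name in sorted(findings):
    ok, evidence = findings[name]
    print(f"{name}: {'confirmed' if ok else 'ruled out'} ({evidence})")
```

Note that `confirmed` is a list rather than a single value, which leaves room for several contributing failures instead of forcing one root cause.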
As an extra tip, make sure each work stream stays closely coordinated with the rest of the team, particularly the incident manager. Which leads so naturally (see what I did there?) into our next tip . . .
Assign clear roles, and work together
Even if an incident looks really, really major, always resist the urge to light your hair on fire. It almost never helps, and the smell is generally awful, as I noted in part one of this series. Instead, make sure you have clear roles defined, so you know who is accountable for what.
A strong incident team comprises the following roles. It’s not uncommon in smaller organizations for one person to wear several hats–just make sure the key functions are covered and clear accountability is in place.
- Incident manager — Builds the incident team and steers the team through the process.
- Service operations engineer — Performs the initial assessment and implements fixes.
- Subject matter expert(s) — Diagnose the faults that caused the failure and propose fixes and workarounds.
- Release manager — Ensures that emergency releases of new versions of software products are done quickly and safely.
- Internal communications manager — Handles communication with relevant internal staff
- External communications manager — Handles communication with customers (probably the least appealing of all these roles, so make sure you keep a steady stream of donuts and coffee flowing to their desk)
And speaking of external communications... (See? I did it again!)
Alert your customers, not the other way around
During an incident of any nature, it’s natural to focus on fixing the problem, not communicating with hundreds or thousands of customers both inside and outside your organization. But nothing is more frustrating to customers than having something stop working, trying to figure out what is wrong, and eventually even calling to report a suspected outage–only to be met with a “we know, and we’re working on it” after their hour of effort.
Proactively communicating IT incidents shows you care – and that you're in control.
At Atlassian, we recommend:
- Putting a monitoring system in place to proactively detect issues, if you don’t already have one.
- Assigning internal and external communications managers to major incidents in particular, so it’s clear who is accountable for effective and proactive customer communications.
- Establishing a dedicated channel for publishing or broadcasting known issues or outages–and even calling your top customers proactively for outages that affect them. Check out our very own Atlassian Cloud System Status as one example. We also publish service status pages for Bitbucket, Hipchat, and just about all of our other cloud services.
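As a rough illustration of how monitoring and proactive communication fit together, the Python sketch below runs a set of health probes and drafts a status notice for anything that fails. Everything in it is an assumption made for the example: the service names, the probe functions, and the `StatusUpdate` shape are hypothetical, and it is not how Atlassian's status pages actually work.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Dict, List

@dataclass
class StatusUpdate:
    service: str
    state: str
    message: str
    posted_at: datetime

def run_checks(probes: Dict[str, Callable[[], bool]]) -> List[StatusUpdate]:
    """Probe each service and draft a notice for every one that looks unhealthy."""
    updates = []
    for service, probe in probes.items():
        try:
            healthy = probe()
        except Exception:
            # A probe that blows up is treated the same as a failed check.
            healthy = False
        if not healthy:
            updates.append(StatusUpdate(
                service=service,
                state="investigating",
                message=f"We are aware of a problem with {service} and are working on it.",
                posted_at=datetime.now(timezone.utc),
            ))
    return updates

# Hypothetical probes. Real ones might hit a health endpoint or read a metric.
def wiki_probe() -> bool:
    raise TimeoutError("no response")

probes = {
    "issue tracker": lambda: True,
    "wiki": wiki_probe,
}

updates = run_checks(probes)
for update in updates:
    print(f"[{update.state}] {update.service}: {update.message}")
```

The point of the design is the ordering: the notice is drafted the moment the probe fails, so the communications managers have something to publish before the first customer picks up the phone.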
Keep track of what counts
Most of you reading this are already using some form of service desk software, even if it’s homegrown. Whether you are using Jira Service Desk or not, it’s critical that you aren’t just using free-form data entry fields to capture the details of each ticket.
We recommend using intuitive, meaningful categories to classify every incident, so you can perform regular analysis and look for patterns that may signal something much larger. Be careful of category overload, though–it’s easy to get carried away.
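Here is a small Python sketch of why fixed categories beat free-form fields when it comes to analysis. The category names and ticket data are invented for illustration, and this is plain Python rather than anything specific to Jira Service Desk or another tool.

```python
from collections import Counter
from enum import Enum

# A deliberately short, hypothetical category list. Keeping it small
# is one way to avoid category overload.
class Category(Enum):
    NETWORK = "network"
    HARDWARE = "hardware"
    SOFTWARE = "software"
    ACCESS = "access"

# Made-up tickets for the example.
tickets = [
    {"id": 1, "category": Category.NETWORK},
    {"id": 2, "category": Category.SOFTWARE},
    {"id": 3, "category": Category.NETWORK},
    {"id": 4, "category": Category.ACCESS},
    {"id": 5, "category": Category.NETWORK},
]

def top_categories(tickets, n=3):
    """Count incidents per category and return the n most frequent."""
    counts = Counter(ticket["category"] for ticket in tickets)
    return counts.most_common(n)

top = top_categories(tickets, n=1)
for category, count in top:
    print(f"{category.value}: {count} incidents")

# Free-form values are rejected instead of silently creating new buckets.
try:
    Category("the printer thing again")
    rejected = False
except ValueError:
    rejected = True
```

With a fixed set of categories, a cluster such as three network incidents stands out right away. With free-form text, the same three tickets might be spelled three different ways and never be counted together.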
Most importantly, remember: incident management is not the end-game
Despite all the emphasis on managing incidents and restoring service, that's not your final destination. Across the business, the true goal is to become extremely agile by reflecting on and learning from past incidents, preventing problems altogether, and dedicating people and resources to paying down technical debt.
And finally, a few critical reminders that don’t need as much space:
- Pace yourself and your team. Yes, you should work swiftly. No, you should not self-induce cardiac arrest from the stress. Consider every decision, re-think as needed, and ask someone you trust to validate.
- Have a plan for measuring your fix. What are the expected results or outputs? Define what a successful fix will accomplish upfront, so you know when you’ve succeeded (or not).
- Be bold. Indecision can lead to paralysis. Rely on the skills of your team and trust your judgment. And finally, don’t mistake being prepared to make a decision for being a carelessly smug jerk. Remember that, and you’ll do fine.