Let’s start with something truly scandalous. In fact, it’s probably the most controversial thing ever written about ITSM, particularly Problem Management. I’m giddy with excitement, imagining the Twitter uproar (please use hashtag #ITSMwtf) this is going to cause.
Hide your kids. Pull the blinds. And don’t say I didn’t warn you. Okay, here goes:
Problem management isn’t about finding and fixing problems.
There, I said it. And I stand behind it. In day-to-day operations, it’s easy to get hyper-focused on root cause analysis and forget the much bigger picture. So let’s take a look at some of the most common obstacles that IT teams run into as they work relentlessly to keep all the alarms and sirens from going off at once. You'll walk away with some great tips not only for troubleshooting problems, but for preventing them altogether.
Problem #1: Falling into the reactive, root-cause trap
When incoming tickets are bombarding you all day long on the front lines of IT, it’s common to fall into an autopilot “find it and fix it” mode. In fact, many standard service desk metrics encourage agents to resolve as many issues as possible, and rightfully so.
So what’s my beef with root cause analysis? Nothing, except that it’s only a fraction of the true responsibility (and opportunity to add value back to the business) of the Problem Management process. As opposed to just reacting to problems, the true purpose of Problem Management is and always will be to prevent recurrence of incidents, so that IT service can be continuous and problem-free.
To do this, we recommend broadening the definition of Problem Management in your organization. Root cause analysis is part of the picture, but here is the full scope of what your Problem Management practice should be held accountable for:
- Preventing recurring incidents, and the service disruptions they can cause
- Keeping the impact of incidents to a minimum when they can’t be prevented altogether
- Updating information about problems and workarounds religiously, and ensuring that agents know where to find it and how to use it
- Making sure the right processes are followed at every step
To be fair, this is completely consistent with many best-practice frameworks such as ITIL. In practice, though, high ticket volume and limited resources can make it easy to overlook critical functions like updating the knowledge base, or predictive monitoring that can significantly cut the likelihood and severity of future outages. Don’t fall into that trap! You simply can’t afford to.
Problem #2: Failing to share your problems
No, I’m not asking you to read a self-help book (nothing in this article should be seen as a substitute for clinical therapy, if needed). Instead, I’m calling out another common misconception in the practice of Problem Management: that it is a single role, or the responsibility of a small subset of your service desk team.
Oh, the shame. Yes, Problem Manager is a role. And although ITIL’s responsibility matrix makes it look like the Problem Manager is both accountable for the process and responsible for doing the work at every step of the way, you’ll note that applications and technical analysts step in and do a ton of heavy lifting throughout the entire problem diagnosis and resolution phase.
Our stance? It takes a village, friends. As an IT team, we collectively reject the idea of a single root cause. Usually, several distinct failures lead up to a problem, so we encourage our specialized teams (networks, infrastructure, virtual hosting, etc.) to attack the challenge from multiple angles. Together, they work as an agile problem response team, uncovering and exploring a variety of theories or potential avenues for resolution.
On the surface, it might seem a bit luxurious or resource intensive. But time and time again, we’ve solved complex incidents that turned out to be culminations of many distinct failures. Without these collaborative work streams, it would have taken us days (instead of hours) to uncover the complexity behind the real root cause.
Our recommendations? First, if you don’t have the resources you need to deploy your own agile response teams during the diagnosis and resolution phase, you’ll need to assert yourself until you do. Trust me: the implications (in service disruption and lost productivity) of not having these resources far outweigh the costs of being prepared.
Second, don’t fence people in. Encourage an open, interactive environment where service desk agents help, mentor, and encourage each other. Trust your highly specialized experts, but ask for contributions and perspective from even your most junior analysts, too. The job is to prevent problems, and minimize the impact of those you can’t prevent altogether. And you’ll get there faster if your team works together.
Also, don’t underestimate the importance of having the right tools for the job. Atlassian-made or otherwise, you need strong collaboration tools for chatting and sharing knowledge, tracking processes, and auditing your performance.
Problem #3: Asking far too few questions
I have a bone to pick with a few mechanics I’ve taken my car to recently. Here’s why. While I describe the symptom, they pretend to listen and nod their heads, and then hook my car up to a machine that tells them nothing is wrong. I pay for the “check up,” and two days later, I’m stranded on the side of the road.
At some point, it seems that they forgot how to think and ask questions – most importantly, “why?”
Ironically, it was the auto industry that developed one of the best techniques ever for determining the root cause of a problem. It’s called “The 5 Whys,” and it was pioneered by Sakichi Toyoda and used extensively at the Toyota Motor Corporation.
It’s a simple, brilliant methodology that works just as well in IT as it does in manufacturing. The best way to explain “Five Whys” is with an example. I stole this one from Wikipedia:
First, state the problem: The vehicle will not start.
Why? - The battery is dead. (First why)
Why? - The alternator is not functioning. (Second why)
Why? - The alternator belt has broken. (Third why)
Why? - The alternator belt was well beyond its useful service life and not replaced. (Fourth why)
Why? - The vehicle was not maintained according to the recommended service schedule. (Fifth why, a root cause)
Unlike in Six Degrees of Kevin Bacon, it’s okay to take more than the prescribed number of steps to get to the answer. If you need seven “whys” to get to the root cause, use seven. Five is just a generally sufficient number that sounds nice for marketing purposes. The point is to take a logical, stepwise approach that encourages troubleshooters to set aside their assumptions and carefully trace the possible causes until they arrive at the root problem.
In fact, we recommend using “Five Whys” as an early exercise within your agile problem response teams, to help identify some of the possible angles you will approach the problem from. Many times, the answers to each “why” can reveal one or more hypotheses that are worth testing and exploring.
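If your team prefers to capture this exercise in a script or a ticket template rather than on a whiteboard, here is a minimal Python sketch of the idea. It simply records a problem statement and the answer to each successive “why,” and treats the deepest answer as the current root-cause candidate. The `WhyChain` class and its method names are hypothetical, invented for this illustration; they are not part of ITIL or any particular service desk tool.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class WhyChain:
    """Hypothetical helper: a problem statement plus the answers to each "why?"."""
    problem: str
    answers: List[str] = field(default_factory=list)

    def ask_why(self, answer: str) -> "WhyChain":
        """Record the answer to the next "why?" and return self so calls can be chained."""
        self.answers.append(answer)
        return self

    def root_cause(self) -> Optional[str]:
        """Return the deepest answer so far, or None if no "why" has been asked yet."""
        return self.answers[-1] if self.answers else None

    def report(self) -> str:
        """Render the chain as plain text, one numbered "why" per line."""
        lines = [f"Problem: {self.problem}"]
        for depth, answer in enumerate(self.answers, start=1):
            lines.append(f"Why #{depth}: {answer}")
        return "\n".join(lines)


# The vehicle example from the text. Nothing limits the chain to five steps.
chain = (
    WhyChain("The vehicle will not start.")
    .ask_why("The battery is dead.")
    .ask_why("The alternator is not functioning.")
    .ask_why("The alternator belt has broken.")
    .ask_why("The alternator belt was well beyond its useful service life and not replaced.")
    .ask_why("The vehicle was not maintained according to the recommended service schedule.")
)

print(chain.report())
print("Root cause candidate:", chain.root_cause())
```

Because the answers live in an ordinary list, each one can be handed to a different specialist as a hypothesis to test, and a sixth or seventh “why” is just another `ask_why` call.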
Problem #4: Not spreading the knowledge
And finally, problem #4. You could also call this “Not closing the deal,” because it speaks directly to the tendency of problem management teams to “resolve and run.”
In your own life, when you learn something new or solve a tough problem, you can commit the result to memory so you benefit from it again. This generally works really well unless you are a teenager, in which case, all bets are off.
In a team environment, though, your own memory is the least beneficial place to retain the knowledge you learn. At minimum, it shouldn’t be the only place. Which is exactly where a knowledge base comes in. It’s a centralized place to store and search for articles that can aid in the problem-solving process.
I could write an entire blog post singing high praises for knowledge-centered support, but as luck would have it, Sarah Zorah already did. She tells you why knowledge management should be at the heart of your service desk, and how to get started.
My favorite point from her post: knowledge-based support is not something to do in addition to solving issues. It’s actually the way in which you resolve issues. Writing or updating knowledge base articles, then, isn’t a burdensome extra step preventing you from moving on to the next problem: It’s the most critical step to preventing (or minimizing the impact of) future ones.
If maintaining and updating a knowledge base isn’t already a central part of your service desk process, or you’re just missing the right software to make it happen, drop everything and close this gap today. It’s never too late to get started, but you are losing valuable knowledge (and increasing the business’s exposure to risk) every extra day you wait.
My conclusion today is far less scandalous than my intro, I’m afraid. Problem Management, like every discipline of ITSM, is a practice, which means you won’t be inherently perfect at it from the start. By simply looking at it as the sum of its parts – with an eye toward preventing problems, not just troubleshooting them – you’ll be building a much stronger, more sustainable service desk, which leads to happier customers and a more profitable business, too. And I see absolutely no problems with that.