What is problem management?
Problem management is the process of identifying and managing the causes of incidents that affect an IT service. It is a core component of ITSM frameworks.
The closer you get to real incident experts, the less you actually hear the question: “What caused the incident?” Sure, you’ll hear it plenty from executives, and customers, and the press. But the experts know better.
That’s because the answer to “What caused the incident?” is often dry and unhelpful: a rewritten config file, a corrupted database entry.
But what were the contributing causes behind the thing that caused the incident? What were the factors that led up to the incident? How is it possible that a config file could be rewritten? What conditions create a corrupted database entry? These are the questions you hear experts ask. And they’re at the heart of problem management.
Problem management isn’t just about finding and fixing incidents. It’s also about identifying and understanding the underlying causes of an incident and the best way to eliminate them. Moreover, pinpointing a cause has little value to an organization if the work is an isolated process completed by a siloed team, so problem management should be continuous and widely practiced across multiple teams, including IT, security, and software development. An incident may be over once the service is up and running again, but until the underlying causes and contributing factors are addressed, the problem remains.
What are the benefits of problem management?
Done right, problem management unleashes many benefits for the business.
Decrease time to resolution
Teams that uncover the problems behind today’s incident will be better prepared to tackle incidents in the future. By codifying best practices around problem analysis, teams can respond and take action more quickly during the next service disruption.
Avoid costly incidents
Avoiding incidents will save time, money, and lots of pain. According to Gartner, many organizations report downtime costing more than $300,000 per hour. For some web-based services, that number can be dramatically higher.
Stop responding to incidents so frequently and return resources and time to teams who could be shipping new value to customers.
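To make the cost argument concrete, here is a rough back-of-the-envelope sketch in Python. The $300,000 hourly figure is the one cited above; the incident durations and the function name are hypothetical and only illustrate the arithmetic, not any real outage.

```python
# Hourly downtime cost cited above; real figures vary widely by organization.
HOURLY_DOWNTIME_COST_USD = 300_000


def downtime_cost(minutes_down: float, hourly_cost: float = HOURLY_DOWNTIME_COST_USD) -> float:
    """Estimate the cost of an outage lasting `minutes_down` minutes."""
    return hourly_cost * minutes_down / 60


# Hypothetical durations (in minutes) of three repeat incidents
# that share one unresolved underlying problem.
repeat_incident_minutes = [45, 30, 90]

total_cost = sum(downtime_cost(m) for m in repeat_incident_minutes)
print(f"Estimated cost of the repeat incidents: ${total_cost:,.0f}")
# prints "Estimated cost of the repeat incidents: $825,000"
```

Under these assumptions, eliminating the shared problem after the first incident would have avoided the cost of the two repeats.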
Empower your team to find and learn from underlying causes
When organizations effectively practice problem management, teams continually investigate, learn from incidents, and ship valuable updates. Unfortunately, many enterprises create a siloed problem management team that is too far removed from day-to-day operations to eliminate the most pressing problems.
Promote continuous service improvement
Problem management prevents incidents and also delivers value. For instance, fixing a problem that causes degraded performance also ships a valuable improvement in service quality.
Increase customer satisfaction
Better problem management leads to fewer incidents and happier customers. Conversely, customers’ patience wears thin when they notice the same incident happening again and again. Decreasing the occurrence of repeat incidents builds customer trust.
The problem management process
At Atlassian, we advocate bringing the problem and incident management processes closer together.
When problem management is a heavy, siloed, and separate process, companies can end up creating a dumping ground of problems. In some teams, this backlog is where problem issues go to die. It’s best to get problems in front of the teams that can handle them and run valuable investigations.
That said, it’s good to understand the main steps that make up a problem management process:
- Problem detection - Proactively find problems so they can be fixed, or identify workarounds before future incidents happen.
- Categorization and prioritization - Track and assess known problems to keep teams organized and working on the most relevant and high-value problems.
- Investigation and diagnosis - Identify the underlying contributing causes of the problem and the best course of action for remediation.
- Create a known error record - In ITIL, a known error is “a problem that has a documented root cause and a workaround.” Recording this information leads to less downtime if the problem triggers an incident. The record is typically stored in a known error database (KEDB).
- Create a workaround, if necessary - A workaround is a temporary solution for reducing the impact of problems and keeping them from becoming incidents. Workarounds aren’t ideal, but they can limit business impact and avoid a customer-facing incident if the problem can’t be easily identified and eliminated.
- Resolve and close the problem - A closed problem is one that has been eliminated and can no longer cause another incident.
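The steps above can be sketched as a small problem record that moves through a lifecycle. This is a minimal illustration in Python, not how any particular ITSM tool models problems: the `Problem` class, the `Status` names, the priority scale, and the sample problem are all hypothetical, and a plain dictionary stands in for a known error database.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional


class Status(Enum):
    DETECTED = 1
    PRIORITIZED = 2
    INVESTIGATING = 3
    KNOWN_ERROR = 4
    CLOSED = 5


@dataclass
class Problem:
    title: str
    status: Status = Status.DETECTED
    priority: Optional[int] = None  # 1 = most urgent (hypothetical scale)
    contributing_causes: List[str] = field(default_factory=list)
    workaround: Optional[str] = None

    def prioritize(self, priority: int) -> None:
        """Categorization and prioritization step."""
        self.priority = priority
        self.status = Status.PRIORITIZED

    def start_investigation(self) -> None:
        """Investigation and diagnosis step."""
        self.status = Status.INVESTIGATING

    def record_known_error(self, causes: List[str], workaround: Optional[str] = None) -> None:
        """Document the causes and, if one exists, the workaround."""
        if not causes:
            raise ValueError("a known error needs at least one documented cause")
        self.contributing_causes = list(causes)
        self.workaround = workaround
        self.status = Status.KNOWN_ERROR

    def close(self) -> None:
        """Mark the problem as eliminated."""
        self.status = Status.CLOSED


# A stand-in for a known error database, keyed by problem title.
known_errors: Dict[str, Problem] = {}

# Hypothetical problem, detected after an incident.
problem = Problem("Config file rewritten during deploy")
problem.prioritize(1)
problem.start_investigation()
problem.record_known_error(
    causes=["deploy script runs with write access to live config"],
    workaround="restore config from last known good copy",
)
known_errors[problem.title] = problem
print(problem.status.name, "-", problem.workaround)

# Once the underlying cause is fixed, close the problem and retire the record.
problem.close()
known_errors.pop(problem.title)
print(problem.status.name)
```

Keeping the workaround optional mirrors the "if necessary" wording in the list; a stricter model could require one before a problem counts as a known error.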