Incident management for high-velocity teams

How to choose incident management tools

Categories, key features, and what to look for

There is no single, one-size-fits-all tool for incident management.

The best-performing incident teams use a collection of the right tools, practices, and people.

Some tools are specific to incident management, while others are general-purpose tools your team also uses for other tasks. Some might be a fully bespoke experience built on layers of integrations and customization.

No matter the use case, good incident management tools have a few things in common: they are open, reliable, and adaptable.

Open: In a high-pressure environment like an incident, it’s key that the right people have immediate access to the right tools and information. This goes not only for incident responders but also for company stakeholders who need visibility into response efforts.

Reliable: Few things are worse during incident response than having your key response tools go down as well. Using cloud tools, like Slack and Opsgenie, minimizes the risk of an outage on your infrastructure taking down your response tools.

Adaptable: Integrations, workflows, add-ons, customization, and APIs all extend what the product can do. You may want to get started with an out-of-the-box configuration, but as your practices and processes mature, you'll want your tools to be flexible enough to support changing needs.

Before the incident

Monitoring

Monitoring systems let DevOps and IT Ops teams collect and aggregate data from thousands of different services in real time and trigger alerts based on that data. These systems are critical to providing full visibility into the health of your services and often sound the first alarm bells during an incident.

Benefits

Monitoring tools give your team constant insight into the health of the infrastructure. Modern monitoring tools also proactively trigger alerts during unexpected activity.
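As a rough illustration of threshold-based alerting, the sketch below compares the latest metric samples against configured limits. The `Threshold` class, the `breached` function, and the metric names are hypothetical and are not taken from any particular monitoring product.

```python
from dataclasses import dataclass


@dataclass
class Threshold:
    """A hypothetical alerting rule: fire when `metric` rises above `limit`."""
    metric: str
    limit: float


def breached(samples: dict[str, float], thresholds: list[Threshold]) -> list[str]:
    """Return the names of metrics whose latest sample exceeds its limit.

    A metric with no sample is treated as not breached (an assumption made
    for this sketch; a real tool might alert on missing data instead).
    """
    return [t.metric for t in thresholds if samples.get(t.metric, 0.0) > t.limit]


limits = [Threshold("cpu_percent", 90.0), Threshold("error_rate", 0.05)]
print(breached({"cpu_percent": 97.0, "error_rate": 0.01}, limits))
```

In practice the list of breached metrics would be handed to an alerting and on-call tool rather than printed.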

Features

Feature set: 24/7 coverage and analytics
Questions to ask:
  • Does the tool have visibility into all my servers and infrastructure?
  • Can my team see real-time analytics and dashboards and set alerting thresholds?

Feature set: Integrates with alerting tools
Questions to ask:
  • Does the product integrate with my alerting and on-call tool?

Service desk

Service desk software gives customers and employees a place to report incidents and potential incidents.

Benefits

Along with their many other use cases (service requests, IT help desk), service desks empower your team to quickly learn about incidents from the people who matter most: your users and customers.

Features

Feature set: Enable self-service
Questions to ask:
  • Can customers quickly file tickets through a self-service support portal?
  • Can customers find the help they need with automated knowledge-based suggestions?

Our recommendation: Jira Service Management

Alerting and on call

Prompt and reliable alerting is a critical step in incident response. This is how teams make sure the right people are made aware of an incident.

Benefits

Alerting tools notify designated on-call responders through a sophisticated combination of scheduling, escalation paths, and notifications.
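To make the idea of an escalation path concrete, here is a minimal sketch that works out who should have been notified, and over which channels, once an alert has gone unacknowledged for a given number of minutes. The `EscalationStep` class, the `due_notifications` function, and the responder and channel names are all hypothetical; real on-call tools model this in their own ways.

```python
from dataclasses import dataclass


@dataclass
class EscalationStep:
    """One hypothetical rung of an escalation path."""
    responder: str
    after_minutes: int          # notify once the alert has been unacknowledged this long
    channels: tuple[str, ...]   # notification methods to try, in order


def due_notifications(steps: list[EscalationStep], minutes_unacknowledged: int) -> list[tuple[str, str]]:
    """Return (responder, channel) pairs for every step that is due by now."""
    return [
        (step.responder, channel)
        for step in steps
        if step.after_minutes <= minutes_unacknowledged
        for channel in step.channels
    ]


policy = [
    EscalationStep("primary on-call", 0, ("push", "sms")),
    EscalationStep("secondary on-call", 10, ("voice",)),
]
print(due_notifications(policy, 5))
print(due_notifications(policy, 10))
```

The second call also includes the secondary responder, because the alert has stayed unacknowledged long enough to reach the next step.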

Features

Feature set: Works globally
Questions to ask:
  • Can I send notifications (SMS, voice, email) to almost anywhere?

Feature set: Multiple notification methods
Questions to ask:
  • Can I send notifications using multiple methods, like email, SMS, phone, and mobile app push, and retry them multiple times?

Our recommendation: Opsgenie

During the incident

Leveraging a Configuration Management Database (CMDB) for a faster resolution

Understanding the interdependencies within your infrastructure is key to determining the full impact of the incident and reaching a faster resolution.

Benefits

A CMDB helps you understand the relationships and dependencies within your IT infrastructure. If something goes down, this map lets you rapidly find:

  • Potential causes of the incident. For example, determining which host a service is running on at the click of a button.
  • Trickle-down effects of the incident. For example, discovering other services that are running on the same troublesome host.

This means you can quickly investigate and communicate all aspects of the incident.
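The two lookups described above can be sketched with a plain service-to-host mapping. This is only an illustration: the `RUNS_ON` data, the service and host names, and the `host_of` and `co_located_services` functions are hypothetical, and a real CMDB stores far richer relationships.

```python
# Hypothetical CMDB extract: which host each service runs on.
RUNS_ON = {
    "checkout-api": "host-a",
    "search": "host-b",
    "payments": "host-a",
}


def host_of(service: str) -> str:
    """Potential cause: find the host a given service is running on."""
    return RUNS_ON[service]


def co_located_services(service: str) -> list[str]:
    """Trickle-down effects: other services that share the same host."""
    host = RUNS_ON[service]
    return sorted(s for s, h in RUNS_ON.items() if h == host and s != service)


print(host_of("checkout-api"))
print(co_located_services("checkout-api"))
```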

 

 

Feature set: Flexibility
Questions to ask:
  • How flexible is the CMDB? Can I store any CI or asset?

Feature set: Integrations
Questions to ask:
  • Can I visualize my infrastructure graphically?
  • Can I link CIs/assets with my service desk issues?
  • Can I link CIs/assets to change requests?

Our recommendation: Insight

Respond to incidents faster

Map your infrastructure and its dependencies natively within Jira. Quickly find and resolve the cause of incidents and increase uptime.

Team communication

Clear and reliable communication is undeniably critical during incident management.

Benefits

A solid communication platform helps teams communicate and share observations, links, and screenshots in a way that’s timestamped and preserved. This brings the right information and people together during an incident and creates a rich record to learn from afterward.

Features

Feature set: Multiple channels
Questions to ask:
  • Can my incident response team quickly spin up a dedicated channel for an incident?

Feature set: Integrations
Questions to ask:
  • Can other tools in my incident toolchain post into my team's communication channel?

Our recommendation: Slack (text), Zoom (video)

Customer communication

Customer communication tools help keep customers informed during an incident.

Benefits

There’s no getting around it: incidents are typically a bad experience for your customers. Keeping customers informed builds trust and speeds up response efforts. Communicating with customers lets them know you’re aware of the incident and working on a fix.

Features

Feature set: Off of my infrastructure
Questions to ask:
  • Will my communication tool be operational and accessible even if my internal infrastructure is down?

Feature set: Subscribers and notifications
Questions to ask:
  • Can customers opt in to get notifications when I post about an incident?

Our recommendation: Statuspage

Incident command center

An incident command center is wherever your canonical record of the incident and its key details lives. This could be an incident tool like Opsgenie or an issue tracking tool like Jira.

Benefits

A command center tool offers one place to get everyone up to speed during and after an incident, listing key details like incident status, associated alerts, updates, and more. It also provides a historical record of the incident and its associated response effort.
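One way to picture such a record is as a single chronological timeline merged from several sources. The sketch below is a simplified assumption of how that aggregation could work; the `Event` type, the `build_timeline` function, and the sample events are hypothetical.

```python
from datetime import datetime
from typing import Iterable, NamedTuple


class Event(NamedTuple):
    """A hypothetical timeline entry from any tool in the incident toolchain."""
    at: datetime
    source: str
    note: str


def build_timeline(*feeds: Iterable[Event]) -> list[Event]:
    """Merge events from every feed into one list ordered by timestamp."""
    return sorted((event for feed in feeds for event in feed), key=lambda e: e.at)


alerts = [Event(datetime(2024, 1, 1, 9, 5), "alerting", "Pager fired")]
chat = [
    Event(datetime(2024, 1, 1, 9, 0), "chat", "Customer report"),
    Event(datetime(2024, 1, 1, 9, 10), "chat", "Rollback started"),
]
for event in build_timeline(alerts, chat):
    print(event.at.isoformat(), event.source, event.note)
```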

Features

Feature set: Source of truth
Questions to ask:
  • Can team members and stakeholders use this record to locate all the other details of the incident and response activities?

Feature set: Timeline
Questions to ask:
  • Does the tool aggregate a chronological timeline of key events?
  • Can team members and stakeholders quickly get up to speed on the incident?

Our recommendation: Opsgenie

After the incident

Postmortem and analysis

A postmortem is a written record of what happened during the incident and of any follow-up actions taken to prevent it from happening again.

Benefits

After an incident is resolved, teams often still don’t know the root causes and risk having the same incident happen again. Postmortems help prevent that by bringing the team together for a post-incident analysis.

Features

Feature set: Templates
Questions to ask:
  • Can my team use a template to fill out a postmortem?

Feature set: Map out next actions
Questions to ask:
  • Can my team plan out next actions and remediation work during a postmortem?

Our recommendation: Opsgenie

Issue tracking

An issue tracking tool helps the team map out future remediation work that needs to be done.

Benefits

In many cases, resolving the incident brings the service back online without addressing the root cause. Typically, more engineering work is needed to remediate root causes and make sure the incident doesn’t repeat itself. Issue and work tracking tools, which your team is hopefully already using for other development work, help make sure this work is prioritized and doesn’t fall through the cracks.

Features

Feature set: Shared workflow pipeline
Questions to ask:
  • Can my team plan any incident remediation work alongside their other work and priorities?

Feature set: Integrations
Questions to ask:
  • Can my team pull in data and content from my other incident tools?

Our recommendation: Jira Software
