How teams are adopting ChatOps for incident management
It’s no secret that the better your communication, the better your incident management.
Teams with strong communication and collaboration practices resolve incidents faster while keeping both internal teams and external users happier. They’re also better equipped for postmortems that help identify core problems and prevent future incidents.
So it’s no surprise that ChatOps has become an integral part of many incident management teams’ processes.
As Sean Regan, head of product marketing for Jira and Bitbucket, puts it, ChatOps is conversations put to work:
“ChatOps is a collaboration model that connects people, tools, process, and automation into a transparent workflow. This flow connects the work needed, the work happening, and the work done in a persistent location staffed by the people, bots, and related tools. The transparency tightens the feedback loop, improves information sharing, and enhances team collaboration. Not to mention team culture and cross-training…”
“Chat represents a new way to capture the collective knowledge of a team and use it to drive lasting change in how products are delivered and how people work together. Talking about it doesn’t feel like real change, but once you start working this way, you can’t imagine ever going back to the old way.”
How does ChatOps work in incident management?
In the context of incident management, ChatOps brings the incident workflow into one place to keep teams agile and on the same page.
It centralizes all communication about incidents, incident reports, plans, and progress, keeping everyone up to speed in real time. And it provides a place for DevOps, IT, communications, security, legal, and other relevant teams to collaborate on not only incident resolution, but also future incident prevention and risk mitigation.
Break down information silos during incidents
Everyone has access to the same information
The more siloed your incident conversations, the more chances there are for communication errors that derail incident resolution. Bringing everyone into a single chat room reduces that risk.
Conversations are in real time
Because conversations happen in real time, everyone who needs to be in the loop and take action, from developers resolving incidents to social media managers reassuring end users, stays up to speed without delay.
Less context switching
Without ChatOps, incident management typically happens across a variety of applications and is communicated by email, phone, text, and so on. That means a lot of context switching and a lot of brainpower spent keeping track of everything.
ChatOps streamlines everything, as much as possible, into one place. Alerts come into the chat. Reports come into the chat. Conversations happen in the chat. So there’s only one place incident teams have to go to get the latest information.
No he-said-she-said-they-said games of telephone
Anyone who is familiar with the old game of telephone knows that it only takes one or two hand-offs to entirely change a message. ChatOps reduces this risk: if everyone has access to the same original conversations, the chance of communication errors drops significantly.
A built-in written record for incident postmortems
What went wrong? How long did it take to resolve the incident? What solved the problem in the end? Is the fix something we can automate in future?
These are the kinds of questions you’ll likely be investigating in an incident postmortem. And with a single, time-stamped record of all communications, it’ll be a lot easier to answer them clearly and correctly.
ChatOps best practices for incident management
Connect your alert system with your chat
The more your developers have to jump in and out of chat to resolve an incident, the more time you lose to task switching. Instead of sending alerts to email and phone during an incident, push them directly into your chat room to help speed up resolution.
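As a rough illustration of pushing an alert into a chat room, the sketch below formats an alert as a JSON body with a "text" field, which is the shape that incoming webhooks in tools such as Slack accept. The webhook URL, the alert fields (severity, service, summary), and the function names are all assumptions made for this example, not part of any particular product.

```python
import json
import urllib.request

# Hypothetical placeholder; a real incoming-webhook URL comes from your chat tool.
WEBHOOK_URL = "https://chat.example.com/hooks/incident-room"


def build_alert_request(alert: dict, url: str = WEBHOOK_URL) -> urllib.request.Request:
    """Turn an alert (assumed keys: severity, service, summary) into an HTTP POST."""
    text = f"[{alert['severity'].upper()}] {alert['service']}: {alert['summary']}"
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_alert(alert: dict, url: str = WEBHOOK_URL) -> int:
    """Post the alert to the chat room and return the HTTP status code."""
    request = build_alert_request(alert, url)
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status
```

Building the request separately from sending it keeps the message format easy to check without touching the network.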
Set intelligent thresholds for your alerts
Alert fatigue is a very real threat, especially in the midst of a major incident. So, when we suggest feeding alerts directly into your chat, we don’t mean every alert.
Which alerts will help your team respond quickly and fully to an incident? Which alerts are just more noise? Which alerts are duplicates?
Ask these questions up front and set intelligent alert thresholds for your chat to keep things streamlined and reduce the risk of teams missing something important due to alert fatigue. A tool like OpsGenie allows you to configure which actions are sent to a chat room and filter alerts based on their properties.
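To make the idea of thresholds and de-duplication concrete, here is a minimal sketch of a filter that sits between an alert source and the chat room. The severity levels, the Alert fields, and the rule that two alerts are duplicates when they share a service and a check are assumptions for illustration; a real alerting tool will have its own properties and filter settings.

```python
from dataclasses import dataclass

# Assumed severity scale for this example, lowest to highest.
SEVERITY_RANK = {"info": 0, "warning": 1, "critical": 2}


@dataclass(frozen=True)
class Alert:
    service: str
    check: str
    severity: str


def alerts_for_chat(alerts, min_severity="warning"):
    """Keep alerts at or above the threshold, dropping repeats of the same service and check."""
    threshold = SEVERITY_RANK[min_severity]
    seen = set()
    selected = []
    for alert in alerts:
        if SEVERITY_RANK[alert.severity] < threshold:
            continue  # below the threshold: treat as noise for the chat room
        key = (alert.service, alert.check)
        if key in seen:
            continue  # duplicate of an alert already selected
        seen.add(key)
        selected.append(alert)
    return selected
```

With the default threshold, informational alerts never reach the room, and a check that fires repeatedly shows up only once.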
Set up a separate room for each major incident
Teams handling a major incident shouldn’t have to worry about getting bogged down by minor incidents, day-to-day chat, or other incidents that aren’t as high on their priority list. Make sure each major incident has its own dedicated room.
Bring actions into the chat
With a combination like Slack and OpsGenie, incident management chat can be turned into more than just a communication channel. You can enable text commands or buttons directly in chat that execute incident actions such as assigning alerts, taking ownership, adding notes, muting incidents, or even creating new alerts.
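The sketch below shows one way text commands in chat could map to incident actions. The "/incident" command syntax, the action names, and the in-memory alerts dictionary are hypothetical and are not the actual Slack or OpsGenie commands; in practice the handler would call your alerting tool rather than update a dictionary.

```python
def parse_command(message: str):
    """Split '/incident <action> <alert_id> [note...]' into its parts, or return None."""
    parts = message.strip().split(maxsplit=3)
    if len(parts) < 3 or parts[0] != "/incident":
        return None
    action, alert_id = parts[1], parts[2]
    note = parts[3] if len(parts) > 3 else ""
    return action, alert_id, note


def handle_message(message: str, user: str, alerts: dict) -> str:
    """Apply a chat command to the alerts store and return the reply for the room."""
    parsed = parse_command(message)
    if parsed is None:
        return "Not an incident command."
    action, alert_id, note = parsed
    alert = alerts.get(alert_id)
    if alert is None:
        return f"Unknown alert {alert_id}."
    if action == "own":
        alert["owner"] = user
        return f"{user} took ownership of {alert_id}."
    if action == "note":
        alert.setdefault("notes", []).append(note)
        return f"Note added to {alert_id}."
    if action == "mute":
        alert["muted"] = True
        return f"{alert_id} muted."
    return f"Unknown action {action}."
```

Because every command and its reply appear in the room, the actions taken become part of the same record as the conversation.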
Invite multiple teams
From DevOps and IT to communications leads and social media managers to security and legal, there are often multiple teams and roles that need to be in the loop on an incident in real time. Figure out who these teams and roles are ahead of time and bring them into your chat early.
Make sure your chat is secure and that only the people you want taking action have permission to do so.
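One simple way to express this is a mapping from roles to the actions each role may trigger, checked before any chat command runs. The role names and action names below are assumptions for illustration only; most chat and alerting tools provide their own permission settings, which should be the first choice.

```python
# Hypothetical mapping of chat roles to the incident actions they may trigger.
ALLOWED_ACTIONS = {
    "responder": {"own", "note", "mute"},
    "communications": {"note"},
    "observer": set(),
}


def can_run(role: str, action: str) -> bool:
    """Return True only if the role is known and explicitly allowed to run the action."""
    return action in ALLOWED_ACTIONS.get(role, set())
```

Unknown roles get no permissions by default, so someone newly invited to the room can read along without being able to change anything.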
Save chat transcripts
Once your incident is resolved, it’s time for the postmortem—and ChatOps streamlines the process. A single room where all incident communication happens means that after the incident is over, you have a complete record of all conversations, actions, alerts, and reports—all in one place. You can save this record for future reference, use it to update your incident playbooks, and dig into it during the postmortem to come up with ways to avoid or mitigate the risk of similar incidents in future.
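As a sketch of what saving the record might look like, the example below writes exported chat messages to a JSON file along with a short summary that a postmortem can start from. The message shape (dictionaries with "ts" as an ISO 8601 timestamp, "user", and "text") is an assumption; a real export from your chat tool will have its own format and would need to be mapped into it.

```python
import json
from datetime import datetime


def summarize_transcript(messages):
    """Compute basic postmortem facts from a non-empty list of timestamped messages."""
    ordered = sorted(messages, key=lambda m: datetime.fromisoformat(m["ts"]))
    start = datetime.fromisoformat(ordered[0]["ts"])
    end = datetime.fromisoformat(ordered[-1]["ts"])
    return {
        "first_message": ordered[0]["ts"],
        "last_message": ordered[-1]["ts"],
        "minutes_elapsed": (end - start).total_seconds() / 60,
        "participants": sorted({m["user"] for m in ordered}),
    }


def save_transcript(messages, path):
    """Write the summary and the full message list to a JSON file."""
    record = {"summary": summarize_transcript(messages), "messages": messages}
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(record, handle, indent=2)
```

The elapsed time here is only the span between the first and last message in the room, which is a starting point for answering how long the incident took, not a replacement for the timeline your team agrees on.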