Incident communication best practices
Keep the right people informed during downtime
Incidents have always been a fact of life for people in IT and Ops. Today, it’s also DevOps and customer support teams getting a crash course in incident communication.
Incident communication is the process of alerting users that a service is experiencing some type of outage or degraded performance. This is especially important for web and software services, where 24/7 availability is expected.
Web scale incident communication is more complex than simply sending a bulk email. There are different audiences to consider. Different thresholds for messaging and response expectations.
Since some downtime is inevitable, it’s best to plan ahead and make sure your team is ready.
This is our guide to incident communication best practices. We’ll cover:
- Why incident communication is important
- How to prep for incident communication
- How incident communication pros handle the task
- Why incident communication doesn’t end after the incident
Incident communication: Who cares?
Your customers care. Your colleagues care. You should care. Poorly handled downtime can be a really bad experience for your customers and your teams, which can affect your bottom line. Some of your customers may worry you have more bad experiences up your sleeves and switch to a competitor. You’ll lose future customers due to lack of trust. Team morale can suffer and lead to lower productivity. And say goodbye to all those juicy word-of-mouth referrals.
Luckily, unplanned downtime doesn’t have to turn into a customer service nightmare. It turns out that if you just keep your customers in the loop by communicating what’s happening and what you’re doing to fix the problem, they’ll understand and have a much less negative reaction to the whole situation.
Prepping for incident communication
Proper preparation prevents poor performance. If it’s a good enough slogan for going into battle, it’s good enough for your incident communication strategy. When you’re in the heat of an incident, you’ll thank yourself that you put time into incident communication.
Define what you consider an incident
Before we can communicate incidents, we need to decide what constitutes an incident. Many web companies rely on a standardized 4-tier severity definition system. Here’s a great guide on severity definitions from our own incident handbook.
Whatever your thresholds are for incident severity, it’s important to make a clear line in the sand (ideally around some sort of measurable metric). If you designate an incident at Sev 1, it’s important for anyone on your team to be able to know exactly what that means.
A severity system is also helpful to eliminate the inherent shades of grey that come with downtime.
No matter what system you settle on, we recommend a zero-tolerance communication plan for any incidents involving security issues or data loss.
Pick your communication solutions, channels, and message templates ahead of time
Professional support teams and site reliability engineers don’t decide on the fly what channels to communicate over. They make a plan ahead of time.
There are five main communication channels for incident communication:
- A dedicated status page
- Embedded status
- Workplace chat tool
- Social media
Dedicated status page
We recommend teams use a dedicated status page as their primary incident communication solution. Whether you build it yourself or go with a hosted solution like Statuspage, it’s important to give your customers and colleagues a clear source of truth during an incident. Statuspage also gives your users an option to subscribe to get updates the moment they’re posted. This takes the support burden off teams who should be heads-down fixing the problem.
At Statuspage, we make it easy to embed status information directly onto any website our customers operate. We know most visitors are likely to check a provider’s home page or support page before looking for a status page. The embedded widget (here's an example) is an easy way of letting those visitors know if an incident is underway. Visitors can also click through on the widget to get to the status page.
Like we just mentioned, a good status page tool will give your audience the option to subscribe to email updates. Even if you’re sending directly from your email tool, as opposed to using a status page to trigger email sends, it’s a good channel for incident communication.
Chat tools like Slack have taken over the workplace in recent years. Many teams set up a dedicated war room for incident communications, or spin up a new room for each incident. Check out our integrations with chat tools here.
Many teams use social channels like Twitter as a means of communication during an incident. It’s good to use this as a piece of your strategy, but not rely on it as your only means of communication.
None of these channels are a silver bullet for incident comms. They all have different strengths, and the real power comes when you layer them together. For example, we post incidents to a status page but also push those updates to Twitter. It’s also embedded inside our web app. These messages then direct the user back to the status page for more details on the incident. We recommend that you identify one as your primary communication vehicle and funnel everyone there from the other channels.
Receiving an SMS message, or text message, is often a more-immediate way to reach someone, and a preference for many people when it comes to critical inbound alerts like a downtime announcement. It’s also a channel where people can be message fatigued very fast and will unsubscribe if they see too many messages that aren’t relevant to them.
Set up templates for incident and outage communication
In the heat of an incident, the last thing you want to worry about is how to wordsmith an incident announcement. Wording the incident the wrong way is a perfect target for non-technical managers who might be looking for any reason to criticize your team’s response process.
Decide on the common language ahead of time, get it approved by your managers, and save it in a template. This makes it easy to plug in the relevant details and fire off an incident the day of.
Here are two of the incident templates we use for our own status page:
- The site is currently experiencing a higher than normal amount of load, and may be causing pages to be slow or unresponsive. We’re investigating the cause and will provide an update as soon as possible.
- Our storage provider for public metrics data is currently experiencing infrastructure issues. Updates will be made available as the situation develops or information is provided to us.
Managing communication like a pro
The lifecycle of an incident will likely include several points of contact. Done well, there’s a familiar three-act structure to an incident: First contact, updates during the incident, resolution and post-mortem.
Part 1: First contact
The initial update is the most important. Everything from what you say, to how and when you say it sets the tone for how your response will be perceived. This is where it really helps to have a template set up ahead of time.
Your goal should be to quickly acknowledge the issue, briefly summarize the known impact, promise further updates and, if you’re able, alleviate any concerns about security or data loss. It's important to acknowledge there's an issue, even if you don't know the exact details yet.
Part 2: Regular updates during the incident
Mid-incident communication is critical.
The SRE teams at Google list Communication Lead as one of the key roles someone should oversee during an incident.
From Google’s book “Site Reliability Engineering” on the role of Communication lead:
This person is the public face of the incident response task force. Their duties most definitely include issuing periodic updates to the incident response team and stakeholders (usually via email), and may extend to tasks such as keeping the incident document accurate and up to date.”
This person will also be in charge of continuing to update the status page or post updates to other channels as the situation evolves. Even an updating saying “We’re still working on the problem, nothing new to report.” is better than saying nothing and leaving your audiences hanging. People left in the dark start to expect the worst.
Part 3: Resolution, post-mortem, what comes next
In 2010, Facebook suffered its largest outage to date. For about 2.5 hours, the social network was unavailable for millions of its then-half-a-billion users.
The timing couldn’t have been worse for the burgeoning tech giant, which was still in the early days of its explosive user growth and still proving to the business world that the service was worth the hype.
When the dust settled, a Facebook engineer posted a 395-word summary to the company’s engineering blog about the incident.
From the blog:
Early today Facebook was down or unreachable for many of you for approximately 2.5 hours. This is the worst outage we’ve had in over four years, and we wanted to first of all apologize for it. We also wanted to provide much more technical detail on what happened and share one big lesson learned.
The outline of the post-mortem is simple:
- Acknowledge the problem, empathize with those affected and apologize
- Explain what went wrong and why
- Explain what was done to fix the incident and what was done to prevent repeat incidents
- Acknowledge, empathize, and apologize once again
There’s no need for flowery language or grandiose claims in communication like this. Keep it simple and direct. For example, from the Facebook blog:
We apologize again for the site outage, and we want you to know that we take the performance and reliability of Facebook very seriously.
Language like this makes it easy for your customers and colleagues to trust that you’re running a level-headed team and keeping your eye on the ball.
The reality of running always-on services is that sometimes, things unexpectedly break. Effectively communicating during downtime can actually build trust with both colleagues and customers. Responding well can make all the difference.