Incident Response Communications
Good incident response doesn't just mean fixing the problem – it means being transparent with customers, too.
AND I NEED THIS... WHY?
Keeping your customers in the loop from investigation to resolution is crucial for maintaining their trust and loyalty. But how do you prepare your team to give customers a positive experience?
Few post-incident reviews (PIRs) include analysis of customer communications. This play does not replace your PIR, but supplements it so you gain a broader picture of what happened (from both technical and non-technical teams' POV) and how you can collectively improve for next time.
We created this play to help you put yourself in your customers' shoes to assess the cadence of your current customer communication. Once you have a full picture of incident communication today, you can start to fill any gaps you discover before the next incident strikes. Your customers will thank you.
WHO SHOULD BE INVOLVED?
Major players in the communications process: engineering leads, support, marketing/PR, etc.
4-10 people
Running the play
Run this play once you've had at least one incident that involved a good amount of customer communication. Run it again after your next incident to see how you've improved.
- Whiteboard or butcher's paper
- Sticky notes
Paint a picture of a recent incident (20 min)
Choose a recent incident that lasted for more than two hours, was painful for your customers, and involved at least some customer communications.
Answer the following questions about the incident:
- What time did the incident start? (You may want to refer back to your monitoring/alerting tools)
- What time did the incident end?
- What was the time frame for the investigation stage of this incident?
- What was the time frame for the escalation/decisions/changes stage of this incident?
- What did customers see in your product that let them know there was a problem? (E.g., when attempting to log in, customers were sent to a 'something went wrong' error page that pointed them to our support portal.) If you have a screenshot of this, have it handy on your laptop to show during the play or print it out.
- What communications were sent out during/after the incident?
- What medium were the comms sent through? (This could be social media channels, a service desk, a status page, email, carrier pigeons, etc. You get the idea.)
- Who sent each one?
- What time was each communication sent?
Create a timeline (10 min)
Arrive at the session 5-10 minutes early (make sure you book the room accordingly!). Using either a whiteboard or butcher's paper, draw an incident timeline that notes the start time of the incident and the end time of the incident (when it was resolved). You can make it as simple or complex as you'd like!
initial alert ------------------------------------------------------------------- resolution
Refer back to the questions you answered above and use sticky notes to fill in the timeline with times and channels used for customer comms during the incident.
If a channel was used more than once, draw a branch off of it and indicate each time it was used with a dot.
It's key to have folks from different roles represented (support, dev, etc.) so there are different perspectives coming together in conversation that might not otherwise take place.
Set the stage (10 min)
Start the session by describing the incident you chose to focus on. Present the timeline you created and ask participants to walk through the timeline based on what you drew and what they remember from the incident. Fill in any channels or comms that were left out.
Emphasize that this is a safe space where participants can be open about frustrations, confusion, disagreement, etc.
Mind the gap (10 min)
Note the intervals in your communication timeline.
- Time between start of incident/initial alert and first update: x minutes/hours
- Time between first update and second update: x minutes/hours
- Time between second update and third update: x minutes/hours
- Time between third update (or whatever your update before resolution is) and resolution update: x minutes/hours
Highlight the biggest gaps by circling the area(s) with a marker.
We find it's important to communicate early (as soon as a problem that affects customers is detected) and often (provide some kind of update every 20-30 minutes until resolution). Any interval much longer than that counts as a gap for this exercise.
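If your update times are already in a spreadsheet or a status page export, the interval arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function name `find_gaps`, the 30-minute threshold (the upper end of the cadence mentioned above) and the sample timestamps are all assumptions made up for the example, not part of the play.

```python
from datetime import datetime, timedelta

# Assumed threshold: the upper end of the 20-30 minute cadence.
MAX_GAP = timedelta(minutes=30)


def find_gaps(incident_start, update_times, resolution_update, max_gap=MAX_GAP):
    """Return (earlier, later, interval) for every interval longer than max_gap.

    Intervals are measured between consecutive points on the timeline:
    incident start, each customer update, and the resolution update.
    """
    points = [incident_start] + sorted(update_times) + [resolution_update]
    gaps = []
    for earlier, later in zip(points, points[1:]):
        interval = later - earlier
        if interval > max_gap:
            gaps.append((earlier, later, interval))
    return gaps


# Made-up sample data for a single incident.
start = datetime(2024, 1, 15, 9, 0)
updates = [
    datetime(2024, 1, 15, 9, 20),
    datetime(2024, 1, 15, 9, 45),
    datetime(2024, 1, 15, 11, 30),
]
resolved = datetime(2024, 1, 15, 11, 50)

for earlier, later, interval in find_gaps(start, updates, resolved):
    minutes = int(interval.total_seconds() // 60)
    print(f"{earlier:%H:%M} -> {later:%H:%M}: {minutes} min without an update")
# prints "09:45 -> 11:30: 105 min without an update"
```

The intervals it flags are the ones to circle on the whiteboard. Adjust the threshold to whatever cadence your team has committed to.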
Communication assessment (15 min)
Using sticky notes, ask participants to write down problematic gaps, confusion and similar issues with your customer communications during the incident. If things grind to a halt, use the prompts below.
- What path did we send customers down when they reached the error page? Is this a typical flow for customers when there is an error with our product/system?
- How would you have felt as a customer if you encountered a similar issue?
- Did we communicate early and often with customers? (The gaps you noted in the "Mind the gap" step should answer this.) Why or why not?
- How did we determine who would communicate out to customers?
- Did we communicate to the right customers? How do we know?
- Were the mediums/channels used easily accessible/visible for customers? How did they know where to find them?
- Did the comms we sent out reduce the need for customers to contact our support team?
- What do our customers now know about this incident solely from the comms we sent out during/after the incident?
Root cause analysis (15 min)
Have each participant come up and stick their "issue" stickies next to the timeline. Have them briefly describe what they wrote, then discuss as a group why you think that issue occurred. In other words, what was the root cause of that issue with your incident communication process?
- Issue: We did not update customers for over 4 hours during the middle of our incident.
- Root Cause: We didn't have an update from the dev team during those 4 hours so we didn't know what to say to customers.
Make a plan (10 min)
As a group, choose 2-3 high-priority issues to address based on the root cause analysis you just did. Make an action plan, including owners and deadlines.
| Root cause of gap | Recommendation to fill |
|---|---|
| We didn't know what the problem was yet so we couldn't tell our customers anything meaningful. | Even a vague update like "We are very sorry for any disruption this has caused and we are urgently looking into the problem." or "We are still looking into issue x. We will update you again in 30 minutes." is much better than silence. |
| Nobody took ownership of communicating information to customers. | Clearly define incident communication roles so you are clear on responsibilities next time an incident strikes. |
| We did not know what to say or how to say it. | Write up templates for common incidents or create a tone/style guide for incident updates that your team can reference for ease and simplicity during the chaos of an incident. |
| We do not have an agreed-upon, consistent incident communication plan as a team. | Use this handy template to create an incident communication plan. |
| We do not have a chain of command for who handles incident comms like we do for who handles the incident fix. | Use this handy template to determine incident comms roles, responsibilities, and escalations. |
Be sure to run a full Health Monitor session or checkpoint with your team to see if you're improving.
Share your action items with your team/organization so everyone is aligned on your incident communication process.
After the next incident occurs, run this play again to see how your new communication process played out in real life. The idea is to continually finesse and improve until you have the most well-oiled comms plan possible.