Incident response communications
Good incident response doesn't just mean fixing the problem – it means being transparent with customers, too.
USE THIS PLAY TO...
Spot gaps in your incident communication practices.
Create a better communication plan you can use in future incidents.
If you're struggling on your Health Monitor, running this play might help.
People: 4 - 10
Time: 45 - 60 min
Running the play
Run this play once you've had at least one incident that involved a good amount of customer communication. Run it again after your next incident to see how you've improved.
Whiteboard or butcher's paper
Paint a picture of a recent incident (20 min)
Choose a recent incident that lasted for more than two hours, was painful for your customers, and involved at least some customer communications.
Answer the following questions about the incident:
- What time did the incident start? (You may want to refer back to your monitoring/alerting tools)
- What time did the incident end?
- What was the time frame for the investigation stage of this incident?
- What was the time frame for the escalation/decisions/changes stage of this incident?
- What did customers see in your product that let them know there was a problem? (E.g., when attempting to log in, customers were sent to a 'something went wrong' error page that pointed them to our support portal.) If you have a screenshot of this, have it handy on your laptop to show during the play, or print it out.
- What communications were sent out during/after the incident?
- What medium were the comms sent through? (This could be social media channels, a service desk, a status page, email, carrier pigeons, etc. You get the idea.)
- Who sent each one?
- What time was each communication sent?
Create a timeline (10 min)
Arrive at the session 5-10 minutes early (make sure you book the room accordingly!). Using either a whiteboard or butcher paper, draw an incident timeline that notes the start time of the incident and the end time of the incident (when it was resolved). You can make it as simple or complex as you'd like!
initial alert ------------------------------------------------------------------- resolution
Refer back to the questions you answered above and use sticky notes to fill in the timeline with times and channels used for customer comms during the incident.
If a channel was used more than once, draw a branch off of it and indicate each time it was used with a dot.
Your incident response communications timeline will look something like this.
After the session, share your lessons learned with as many peers, colleagues, and friends as possible. Ask them what they learned, too. It's the gift that keeps on giving!
Set the stage (10 min)
Start the session by describing the incident you chose to focus on. Present the timeline you created and ask participants to walk through the timeline based on what you drew and what they remember from the incident. Fill in any channels or comms that were left out.
Emphasize that this is a safe space where participants can be open about frustrations, confusion, disagreement, etc.
Mind the gap (10 min)
Note the intervals in your communication timeline.
- Time between start of incident/initial alert and first update: x minutes/hours
- Time between first update and second update: x minutes/hours
- Time between second update and third update: x minutes/hours
- Time between third update (or whatever your update before resolution is) and resolution update: x minutes/hours
Highlight the biggest gaps by circling the area(s) with a marker.
We find it's important to communicate early (as soon as a problem that affects customers is detected) and often (provide some kind of update every 20-30 minutes until resolution). For this exercise, treat any interval much longer than 30 minutes as a gap.
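If you already have the timestamps in a log or spreadsheet, the interval arithmetic above can be sketched in a few lines of Python. This is only an illustration, not part of the play: the `find_gaps` function, the 30-minute threshold constant, and the sample incident times are all assumptions made up for the example.

```python
from datetime import datetime, timedelta

# Assumption: the play suggests an update every 20-30 minutes, so anything
# longer than 30 minutes is flagged here. Adjust to suit your own target.
MAX_GAP = timedelta(minutes=30)


def find_gaps(start, updates, resolved, max_gap=MAX_GAP):
    """Return (label, interval, is_gap) for each step along the timeline."""
    points = [("initial alert", start)]
    points += [(f"update {i}", t) for i, t in enumerate(sorted(updates), 1)]
    points.append(("resolution", resolved))

    results = []
    # Pair each point on the timeline with the one that follows it.
    for (prev_label, prev_time), (label, time) in zip(points, points[1:]):
        interval = time - prev_time
        results.append((f"{prev_label} -> {label}", interval, interval > max_gap))
    return results


# Hypothetical incident, invented purely for illustration.
day = "2024-01-15 "
fmt = "%Y-%m-%d %H:%M"
start = datetime.strptime(day + "09:00", fmt)
updates = [datetime.strptime(day + t, fmt) for t in ("09:20", "09:45", "11:30")]
resolved = datetime.strptime(day + "11:50", fmt)

gaps = find_gaps(start, updates, resolved)
for label, interval, is_gap in gaps:
    minutes = int(interval.total_seconds() // 60)
    flag = "  <-- gap" if is_gap else ""
    print(f"{label}: {minutes} min{flag}")
```

In this made-up timeline, only the 105-minute stretch between the second and third updates exceeds the threshold, so that is the area you would circle on the whiteboard.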
Communication assessment (15 min)
Ask participants to write down, on sticky notes, any problematic gaps, confusion, and similar issues with your customer communications during the incident. If things grind to a halt, use the prompts below.
- What path did we send customers down when they reached the error page? Is this a typical flow for customers when there is an error with your product/system?
- How would you have felt as a customer if you encountered a similar issue?
- Did we communicate early and often with customers? (Your gaps from the previous step should answer this.) Why or why not?
- How did we determine who would communicate out to customers?
- Did we communicate to the right customers? How do we know?
- Were the mediums/channels used easily accessible/visible for customers? How did they know where to find them?
- Did the comms we sent out reduce the need for customers to contact our support team?
- What do our customers now know about this incident solely from the comms we sent out during/after the incident?
Root cause analysis (15 min)
Have each participant come up and stick their "issue" stickies next to the timeline. Have them briefly describe what they wrote, then discuss as a group why you think that issue occurred. In other words, what was the root cause of that issue with your incident communication process?
- Issue: We did not update customers for over 4 hours during the middle of our incident.
- Root Cause: We didn't have an update from the dev team during those 4 hours so we didn't know what to say to customers.
Be sure to run a full Health Monitor session or checkpoint with your team to see if you're improving.
Share your action items with your team/organization so everyone is aligned on your incident communication process.
After the next incident occurs, run this play again to see how your new communication process played out in real life. The idea is to continually finesse and improve until you have the most well-oiled comms plan possible.