Identify what your team values most during incident response and create a plan to live those values consistently.
AND I NEED THIS... WHY?
It's a fact of life: stuff breaks and some amount of downtime is inevitable for even the most reliable products. It's how you handle incidents and downtime that really matters.
So how do you ensure everyone on your team is aligned on the incident response process? Even the most comprehensive incident response plan lacks instructions for more subjective, nuanced situations that can and will arise when $#*! hits the fan.
We think it's best when teams are armed with the right knowledge and guidance. And we've found that the best form of guidance is a set of mutually agreed upon values. A set of values aligns everyone’s behavior during all phases of an incident: identification, resolution, and retrospective.
So... what will your team value when it matters most?
WHO SHOULD BE INVOLVED?
Representatives from each department involved in incident response: engineering, support, marketing/PR, etc.
Group size: 5-10 people
Running the play
Your team can run this play every quarter or twice a year to track progress on living out your ideal incident values.
- Whiteboard or butcher's paper
- Sticky notes
Prep (30 min)
Book a room 30-60 minutes early. Write these 5 basic incident values on a whiteboard or butcher paper.
Draw a long line below each value to act as a sliding scale. Mark each line with 6 evenly spaced notches, numbered 0-5.
We know there is a problem before our customers do.
Escalate, escalate, escalate (and communicate with customers).
$#!% happens, clean it up quickly.
Always blameless.
Never have the same incident twice.
(optional) Copy this Trello template and invite play participants to join the board so they can reference more information about each incident value, take notes during the play, and record goals/action items at the end.
Note that these are the values that Atlassian incident management teams have converged on. Feel free to copy them verbatim. Or, personalize them for your team/organization. This exercise is all about deciding what your team values.
It's key to have folks from different roles represented (support, dev, etc.) so there are different perspectives coming together in conversation that might not otherwise take place.
Set the stage (5 min)
Welcome everyone and establish the rules of engagement:
- Embrace a positive spirit of continuous improvement and share whatever you think will help the team improve.
- Don't make it personal, don't take it personally.
- Listen with an open mind, and remember that everyone's experience is valid (even those you don't share).
Introduce the basic incident values (see above) and describe each one you have written up during prep on your whiteboard/butcher paper.
If these incident values don't quite fit your team, take a few minutes now to add, remove, or adjust them.
Team Value Ratings (10 min)
Hand out 5 sticky notes and a marker to each participant. Give everyone 2-5 minutes to individually rate how well the team lives each value (0 = "We are far from living this value" ... 5 = "We nail this value"). Have them write 1 value and its rating per sticky note.
Once everyone is finished, have participants come up and place their ratings on the sliding scales.
Watch out for discussions that are dominated by one or two people. This is a sign you may need a stronger facilitator. Find an opportunity to step in and ask what one of your quieter teammates has to say on the topic.
Talk it out (25 min)
The real meat of this play comes from discussing why people rated the team on each value the way that they did. Guide the discussion using the questions below. Feel free to add other questions, too.
- Which values have the most consensus when it comes to ratings? Why do you think that is?
- Which values have the most discrepancy when it comes to ratings? Why do you think that is?
- Any major outliers? Give the people who placed outliers an opportunity to explain the thought behind it.
- What values do you need to work on?
Write down any issues you uncover (e.g., "We never detect incidents before our customers do" or "We usually blame someone during our post-incident reviews") on the whiteboard or on a poster board next to your sliders.
Take a picture of your sliding scales so you can remember where you ranked during this exercise and compare if you run this play again in the future.
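If you'd like something easier to compare than a photo the next time you run this play, you could also type the ratings into a short script. Here's a minimal sketch in Python that looks for consensus, discrepancy, and outliers. The ratings are made up, and the outlier rule (2 or more points from the average) is just an assumption, so adjust both to suit your team.

```python
from statistics import mean, pstdev

# Hypothetical ratings (0-5) copied from the sticky notes, one list per value.
ratings = {
    "We know there is a problem before our customers do": [1, 2, 2, 1, 5],
    "Escalate, escalate, escalate": [4, 4, 3, 4, 4],
    "Never have the same incident twice": [2, 3, 3, 2, 3],
}


def summarize(scores):
    """Return the average, spread, and outliers for one value's ratings."""
    avg = mean(scores)
    spread = pstdev(scores)  # population standard deviation; 0 means full agreement
    # Assumption: a rating 2 or more points from the average counts as an outlier.
    outliers = [s for s in scores if abs(s - avg) >= 2]
    return {"average": round(avg, 2), "spread": round(spread, 2), "outliers": outliers}


summary = {value: summarize(scores) for value, scores in ratings.items()}

# Lowest spread = most consensus; highest spread = most discrepancy.
most_consensus = min(summary, key=lambda v: summary[v]["spread"])
most_discrepancy = max(summary, key=lambda v: summary[v]["spread"])

for value, stats in summary.items():
    print(f"{value}: {stats}")
print("Most consensus:", most_consensus)
print("Most discrepancy:", most_discrepancy)
```

Save the numbers alongside your photo so you can line them up against the next run.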
Boost your values (10 min)
Referring back to the list of issues you uncovered, discuss how you can better live your values. Use questions like these to guide the discussion:
- How can we improve our monitoring and alerting systems so we find out about incidents before our customers do?
- What tools or processes are we lacking right now?
- What is not working about our current escalation process?
- How can we share the burden of middle-of-the-night alerts and escalation work?
- How do we train new members of the team on our escalation process?
- How are we keeping our customers in the loop during an incident?
- Who is ultimately responsible for incident recovery?
- How long does it usually take for us to resolve incidents?
- How do we determine if an incident requires a post-mortem or post-incident review?
- If we were customers of our own service, would we be satisfied with the level of detail we give out about incidents?
- Who is responsible for mitigation post-incident?
- How do we hold each other accountable for making sure the same bug doesn't bite twice?
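For the question about how long incidents usually take to resolve, a rough number beats a gut feeling. Here's a minimal sketch in Python, assuming you can export detection and resolution timestamps from your incident tracker. The incident data and the timestamp format below are hypothetical placeholders.

```python
from datetime import datetime
from statistics import median

# Hypothetical incident log: (detected, resolved) timestamps. Replace with your own data.
incidents = [
    ("2024-01-03 09:15", "2024-01-03 10:00"),
    ("2024-02-11 02:30", "2024-02-11 06:30"),
    ("2024-03-20 14:05", "2024-03-20 14:35"),
]

# Assumed timestamp format; change it to match your export.
FMT = "%Y-%m-%d %H:%M"


def minutes_to_resolve(detected, resolved):
    """Minutes between detection and resolution for one incident."""
    delta = datetime.strptime(resolved, FMT) - datetime.strptime(detected, FMT)
    return delta.total_seconds() / 60


durations = [minutes_to_resolve(d, r) for d, r in incidents]

print("Minutes to resolve:", durations)
print("Median:", median(durations))
print("Longest:", max(durations))
```

The median is used here because a single marathon incident can skew an average. Bring both the median and the longest incident to the discussion.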
Action plan (10 min)
Decide as a group on 2-3 changes that will help you live the incident values consistently. Record these on your Trello board or whiteboard and bake the action items into your team to-dos/goals for the month. Start building those muscles before the next incident occurs.
Action items could include things like...
- Name an incident commander so we have a point person to go to when the next incident strikes.
- Create guidelines around when we write a post-mortem and who we share it with.
- Look into consolidating the channels we are using for incident communication.
- Start running post-incident review meetings after each incident that occurs.
Be sure to run a full Health Monitor session or checkpoint with your team to see if you're improving.
LEADERSHIP OR PROJECT TEAMS
Congruent values are important for all teams in an organization. Leadership and project teams can run this play by adjusting the values to reflect the work and culture they aspire to. Tweak the value names and descriptions to make them relevant for any team at any org.
Share your results and action plan with your broader team/company to stay accountable.
Re-run this play in a few months to see how you did with achieving the goals in your action plan.