Incidents are a learning opportunity.
A chance to uncover vulnerabilities in your system. An opportunity to mitigate repeat incidents and decrease time to resolution.
An incident postmortem is an excellent framework for learning from incidents and turning problems into progress. It also builds trust with customers, colleagues, and end users (basically the folks affected by the incident) and lets them know your team is working to minimize future incidents and impact.
You never want a serious crisis to go to waste. And what I mean by that is an opportunity to do things that you think you could not do before.
— Rahm Emanuel
Here’s a good definition of incident postmortems from Google’s book on Site Reliability Engineering:
“A postmortem is a written record of an incident, its impact, the actions taken to mitigate or resolve it, the root cause(s), and the follow-up actions to prevent the incident from recurring.”
Here are some of our favorite tips on writing an incident postmortem.
Tip 1: Set a threshold
Incidents in your organization should have clear and measurable severity levels. These severity levels can be used to trigger the postmortem process. For example, any incident at Sev-1 or higher triggers the postmortem process, while a postmortem can be optional for less severe incidents. Consider giving team leads or management the option to request a postmortem for any incident that doesn’t meet the threshold.
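As a rough illustration, the threshold rule above could be encoded as a small check. This is only a sketch: the numeric severity scale (1 is the most severe), the `POSTMORTEM_THRESHOLD` constant, and the `postmortem_required` function are hypothetical names chosen for this example, not part of any particular incident tool.

```python
# Assumption for this sketch: severity is an integer where 1 is the most
# severe level, so "Sev-1 or higher" means severity <= 1.
POSTMORTEM_THRESHOLD = 1


def postmortem_required(severity: int,
                        requested_by_lead: bool = False,
                        threshold: int = POSTMORTEM_THRESHOLD) -> bool:
    """Return True when an incident should go through the postmortem process.

    A postmortem is required when the incident meets the severity threshold,
    or when a team lead or manager has explicitly requested one.
    """
    if severity < 1:
        raise ValueError("severity must be 1 or greater")
    return severity <= threshold or requested_by_lead


if __name__ == "__main__":
    print(postmortem_required(1))                          # meets the threshold
    print(postmortem_required(3))                          # below the threshold
    print(postmortem_required(3, requested_by_lead=True))  # requested anyway
```

Keeping the threshold in one named constant makes it easy to adjust if your team decides that less severe incidents should also trigger a postmortem.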
Tip 2: Don’t procrastinate
It’s important to take a break and get some rest after an incident. But don’t delay writing the incident postmortem. Wait too long and important details might be lost or forgotten. Ideally, draft it immediately after the post-incident review meeting, which should be held within 24 to 48 hours of the incident being resolved and no later than five business days afterward.
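The timing guidance above is simple date arithmetic, and it can help to compute the dates as soon as an incident is resolved. The sketch below is one way to do that. The function names are hypothetical, and it assumes business days are Monday through Friday with no holiday calendar.

```python
from datetime import datetime, timedelta, timezone


def add_business_days(start: datetime, days: int) -> datetime:
    """Advance by whole business days, skipping Saturdays and Sundays.

    Assumption: public holidays are ignored in this sketch.
    """
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday is 0, Friday is 4
            remaining -= 1
    return current


def review_deadlines(resolved_at: datetime) -> dict[str, datetime]:
    """Compute the ideal review window and the latest acceptable date."""
    return {
        "review_window_opens": resolved_at + timedelta(hours=24),
        "review_window_closes": resolved_at + timedelta(hours=48),
        "latest_acceptable": add_business_days(resolved_at, 5),
    }


if __name__ == "__main__":
    # Hypothetical incident resolved on a Thursday afternoon (UTC).
    resolved = datetime(2024, 3, 7, 15, 0, tzinfo=timezone.utc)
    for name, when in review_deadlines(resolved).items():
        print(f"{name}: {when.isoformat()}")
```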
Tip 3: Assign roles and owners
A post-incident review meeting is where you’ll hash out the details that will be recorded in the incident postmortem. It’s good to delegate the postmortem draft to a specific person, ideally someone familiar with the incident.
Tip 4: Work from a template
A template can keep you from leaving out key details. And it’s a great way to build consistency throughout your postmortems. Here’s an example of a template from our friends at PagerDuty.
Tip 5: Include a timeline
Timelines are a very helpful aid in incident documentation. Often they’re the first place your readers’ eyes jump to when trying to quickly size up what happened. Try to be as clear and specific as possible. For example, “11:14 am Pacific Standard Time,” not “around 11.”
Important times to include:
- First alert or ticket
- First comms announcement (internal and/or external)
- Times of status page updates
- Time of any remediation attempts (code rollbacks, etc.)
- Time of resolution
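If you collect timeline entries during the incident, a small structure like the one below can keep them precise and in order. This is a sketch under stated assumptions: `TimelineEntry` and `render_timeline` are hypothetical names, the sample events are invented for illustration, and a fixed UTC-8 offset stands in for Pacific Standard Time.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Assumption: a fixed UTC-8 offset labeled "PST"; no daylight saving handling.
PST = timezone(timedelta(hours=-8), "PST")


@dataclass(frozen=True)
class TimelineEntry:
    at: datetime  # must carry a time zone, so "11:14" is never ambiguous
    kind: str     # e.g. "first alert", "status page update", "resolution"
    note: str


def render_timeline(entries: list[TimelineEntry]) -> list[str]:
    """Return the entries as text lines, sorted by time.

    Raises ValueError if any timestamp lacks a time zone.
    """
    for entry in entries:
        if entry.at.tzinfo is None:
            raise ValueError(f"timestamp for {entry.kind!r} has no time zone")
    ordered = sorted(entries, key=lambda entry: entry.at)
    return [
        f"{entry.at.strftime('%Y-%m-%d %H:%M %Z')}  {entry.kind}: {entry.note}"
        for entry in ordered
    ]


if __name__ == "__main__":
    # Invented sample data, entered out of order on purpose.
    entries = [
        TimelineEntry(datetime(2024, 1, 15, 12, 5, tzinfo=PST),
                      "resolution", "Error rate back to baseline"),
        TimelineEntry(datetime(2024, 1, 15, 11, 14, tzinfo=PST),
                      "first alert", "Monitoring alert fired"),
        TimelineEntry(datetime(2024, 1, 15, 11, 42, tzinfo=PST),
                      "remediation attempt", "Code rollback started"),
    ]
    for line in render_timeline(entries):
        print(line)
```

Rejecting timestamps without a time zone is a deliberate choice here: it forces the “11:14 am Pacific Standard Time” level of precision rather than “around 11.”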
Tip 6: “5 Whys”
The root causes of incidents are often several layers thick. Practice your “5 Whys” skills with our Playbook Play here.
Tip 7: Details, details, details
Skimping on details is a quick path to writing postmortems that are unhelpful and unclear. Add as many details as possible about what happened and what was done during the incident. Instead of “then public comms went out,” say “We sent the initial public comms announcing the incident on our public status page and Twitter account.”
Wherever possible, include names and links: links to HOT tickets and status updates, to incident state documents, and to monitoring charts. Don’t be afraid to add screenshots of relevant graphics or dashboards, too.
Tip 8: Keep it blameless
Many teams, including teams here at Atlassian, have adopted the tenets of the “blameless” postmortem. Blameless postmortems focus on systems and root causes without naming individuals or casting blame onto people or teams.
From Google’s SRE book:
“Blameless postmortems are a tenet of SRE culture. For a postmortem to be truly blameless, it must focus on identifying the contributing causes of the incident without indicting any individual or team for bad or inappropriate behavior.”