Incident management for high-velocity teams
Public vs. Private incident postmortems
Knowing the right time to share a public post-incident explanation
There was a time when almost every IT incident was confined within the four walls of the organization where it happened. But today, with web services and cloud infrastructure, that’s rarely the case. Technology incidents are a true “one-to-many” problem, and that has led to a big shift in the way teams respond to, learn from, and communicate about incidents.
Consider the incident postmortem (also often called the “post-incident review” or “PIR”).
An incident postmortem brings people together to discuss the details of an incident: why it happened, its impact, what actions were taken to mitigate it and resolve it, and what should be done to prevent it from happening again.
An incident postmortem has two distinct parts: the meeting where the incident is discussed, and the corresponding postmortem report created as an output of that meeting.
The word “postmortem” is often used for either part interchangeably. People might mean the meeting, the report, or both when they use the term.
Partners, customers, and end-users may also want to know what happened and what steps you have taken to improve their experience. Making your incident postmortem available on your public-facing website may not be appropriate in all cases, but your marketing or public relations team can help you craft language that is informative and builds trust in your services.
When to do an incident postmortem
At Atlassian, we always conduct internal postmortems for severity 1 and 2 (“major”) incidents. For minor incidents, they’re optional. We encourage people to use the postmortem process for any situation where it would be useful.
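The rule above can be written down as a small check, for example in tooling that opens postmortem tickets. This is only a minimal sketch: the `Severity` enum and the `postmortem_required` function are hypothetical names, and treating severity 3 as the “minor” level is an assumption made for illustration.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Hypothetical severity scale; lower numbers are more severe."""
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3  # assumed here to represent a "minor" incident


def postmortem_required(severity: Severity) -> bool:
    """Return True for major incidents (severity 1 and 2).

    For anything less severe the postmortem is optional, so the
    decision is left to the team.
    """
    return severity <= Severity.SEV2
```

A caller would treat a `False` result as “optional,” not “forbidden,” since the postmortem process is encouraged wherever it would be useful.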
Who completes the postmortem?
Usually, the team that delivers the service that caused the incident is responsible for completing the associated postmortem. The team nominates one person to be accountable for completing the postmortem, and the issue is assigned to them. This person is the “postmortem owner,” and they drive the postmortem through drafting and approval, all the way until it’s published. Infrastructure and platform-level incidents often impact a cross-section of the company, making their postmortems more complicated and effort-intensive. For this reason, we sometimes assign a dedicated program manager to own infrastructure or platform-level postmortems. Program managers are better suited to working across groups, and they are able to commit the requisite level of effort.
Sharing an internal postmortem report
Once the postmortem is approved, we multiply its value by sharing what we learned with the whole company. To accomplish this, at Atlassian we have an automation action that creates a draft blog post in Confluence when the postmortem ticket is approved.
Creating a public incident postmortem report
While it’s less common than an internal report, it’s often a good idea to release a public version of a postmortem after an incident.
Public postmortems are most often published by large-scale consumer services whose outages impact a lot of users. More often than not, these teams publish a trimmed-down version rather than the full internal report. It’s important to remove any private or sensitive information before publishing.
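Part of that cleanup can be automated as a first pass before a human review. The sketch below is illustrative only: the `REDACTION_PATTERNS` list and the `redact` function are hypothetical, and the two patterns (email addresses and IPv4 addresses) are assumptions about what a team might consider sensitive. A real report still needs a manual read-through, because pattern matching will not catch customer names, internal hostnames, or other context-specific details.

```python
import re

# Hypothetical patterns for details a team might not want to publish.
# Each entry pairs a compiled regex with the placeholder that replaces it.
REDACTION_PATTERNS = [
    (re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"), "[email removed]"),
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "[ip removed]"),
]


def redact(text: str) -> str:
    """Replace every match of each pattern with its placeholder."""
    for pattern, placeholder in REDACTION_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


draft = "Paged alice@example.com after 10.0.0.12 stopped responding."
public_draft = redact(draft)
```

Keeping the patterns in a single list makes it easy for a team to add its own rules, such as ticket keys or internal domain names.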
Sharing a public incident postmortem report
It can be tricky to choose the right channel for publishing a public postmortem. For some teams, the company blog or website is the right place. Other teams have a separate engineering blog where a postmortem would fit.