Atlassian Incident Handbook

Incident postmortems

We practice blameless postmortems at Atlassian to ensure we understand and remediate the root cause of every incident with a severity of level 2 or higher. Here's a summarized version of our internal documentation describing how we run postmortems at Atlassian.

What is a postmortem?

A postmortem is a written record of an incident that describes:

  • The incident's impact.

  • The actions taken to mitigate or resolve the incident.

  • The incident's root cause.

  • Follow-up actions taken to prevent the incident from happening again.

At Atlassian, we track all postmortems with Jira issues to ensure they are completed and approved. You may decide to use a simpler system, like a Confluence page for each postmortem, if your needs are less complex.

Why do we do postmortems?

The goals of a postmortem are to understand all contributing root causes, document the incident for future reference and pattern discovery, and enact effective preventative actions to reduce the likelihood or impact of recurrence.

For postmortems to be effective at reducing repeat incidents, the review process has to incentivize teams to identify root causes and fix them. The exact method depends on your team culture; at Atlassian, we've found a combination of methods that works for our incident response teams:

  • Face-to-face meetings help drive appropriate analysis and align the team on what needs fixing.

  • Postmortem approvals by delivery and operations team managers incentivize teams to do them thoroughly.

  • Designated "priority actions" have an agreed Service Level Objective (SLO) which is either 4 or 8 weeks, depending on the service, with reminders and reports to ensure they are completed.
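
To make the SLO bullet concrete, here is a minimal sketch of how a reminder job might compute due dates for priority actions. It is an illustration only, not Atlassian's tooling: the service names, the `SLO_WEEKS` mapping, and the assumption that the SLO clock starts when the action issue is created are all hypothetical.

```python
from datetime import date, timedelta

# Hypothetical mapping of service name to its priority-action SLO in weeks.
# The handbook only states that the SLO is 4 or 8 weeks, depending on the service.
SLO_WEEKS = {"service-a": 4, "service-b": 8}


def action_due_date(service: str, created_on: date) -> date:
    """Return the date by which a priority action should be resolved.

    Assumption: the SLO window starts on the day the action is created.
    """
    return created_on + timedelta(weeks=SLO_WEEKS[service])


def is_overdue(service: str, created_on: date, today: date) -> bool:
    """True once the SLO window for a priority action has passed."""
    return today > action_due_date(service, created_on)
```

A reminder or report could call `is_overdue` for every open priority action and list the ones that return `True`.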

Attending to this process and making sure it is effective requires commitment at all levels in the organization. Our engineering directors and managers decided on the approvers and SLOs for action resolution in their areas. This system just encodes and tries to enforce their decisions.

When is a postmortem needed?

We carry out postmortems for severity 1 and 2 incidents. Otherwise, they're optional.

During or shortly after resolving the issue, the postmortem owner creates a new postmortem issue.

Who completes the postmortem?

The delivery team for the service that failed (the team that owns the "Faulty Service" on the incident issue) is responsible for completing the postmortem. That team selects the postmortem owner and assigns them the postmortem issue.

  • The postmortem owner drives the postmortem through drafting and approval, all the way until it's published. They are accountable for completion of the postmortem. 

  • One or more postmortem approvers review and approve the postmortem, and are expected to prioritize follow-up actions in their backlog.

We have a Confluence page which lists the postmortem approvers (mandatory and optional) by service group, which generally corresponds to an Atlassian product (e.g. Bitbucket Cloud).

How are postmortem actions tracked?

For every action that comes out of the postmortem, we:

  • Raise a Jira issue in the backlog of the team that owns it. All postmortem actions must be tracked in Jira.

  • Link them from the postmortem issue as "Priority Action" (for root cause fixes) or "Improvement Action" (for non-root-cause improvements).

We built some custom reporting using the Jira REST APIs to track how many incidents of each severity have not had their root causes fixed via the priority actions on the postmortem. The engineering managers for each department review this list regularly.
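
The counting logic behind such a report can be sketched as follows. This is not the actual Atlassian report: the record layout, the `"Done"` status name, and the rule that an incident with no priority actions counts as unfixed are assumptions made for illustration, and fetching the issues from Jira is left out.

```python
from collections import Counter


def unfixed_incidents_by_severity(postmortems):
    """Count incidents per severity whose root cause is not yet fixed.

    Each record is a hypothetical, simplified view of one postmortem: its
    severity and the status of every linked "Priority Action" issue.
    An incident counts as unfixed when it has no priority actions at all,
    or when at least one priority action is not in the "Done" status.
    """
    counts = Counter()
    for pm in postmortems:
        statuses = pm["priority_action_statuses"]
        if not statuses or any(status != "Done" for status in statuses):
            counts[pm["severity"]] += 1
    return dict(counts)


# Illustrative data only; the keys and statuses are made up.
example = [
    {"key": "PM-1", "severity": 1, "priority_action_statuses": ["Done", "In Progress"]},
    {"key": "PM-2", "severity": 2, "priority_action_statuses": ["Done"]},
    {"key": "PM-3", "severity": 2, "priority_action_statuses": []},
]
print(unfixed_incidents_by_severity(example))
```

Grouping by severity keeps the output short enough for a regular management review, which is the use described above.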

Postmortem process

Running the postmortem process includes creating a postmortem issue, running a postmortem meeting, capturing actions, getting approval and (optionally) communicating the outcome.

The postmortem owner is responsible for running through these tasks:

  1. Create a postmortem and link it to the incident.

  2. Edit the postmortem issue, read the field descriptions and complete the fields.

  3. To determine the root cause of the incident, use the "Five Whys" technique to traverse the causal chain until you find a true root cause.

  4. Schedule the postmortem meeting. Invite the delivery team, impacted teams and stakeholders, using the meeting invitation template.

  5. Meet with the team and run through the meeting schedule below.

  6. Follow up with the responsible dev managers to get the commitment to specific actions that will prevent this class of incident.

  7. Raise a Jira issue for each action in the backlogs of the team(s) that own them. Link them from the postmortem issue as "Priority Action" (for root cause fixes) or "Improvement Action" (for other improvements).

  8. Look up the appropriate approvers in Confluence and add them to the "Approvers" field on the postmortem. 

  9. Select the "Request Approval" transition to request approval from the nominated approvers. Automation will comment on the issue with instructions for approvers. 

  10. Follow up as needed until the postmortem is approved.

  11. When the postmortem is approved, automation creates a draft postmortem blog in Confluence for you to edit and publish. Blogging postmortems shares your hard-earned lessons, which multiplies their value.

Once the postmortem process is done, the actions are prioritized by the development team as part of their normal backlog according to the team's SLO.

Postmortem meetings

We find that gathering the team to discuss learnings together results in deeper analysis into root causes. This is often over video conference due to our distributed teams, and sometimes done in groups where incidents involve large groups of people.

Our suggested agenda:

  1. Remind the team that postmortems are blameless, and why

  2. Confirm the timeline of events

  3. Confirm the root causes

  4. Generate actions using "open thinking" - "What could we do to prevent this class of incident in the future?"

  5. Ask the team "What went well / What could have gone better / Where did we get lucky"

Suggested calendar booking template:

Please join me for a blameless postmortem of <link to incident>, where we <summary of incident>.

The goals of a postmortem are to understand all contributing root causes, document the incident for future reference and pattern discovery, and enact effective preventative actions to reduce the likelihood or impact of recurrence.

In this meeting we'll seek to determine the root causes and decide on actions to mitigate them. 

If you don't have the responsible dev managers in the room, avoid committing to specific actions in the meeting, because it's a poor context for prioritization decisions: people will feel pressured to commit without having complete information. Instead, follow up with the responsible managers after the meeting to get commitment to fix the priority actions identified.

Postmortem issue fields

Our postmortem issue has an extensive series of fields to encourage collecting all the important details about the incident before holding the postmortem meeting. Below are some examples of how we fill out these fields.

Field

Instructions

Example

Incident summary

Summarize the incident in a few sentences. Include what the severity was, why, and how long impact lasted.

Between <time range of incident, e.g. 14:30 and 15:00> on <date>, <number> customers experienced <event symptoms>. The event was triggered by a deployment at <time of deployment or change that caused the incident>. The deployment contained a code change for <description of or reason for the change>. The bug in this deployment caused <description of the problem>.

The event was detected by <system>. We mitigated the event by <resolution actions taken>.

This <severity level> incident affected X% of customers.

<Number of support tickets and/or social media posts> were raised in relation to this incident. 

Leadup

Describe the circumstances that led to this incident, for example, prior changes that introduced latent bugs.

At <time> on <date> (<amount of time before customer impact>), a change was introduced to <product or service> to ... <description of the changes that led to the incident>. The change caused ... <description of the impact of the changes>.

Fault

Describe what didn't work as expected. Attach screenshots of relevant graphs or data showing the fault.

<Number> responses were incorrectly sent to X% of requests over the course of <time period>.

Impact

Describe what internal and external customers saw during the incident. Include how many support cases were raised.

For <length of time> between <time range> on <date>, <incident summary> was experienced.

This affected <number> customers (X% of all <system or service> customers), who encountered <description of symptoms experienced by customers>.

<Number of support tickets and social media posts> were raised.

Detection

How and when did Atlassian detect the incident?

How could time to detection be improved? As a thought exercise, how would you have cut the time in half?

The incident was detected when the <type of alert> was triggered and <team or person paged> were paged. They then had to page <secondary response person or team> because they didn't own the service writing to the disk, delaying the response by <length of time>.

<Description of the improvement> will be set up by <team owning the improvement> so that <impact of improvement>

Response

Who responded, when and how? Were there any delays or barriers to our response?

After being paged at 14:34 UTC, the on-call KITT engineer came online in the incident chat room at 14:38. However, the on-call engineer did not have sufficient background on the Escalator autoscaler, so a further alert was sent at 14:50 and brought a senior KITT engineer into the room at 14:58.

Recovery

Describe how and when service was restored. How did you reach the point where you knew how to mitigate the impact?

Additional questions to ask, depending on the scenario: How could time to mitigation be improved? As a thought exercise, how would you have cut the time in half?

Recovery was a three-pronged response:

  • Increasing the size of the BuildEng EC2 ASG to increase the number of nodes available to service the workload and reduce the likelihood of scheduling on oversubscribed nodes.

  • Disabling the Escalator autoscaler to prevent the cluster from aggressively scaling down.

  • Reverting the Build Engineering scheduler to the previous version.

Timeline

Provide a detailed incident timeline, in chronological order, timestamped with timezone(s). 

Include any lead-up; start of impact; detection time; escalations, decisions, and changes; and end of impact.

All times are UTC.

11:48 - K8S 1.9 upgrade of control plane finished 
12:46 - Goliath upgrade to V1.9 completed, including cluster-autoscaler and the BuildEng scheduler instance 
14:20 - Build Engineering reports a problem to the KITT Disturbed
14:27 - KITT Disturbed starts investigating failures of a specific EC2 instance (ip-203-153-8-204) 
14:42 - KITT Disturbed cordons the specific node 
14:49 - BuildEng reports the problem as affecting more than just one node. 86 instances of the problem show failures are more systemic 
15:00 - KITT Disturbed suggests switching to the standard scheduler 
15:34 - BuildEng reports 300 pods failed 
16:00 - BuildEng kills all failed builds with OutOfCpu reports 
16:13 - BuildEng reports the failures are consistently recurring with new builds and were not just transient. 
16:30 - KITT recognizes the failures as an incident and starts running it as one.
16:36 - KITT disables the Escalator autoscaler to stop it from removing compute, which alleviates the problem.
16:40 - KITT confirms ASG is stable, cluster load is normal and customer impact resolved.

Five whys

Use the root cause identification technique.

Start with the impact and ask why it happened and why it had the impact it did. Continue asking why until you arrive at the root cause.

Document your "whys" as a list here or in a diagram attached to the issue.

  1. The service went down because the database was locked

  2. Because there were too many database writes

  3. Because a change was made to the service and the increase was not expected

  4. Because we don't have a development process set up for when we should load test changes

  5. We've never done load testing and are hitting new levels of scale

Root cause

What was the root cause? This is the thing that needs to change in order to stop this class of incident from recurring.

A bug in <cause of bug or service where it occurred> connection pool handling led to leaked connections under failure conditions, combined with lack of visibility into connection state.

Backlog check

Is there anything on your backlog that would have prevented this or greatly reduced its impact? If so, why wasn't it done?

An honest assessment here helps clarify past decisions around priority and risk.

Not specifically. Improvements to flow typing were known ongoing tasks that had rituals in place (e.g. add flow types when you change/create a file). Tickets for fixing up integration tests have been made but haven't been successful when attempted

Recurrence

Has this incident (with the same root cause) occurred before? If so, why did it happen again?

This same root cause resulted in incidents HOT-13432, HOT-14932 and HOT-19452.

Lessons learned

What have we learned?

Discuss what went well, what could have gone better, and where we got lucky, to find improvement opportunities.

  1. Need a unit test to verify the rate-limiter for work has been properly maintained

  2. Bulk operation workloads which are atypical of normal operation should be reviewed

  3. Bulk ops should start slowly and be monitored, increasing when service metrics appear nominal

Corrective actions

What are we going to do to make sure this class of incident doesn't happen again? Who will take the actions and by when? 

Create "Priority action" issue links to issues tracking each action. 

  1. Manual auto-scaling rate limit put in place temporarily to limit failures

  2. Unit test and re-introduction of job rate limiting

  3. Introduction of a secondary mechanism to collect distributed rate information across cluster to guide scaling effects

  4. Large migrations need to be coordinated since AWS ES does not autoscale.

  5. Verify Stride search is still classified as Tier-2

  6. File a ticket against pf-directory-service to partially fail instead of full-fail when the xpsearch-chat-searcher fails.

  7. Cloudwatch alert to identify a high IO problem on the ElasticSearch cluster
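
The Detection and Recovery fields above ask how long detection and mitigation took. With a timeline in the format of the Timeline example, those durations can be computed rather than estimated. The sketch below uses three timestamps from that example; which entries count as the first report, the incident declaration, and the end of impact is our reading of the timeline, and the helper name is hypothetical.

```python
from datetime import datetime


def minutes_between(start: str, end: str) -> int:
    """Minutes between two HH:MM timestamps on the same day and in the same timezone."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Milestones copied from the example timeline (all times UTC).
# The choice of milestones is an assumption made for this illustration.
timeline = {
    "first_report": "14:20",
    "declared_incident": "16:30",
    "impact_resolved": "16:40",
}

time_to_declare = minutes_between(timeline["first_report"], timeline["declared_incident"])
time_to_resolve = minutes_between(timeline["first_report"], timeline["impact_resolved"])
print(time_to_declare, time_to_resolve)
```

Having the numbers in hand makes the "how would you have cut the time in half?" thought exercise more concrete.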

 

Proximate and root causes

When you're writing or reading a postmortem, it's necessary to distinguish between the proximate and root causes.

  • Proximate causes are reasons that directly led to this incident.

  • Root causes are reasons at the optimal place in the chain of events where making a change will prevent this entire class of incident.

A postmortem seeks to discover root causes and decide how to best mitigate them. Finding that optimal place in the chain of events is the real art of a postmortem. Use a technique like Five Whys to go "up the chain" and find root causes. 

Here are a few select examples of proximate and root causes:

| Scenario | Proximate cause & action | Root cause | Root cause mitigation |
| --- | --- | --- | --- |
| Stride "Red Dawn" squad's services did not have Datadog monitors and on-call alerts for their services, or they were not properly configured. | Team members did not configure monitoring and alerting for new services. Configure it for these services. | There is no process for standing up new services, which includes monitoring and alerting. | Create a process for standing up new services and teach the team to follow it. |
| Stride unusable on IE11 due to an upgrade to Fabric Editor that doesn't work on this browser version. | An upgrade of a dependency. Revert the upgrade. | Lack of cross-browser compatibility testing. | Automate continuous cross-browser compatibility testing. |
| Logs from Micros EU were not reaching the logging service. | The role provided to micros to send logs with was incorrect. Correct the role. | We can't tell when logging from an environment isn't working. | Add monitoring and alerting on missing logs for any environment. |
| Triggered by an earlier AWS incident, Confluence Vertigo nodes exhausted their connection pool to Media, leading to intermittent attachment and media errors for customers. | AWS fault. Get the AWS postmortem. | A bug in Confluence connection pool handling led to leaked connections under failure conditions, combined with lack of visibility into connection state. | Fix the bug & add monitoring that will detect similar future situations before they have an impact. |

 

Root cause categories and their actions

We use these categories to group root causes and discuss the appropriate actions for each.  

| Category | Definition | What should you do about it? |
| --- | --- | --- |
| Bug | A change to code made by Atlassian (this is a specific type of change) | Test. Canary. Do incremental rollouts and watch them. Use feature flags. Talk to your quality engineer. |
| Change | A change made by Atlassian (other than to code) | Improve the way you make changes, for example, your change reviews or change management processes. Everything next to "bug" also applies here. |
| Scale | Failure to scale (eg blind to resource constraints, or lack of capacity planning) | What are your service's resource constraints? Are they monitored and alerted? If you don't have a capacity plan, make one. If you do have one, what new constraint do you need to factor in? |
| Architecture | Design misalignment with operational conditions | Review your design. Do you need to change platforms? |
| Dependency | Third party (non-Atlassian) service fault | Are you managing the risk of third party fault? Have we made the business decision to accept a risk, or do we need to build mitigations? See "Root causes with dependencies" below. |
| Unknown | Indeterminable (action is to increase the ability to diagnose) | Improve your system's observability by adding logging, monitoring, debugging, and similar things. |

 

Root causes with dependencies

When your service has an incident because a dependency fails, where the fault lies and what the root cause is depend on whether the dependency is internal to Atlassian or 3rd party, and on what the reasonable expectation of the dependency's performance is.

If it's an internal dependency, ask "what is the dependency's Service Level Objective (SLO)?":

  • Did the dependency breach their SLO? 
    • The fault lies with the dependency and they need to increase their reliability.

  • Did the dependency stay within their SLO, but your service failed anyway? 
    • Your service needs to increase its resilience.

  • Does the dependency not have an SLO?
    • They need one!

If it's a 3rd party dependency, ask "what is our reasonable expectation* of the 3rd party dependency's availability/latency/etc?"

  • Did the 3rd party dependency exceed our expectation (in a bad way)?

    • Our expectation was incorrect. 

      • Are we confident it won't happen again? E.g. We review and agree with their RCA. In this case, the action is their RCA.

      • Or, do we need to adjust our expectations? In this case, the actions are to increase our resilience and adjust our expectations.

      • Are our adjusted expectations unacceptable? In this case, we need to resolve the disconnect between requirements and solution somehow, eg find another supplier.

  • Did the 3rd party dependency stay within our expectation, but your service failed anyway? 

    • In this case, your service needs to increase its resilience.

  • Do we not really have an expectation?
    • The owner of the 3rd party dependency needs to establish this, and share it with teams so they know what level of resilience they need to build into their dependent services.

*Why "expectation"? Don't we have SLAs with 3rd parties? In reality, contractual SLAs with 3rd parties are too low to be useful in determining fault and mitigation. For example, AWS publishes almost no SLA for EC2. Therefore, when we're depending on a 3rd party service, we have to make a decision about what level of reliability, availability, performance, or another key metric we reasonably expect them to deliver. 
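
The decision tree above can be condensed into a small function. This is a simplified sketch under our own naming: `dependency_followup` and its parameters are hypothetical, and the 3rd party "exceeded our expectation" branch is collapsed into a single prompt even though the text above splits it into three follow-up questions.

```python
def dependency_followup(internal: bool, has_target: bool, breached: bool = False) -> str:
    """Suggest where the follow-up action belongs after a dependency failure.

    `has_target` means an SLO exists (internal dependency) or an agreed
    expectation exists (3rd party dependency); `breached` means the
    dependency missed that target during the incident.
    """
    if not has_target:
        if internal:
            return "dependency needs an SLO"
        return "dependency owner must establish and share an expectation"
    if not breached:
        # The dependency behaved as promised, so the weakness is on our side.
        return "your service needs to increase its resilience"
    if internal:
        return "dependency needs to increase its reliability"
    return "review the 3rd party RCA, or adjust expectations and increase resilience"


print(dependency_followup(internal=True, has_target=True, breached=False))
```

Writing the tree out this way shows that only two facts drive the outcome: whether a target exists, and whether the dependency met it.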

Postmortem actions

Sue Lueder and Betsy Beyer from Google have an excellent presentation and article on postmortem action items, which we use at Atlassian to prompt the team.

Work through the questions below to help ensure the postmortem covers both short- and long-term fixes:

"Mitigate future incidents" and "Prevent future incidents" are your most likely source of actions that address the root cause. Be sure to get at least one of these.

| Category | Question to ask | Examples |
| --- | --- | --- |
| Investigate this incident | "What happened to cause this incident and why?" Determining the root causes is your ultimate goal. | logs analysis, diagramming the request path, reviewing heap dumps |
| Mitigate this incident | "What immediate actions did we take to resolve and manage this specific event?" | rolling back, cherry-picking, pushing configs, communicating with affected users |
| Repair damage from this incident | "How did we resolve immediate or collateral damage from this incident?" | restoring data, fixing machines, removing traffic re-routes |
| Detect future incidents | "How can we decrease the time to accurately detect a similar failure?" | monitoring, alerting, plausibility checks on input/output |
| Mitigate future incidents | "How can we decrease the severity and/or duration of future incidents like this?" "How can we reduce the percentage of users affected by this class of failure the next time it happens?" | graceful degradation; dropping non-critical results; failing open; augmenting current practices with dashboards or playbooks; incident process changes |
| Prevent future incidents | "How can we prevent a recurrence of this sort of failure?" | stability improvements in the code base, more thorough unit tests, input validation and robustness to error conditions, provisioning changes |

We also use Lueder and Beyer's advice on how to word our postmortem actions.

Wording postmortem actions:

The right wording for a postmortem action can make the difference between an easy completion and indefinite delay due to infeasibility or procrastination. A well-crafted postmortem action should have these properties:

  • Actionable: Phrase each action as a sentence starting with a verb. The action should result in a useful outcome, not a process. For example, “Enumerate the list of critical dependencies” is a good action, while “Investigate dependencies” is not.

  • Specific: Define each action's scope as narrowly as possible, making clear what is and what is not included in the work.

  • Bounded: Word each action to indicate how to tell when it is finished, as opposed to leaving the action open-ended or ongoing.

| From... | To... |
| --- | --- |
| Investigate monitoring for this scenario. | (Actionable) Add alerting for all cases where this service returns >1% errors. |
| Fix the issue that caused the outage. | (Specific) Handle invalid postal code in user address form input safely. |
| Make sure engineer checks that database schema can be parsed before updating. | (Bounded) Add automated pre-submit check for schema changes. |
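
As a rough aid, a reviewer could flag actions that open with the vague phrasings from the "From" examples above. This is only a heuristic sketch: the `VAGUE_OPENERS` list is our assumption drawn from those three examples, and passing the check does not make an action actionable, specific, or bounded.

```python
# Openers taken from the "From" examples above; the list is illustrative only.
VAGUE_OPENERS = ("investigate", "make sure", "fix the issue")


def flag_vague_action(action: str) -> bool:
    """Return True when an action starts with one of the known vague openers."""
    # str.startswith accepts a tuple and matches if any prefix matches.
    return action.strip().lower().startswith(VAGUE_OPENERS)


for text in (
    "Investigate monitoring for this scenario.",
    "Add automated pre-submit check for schema changes.",
):
    print(text, "->", "reword" if flag_vague_action(text) else "ok")
```

A flagged action is a prompt to reword it, not a verdict. The approver still judges whether the final wording has a clear outcome and a clear finish line.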

Postmortem approvals

Atlassian uses a Jira workflow with an approval step to ensure postmortems are approved. Approvers are generally service owners or other managers with responsibility for the operation of a service. Approval for a postmortem indicates:

  • Agreement with the findings of the post-incident review, including what the root cause was; and

  • Agreement that the linked "Priority Action" actions are an acceptable way to address the root cause.

Our approvers will often request additional actions or identify a certain chain of causation that is not being addressed by the proposed actions. In this way, we see approvals adding a lot of value to our postmortem process at Atlassian.

In teams with fewer incidents or less complex infrastructure, postmortem approvals may not be necessary.

Blameless postmortems

When things go wrong, looking for someone to blame is a natural human tendency. It's in Atlassian's best interests to avoid this, though, so when you're running a postmortem you need to consciously overcome it. We assume good intentions on the part of our staff and never blame people for faults. The postmortem needs to honestly and objectively examine the circumstances that led to the fault so we can find the true root cause(s) and mitigate them. Blaming people jeopardizes this because:

  • When people feel the risk to their standing in the eyes of their peers or to their career prospects, this usually outranks "my employer's corporate best interests" in their personal hierarchy, so they will naturally dissemble or hide the truth in order to protect their basic needs.

  • Even if a person took an action that directly led to an incident, what we should ask is not "why did individual x do this", but "why did the system allow them to do this, or lead them to believe this was the right thing to do".

  • Blaming individuals is unkind and, if repeated often enough, will create a culture of fear and distrust. 

In our postmortems, we use these techniques to create personal safety for all participants:

  • Open the postmortem meeting by stating that this is a blameless postmortem and why 

  • Refer to individuals by role (eg "the on-call Widgets engineer") instead of name (while remaining clear and unambiguous about the facts)

  • Ensure that the postmortem timeline, causal chain, and mitigations are framed in the context of systems, process, and roles, not individuals.

Our inspiration for blameless postmortems and the useful concept of "second stories" comes from John Allspaw's seminal article.

Keep calm and carry on…

You've reached the end of our incident handbook. Thanks for reading!

If you have any feedback or suggestions, please send us an email at incident-handbook@atlassian.com.
