Incident management for high-velocity teams
How to run a blameless postmortem
Incident postmortems focused on growth – without the blame game
Most companies experience major incidents at least several times per year.
We can work to prevent incidents, reduce their impact, and shorten their timelines. But they’re probably not going to disappear altogether anytime soon.
The good news is that incidents are a learning opportunity. They’re a chance to uncover vulnerabilities in our systems, prevent future recurrences, hone our processes to reduce incident impact, and build better software in the future.
The best way to learn from incidents is to institute incident postmortems. And here at Atlassian, our postmortems are blameless.
What is a blameless postmortem?
An incident postmortem brings teams together to take a deeper look at an incident and figure out what happened, why it happened, how the team responded, and what can be done to prevent repeat incidents and improve future responses.
Blameless postmortems do all this without any blame games.
In a blameless postmortem, it’s assumed that every team and employee acted with the best intentions based on the information they had at the time. Instead of identifying—and punishing—whoever screwed up, blameless postmortems focus on improving performance moving forward.
From Atlassian’s Incident Management Handbook:
When things go wrong, looking for someone to blame is a natural human tendency. It's in Atlassian's best interests to avoid this, though, so when you're running a postmortem you need to consciously overcome it. We assume good intentions on the part of our staff and never blame people for faults. The postmortem needs to honestly and objectively examine the circumstances that led to the fault so we can find the true root cause(s) and mitigate them.
Advocates—like Google and Etsy—say this approach helps foster a culture of learning and improves performance over time. They point out that removing the witch hunt portion of the program creates a psychological shift. Instead of worrying about being fired or demoted and trying to pass around blame like a hot potato, teams can focus on fixing the underlying issues.
Detractors wonder if blameless postmortems are really possible (aren’t humans wired for blame?) and worry the approach doesn’t foster accountability.
Are blameless postmortems even possible?
One of the primary critiques of blameless postmortems is that they simply aren’t possible. After all, blame and judgment are natural. Accountability is an essential part of running a successful team. And detractors imagine that blameless postmortems are like an awkward family dinner – everyone trying semi-successfully to smile and not say what they’re really thinking.
These critiques assume that the point of blameless postmortems is to make those responsible for an incident feel better—a goal that would probably stifle real conversation and accountability.
But the actual point of blameless postmortems is to remove the fear of looking stupid, being reprimanded, or even losing your job. The ultimate goal is to encourage honest, objective, and fact-centric communication that leads to better future outcomes.
For example, let’s say an incident happened because Employee A assumed, incorrectly, that Employee B had deployed a fix. Instead of spending the postmortem trying to figure out whether Employee A or Employee B was ultimately to blame, a blameless postmortem would have each employee walk through their work processes and thought processes to try to get to the heart of the issue.
By walking through the process, we can identify where we can improve. Perhaps our training processes aren’t working. Perhaps the documentation was confusing. Maybe there’s a way to create checks and balances within our technical systems so that employees don’t have to remember who to check in with.
The point isn’t that blameless postmortems never identify who made a mistake. It’s that blamelessness opens up communication and acknowledges that IT incidents are complex and there may be multiple ways to improve in the future—without shaming or firing Employee A.
The value of effective blameless postmortems
For many, blameless postmortems may require a culture shift. But in our experience, the benefits outweigh the work it takes to get there. Blameless postmortems:
· Create a healthy culture between teams
If we’re not looking for another team to blame, we’ll be more effective at working together, communicating clearly and without fear, and having empathy for the teams around us.
· Decrease the chances of ignoring incidents for fear of blame
If an incident isn’t going to result in public shaming or firing, employees are more likely to communicate about that incident, bring it to the team’s attention, and share ideas for future fixes. If there’s a chance of losing a job, the incentive is to clam up and keep slip-ups to ourselves.
· Create an open, always-improving culture of learning
Blameless postmortems encourage teams to talk through what went wrong step-by-step and brainstorm ideas for improving. They also acknowledge that incidents are complicated and we’re all human—giving employees permission to embrace learning and change instead of defending their choices out of fear of consequences.
· Increase support and communication
If Employee A and B don’t have to blame each other for an outage, chances are their relationship will be stronger. Removing the fear takes the pressure off and gives people the chance to support each other.
· Free teams up to do their best work
Watching a teammate be blamed, shamed, or even fired for a misstep makes other employees less confident and more fearful about doing their own jobs. It can slow down operations and create obstacles to future progress.
Best practices for a blameless culture
Implementing successful blameless postmortems starts with laying a foundation for a blameless culture. Here’s where to start:
Communicate an open, mistake-friendly approach up front
Make sure teams know before the meeting even begins that this isn’t a witch hunt. It’s an opportunity for the company to learn and improve. People can be honest about assumptions, incorrect expectations, and missteps without fear of reprisal.
Encourage honesty and acceptance of failure
The detractors who say blameless postmortems don’t have enough accountability? Here’s where they’re wrong. Your postmortems should encourage honesty and accountability. Removing the fear of consequences frees people up to be honest about their missteps and misunderstandings. And that’s the only way to fix them.
Share information and build a timeline
Before you start digging into an incident, make sure everyone’s on the same page about what actually happened. A misunderstanding of the core issue can make an incident postmortem go quickly off the rails. This is why building a timeline of the incident is important.
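One way to get everyone on the same page is to merge the records each team already has into a single chronological list. The sketch below is only an illustration of that idea: the `TimelineEvent` fields, the source labels, and the function names are assumptions made for this example, not part of any Atlassian tool.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class TimelineEvent:
    # All field names are hypothetical, chosen for this sketch.
    at: datetime       # when it happened (timezone-aware)
    source: str        # where the record came from: chat, alerting, deploy log
    description: str   # what was observed or done, stated as a fact


def build_timeline(*sources: list[TimelineEvent]) -> list[TimelineEvent]:
    """Merge events from every source into one chronological list."""
    merged = [event for events in sources for event in events]
    return sorted(merged, key=lambda event: event.at)


def render(timeline: list[TimelineEvent]) -> str:
    """Format the timeline as one line per event, with times shown in UTC."""
    return "\n".join(
        f"{e.at.astimezone(timezone.utc):%H:%M} UTC [{e.source}] {e.description}"
        for e in timeline
    )
```

Keeping descriptions to observed facts ("alert fired", "rollback started") rather than judgments keeps the timeline consistent with the blameless approach.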
Be consistently blameless
If some postmortems are blameless and others aren’t, people can’t trust that honesty is safe, and the fear that blamelessness is meant to remove comes right back.
Get C-suite buy-in
Blameless postmortems will be a culture change for most organizations. Make sure you sit down with company leaders to help them understand the benefits of blameless postmortems and blameless company culture before you begin. Culture shifts are only possible with top-level buy-in.
Even teams who weren’t directly involved in the incident may learn or contribute something in a postmortem.
Inviting different teams to a postmortem encourages cross-team collaboration and brings more perspectives to the table, ultimately improving incident management. Inviting someone from the security and privacy team, legal, or risk and compliance can help identify previously unknown contributing factors, other potential pitfalls in existing processes, and ways other teams can improve their support of technical systems and processes.
Make decisions, but get approval
A good blameless postmortem should result in some suggestions that help prevent future incidents. Make sure you identify who is responsible for approving recommended actions and reviewing the write-ups themselves.
At Atlassian, that person is a division-level head of engineering. They’re responsible for reviewing the conclusions and prioritizing agreed actions and mitigations after the postmortem.
A blameless postmortem success story
So, do blameless postmortems really improve results? Internally at Atlassian, all signs point to yes.
A couple of years ago, an engineer made a big mistake with the syntax of a config file for a piece of critical equipment, and it took down the entire company for 45 minutes. If you quantify it, we’re talking hundreds of thousands of dollars.
But instead of shaming the engineer, we did a blameless postmortem. Our goal wasn’t to punish someone for a mistake; it was to find out whether there was a way to prevent that same mistake in the future. Humans make errors. There’s no getting around that. The question is: how do we make it less likely that a human error takes a system down? And to answer that, we needed to know what happened and why.
In the end, the simple, permanent fix was putting an automated “will it start” check on the config file before loading, and eventually removing all human interaction with the system's configuration. The issue that caused the outage is now prevented by a quick technical fix. The engineer involved still works at Atlassian and adds a lot of value to our team.
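A “will it start” check of this kind can be sketched in a few lines. The article does not say what format the real config file used or what the real check looked like, so the example below assumes a JSON file and two made-up required keys (`listen_port`, `upstream_hosts`). Treat it as an illustration of the idea, not the fix Atlassian shipped.

```python
import json

# Hypothetical required settings; the real file's contents are not described.
REQUIRED_KEYS = ("listen_port", "upstream_hosts")


def will_it_start(config_text: str) -> list[str]:
    """Return a list of problems; an empty list means the config looks loadable."""
    try:
        config = json.loads(config_text)
    except json.JSONDecodeError as exc:
        # A syntax error is exactly the class of mistake described above.
        return [f"syntax error at line {exc.lineno}: {exc.msg}"]
    if not isinstance(config, dict):
        return ["top level must be an object"]
    return [f"missing required key: {key}" for key in REQUIRED_KEYS if key not in config]


def load_if_valid(config_text: str, current_config: dict) -> dict:
    """Swap in the new config only if it passes; otherwise keep the running one."""
    if will_it_start(config_text):
        return current_config
    return json.loads(config_text)
```

The design choice that matters is that a failed check leaves the running configuration untouched, so a typo produces an error message instead of an outage.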
At Atlassian, we’re fans of simple, repeatable processes—and our blameless postmortems are no exception. We’ve come up with a process that works well for us, and you can read about it in depth in our incident handbook.