Take a page from our (hand)book.
Our mission at Atlassian is to unleash the potential of every team. One thing we know great teams have in common? They use playbooks to manage the many processes formulated to keep their organizations running smoothly.
Playbooks are a key component of DevOps and IT Ops incident management, as well as cybersecurity. They set the organization’s policies and practices for responding to unplanned outages, helping teams bring order to chaos and make sure everyone’s responding to incidents and security threats consistently.
What’s in an incident response playbook?
An incident management handbook gives your team a set of processes for responding to, resolving, and learning from every incident, whether it’s a security issue or another emerging vulnerability. The content can include everything from runbooks and checklists to templates, training exercises, security attack scenarios, and simulation drills.
Here are a few critical elements an incident playbook should include. We’ll be using our own Atlassian Incident Management Handbook for examples here.
“A good incident process is fast and predictable. It quickly turns detection into response, escalates to the right people on the shortest path, makes communications clear and keeps customers in the loop. It’s simple enough for people to follow under stress, but broad enough to work for the variety of incident types you will encounter.”
Source: Atlassian Incident Management Handbook
Get everyone on the same page
What to include: A specific definition of what constitutes an incident
Why: You can’t effectively resolve an incident if you don’t know when it’s happening. Different teams define incidents in different ways. If something goes wrong, every second matters, and you don’t need colleagues fighting over semantics.
Avoid confusion by defining roles
What to include: Incident roles and responsibilities
Why: A proper incident response playbook designates clear roles and responsibilities. Individuals on the incident response team are familiar with each role and know what they’re responsible for during an incident.
Enforce a consistent process
What to include: Process steps and workflows
Why: No two incidents are exactly alike. But that doesn’t mean your responders can’t introduce a consistent workflow for responding to incidents.
Outline key steps and phases and make sure team members are clear on what’s expected during each phase – and what comes next.
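One way to make phases and "what comes next" explicit is to write them down as data. The sketch below is only an illustration: the phase names, their order, and the expectation text are hypothetical placeholders, not the phases from the Atlassian Incident Management Handbook, so swap in whatever your own playbook defines.

```python
from enum import Enum


class Phase(Enum):
    # Hypothetical phase names; replace with the phases in your own playbook.
    DETECT = "detect"
    ASSESS = "assess"
    ESCALATE = "escalate"
    RESOLVE = "resolve"
    REVIEW = "review"


# Hypothetical ordering: each phase maps to the phase that follows it.
NEXT_PHASE = {
    Phase.DETECT: Phase.ASSESS,
    Phase.ASSESS: Phase.ESCALATE,
    Phase.ESCALATE: Phase.RESOLVE,
    Phase.RESOLVE: Phase.REVIEW,
}

# Hypothetical one-line expectation for responders in each phase.
EXPECTATIONS = {
    Phase.DETECT: "Confirm the problem meets your definition of an incident.",
    Phase.ASSESS: "Gauge impact and assign a severity.",
    Phase.ESCALATE: "Bring in the people needed to fix it.",
    Phase.RESOLVE: "Apply the fix and confirm service is restored.",
    Phase.REVIEW: "Hold a review and record what was learned.",
}


def what_comes_next(phase):
    """Return the phase after `phase`, or None if it is the last one."""
    return NEXT_PHASE.get(phase)


def describe(phase):
    """Summarize what is expected now and which phase follows."""
    following = what_comes_next(phase)
    next_label = following.value if following else "none (incident closed)"
    return f"{phase.value}: {EXPECTATIONS[phase]} Next: {next_label}"


if __name__ == "__main__":
    for phase in Phase:
        print(describe(phase))
```

Keeping the workflow in one small table like this means a responder under stress can look up the current phase and see both the expectation and the next step at a glance.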
Enable rapid response
What to include: Templates and checklists
Why: Incident playbooks need to be simple enough for teams to follow in times of stress. Our own process includes a major incident manager “cheat sheet,” which outlines key steps like assessment, escalation, and delegation in a one-page format.
Following a predetermined incident response process doesn’t mean there’s no room for improvising. You have to be flexible, and know when to call an audible. But teams who train and practice a set of plays – whether on the basketball court or in the boardroom – are typically the ones who succeed.
The same is true for incident management teams. Incidents, by definition, are scenarios where things don’t go according to plan, but that doesn’t mean you can’t plan for them.
An incident management playbook empowers teams with standard procedures and steps for responding to and resolving incidents in real time. Playbooks can also include peacetime training and exercises, which will prepare the team for the next incident.
“We often role-play the response to an on-call issue that has already been solved, perhaps by colleagues in another location, to further familiarize ourselves with incident management.”
Source: Site Reliability Engineering: How Google Runs Production Systems
At Atlassian, our incident teams are constantly training, refining, testing, and improving our incident management process.
Here are a few of our favorite incident management playbooks, tutorials, guides, and resources, so you can get started building your team’s own plan of action:
The Atlassian Team Playbook offers teams of all types practical plays they can run right away to improve how their team gets things done. Check out the plays on Incident Response Values and Incident Response Communications.
Inspired by teams like Google, we’ve created our Atlassian Incident Management Handbook as a summary of Atlassian’s incident management process. These are the real lessons we’ve learned and practices we’ve adopted responding to incidents for more than a decade.
Many of our tutorials, tips, and best practices for incident management live at Atlassian’s incident management hub. Learn to set up an on-call schedule with Opsgenie, build your incident communication templates in Statuspage, and plenty more.
This interactive incident communication workshop helps you get the right people and processes in place for communicating incidents and downtime to your customers.