A lot of teams are asking us how to do incident management when you’re suddenly remote.
We understand. Going remote can be scary, and few things are scarier than having a service outage you aren’t prepared for. Nobody wants to be in a situation where an important service is going down and the engineer who can help isn’t answering on Slack. And if your company isn’t used to working remotely, it can be harder than ever to be on the same page during an incident.
At the same time, working remotely can make your incident team stronger than ever. It can be a forcing function for adopting best practices that benefit every incident team, remote or not. At Atlassian we’ve been practicing remote-first incident management for years, because our teams are distributed around the globe. An alert about a power failure affecting a cloud server in Oregon might have us waking up engineers in California and New York while looping in support teams in Sydney and updating stakeholders in Texas. We’re able to spin up organized, multi-continent incident response teams in a matter of minutes.
We didn’t figure this out overnight. We got here through years of experience and practice, and we aim to keep improving. Here are some tips for remote incident management teams. Many of them are detailed in our own Incident Management Handbook, which we’ve made freely available to the public.
Communication is more important than ever when incident response teams are remote. When an incident is underway at Atlassian, one of the incident manager’s first responsibilities is to establish communication channels: specifically, a dedicated Slack channel for that incident, a Zoom room for live discussion, and a Confluence page for capturing notes.
If these canonical channels aren’t established right away, things can go south quickly. Different Slack conversations happen in multiple places, and engineers are off having private one-on-one conversations that other team members don’t know about. Important information is buried, lost, or siloed. Duplicate work happens. It’s not good.
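One way to make those channels easy to find is to derive their names from the incident key, so nobody has to ask where the conversation is happening. The sketch below is only an illustration: the `IncidentChannels` type, the `canonical_channels` helper, the `INC-1234` key format, and the naming scheme are all assumptions made for this example, not our actual tooling, and it does not call Slack, Zoom, or Confluence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IncidentChannels:
    """The three canonical places responders should look for one incident."""
    chat_channel: str  # dedicated chat channel name (for example, in Slack)
    video_room: str    # label for the live-discussion room (for example, in Zoom)
    notes_page: str    # title of the shared notes page (for example, in Confluence)


def canonical_channels(incident_key: str) -> IncidentChannels:
    """Derive predictable names from an incident key such as 'INC-1234'.

    The naming scheme here is a hypothetical convention for illustration.
    """
    key = incident_key.strip()
    if not key:
        raise ValueError("incident_key must not be empty")
    return IncidentChannels(
        chat_channel=f"#{key.lower()}",
        video_room=f"Incident {key} bridge",
        notes_page=f"{key} incident notes",
    )


channels = canonical_channels("INC-1234")
print(channels.chat_channel)
print(channels.video_room)
print(channels.notes_page)
```

Because the names are predictable, a responder who only knows the incident key can work out where to go without interrupting anyone.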
And good communication goes beyond the immediate response team. Stakeholder and end-user communication is especially important with distributed teams. When managers don’t see your team physically huddled in a war room and typing feverishly, they get nervous. If you don’t proactively send updates to the right people, you’ll be flooded with interruptions. By being transparent and sharing real-time updates with customers and stakeholders in the places they’re likely to look (web pages, support portal, social media), you can deflect tickets and interruptions. This gives your incident team more space to work on the incident. We use Statuspage to communicate these updates with internal and external stakeholders during incidents.
Practice openness and transparency
During an incident, your most precious resource is time. You want to mitigate the incident by getting the most useful work done in the least time possible. Incident management is an exercise in efficiency.
Nothing grinds that efficiency to a halt like duplicate work and siloed communication. Two engineers repeating each other’s work, or multiple teams having different versions of the same conversation, is a dangerous waste of time. And it’s especially easy for this to happen when teams are remote.
That’s why openness and transparency among incident responders is so important. Yes, there might be sensitive information you can’t share with customers or the rest of your organization. But among the response team, aim to default to transparency.
Create a source of truth for incident information
Speaking of transparency, it’s good to create a source of truth for the incident: a record that captures what the incident is, what’s going on with it, and how to find more information. This gives the response team – no matter where they are – a place to get up to speed on the incident and see its status. Tools like Jira Service Desk and Opsgenie are among the most popular options for this.
For folks outside the immediate response team – stakeholders, customers, and colleagues – we use Statuspage as the source of truth for incidents and incident status.
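To make the idea concrete, here is a minimal sketch of what such a record might hold: what the incident is, its current status, and where to find more information. The `IncidentRecord` and `Status` names, the three status values, and the example data are assumptions for illustration only; they are not the data model of Jira Service Desk, Opsgenie, or Statuspage.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict


class Status(Enum):
    # Hypothetical lifecycle stages; use whatever your own process defines.
    INVESTIGATING = "investigating"
    MITIGATING = "mitigating"
    RESOLVED = "resolved"


@dataclass
class IncidentRecord:
    key: str                      # what the incident is called
    summary: str                  # what the incident is
    status: Status = Status.INVESTIGATING         # what is going on with it
    links: Dict[str, str] = field(default_factory=dict)  # where to find more

    def brief(self) -> str:
        """Render a short text summary a newly paged responder can skim."""
        lines = [f"{self.key}: {self.summary}", f"Status: {self.status.value}"]
        for label, target in self.links.items():
            lines.append(f"{label}: {target}")
        return "\n".join(lines)


record = IncidentRecord(
    key="INC-1234",
    summary="Elevated error rate on checkout",
    links={"chat": "#inc-1234", "notes": "INC-1234 incident notes"},
)
record.status = Status.MITIGATING
print(record.brief())
```

The point is less the exact fields than having one place that answers the three questions above, so the chat channel can stay focused on the work.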
Document, record, capture
It’s not unusual in an Atlassian office to see two engineers calmly working side-by-side on the same incident but in complete silence. You could walk right past them and never even know they’re working on something together.
This is on purpose. We know that capturing and documenting information about the incident during the response is key. Face-to-face conversations aren’t off-limits, but we know they have a way of generating ephemeral information. And ephemeral information isn’t great for incident response.
We want to create information other team members can see, learn from, and build on. We want information we can analyze and study after the fact. All that detail is data you can study in an incident postmortem. Postmortems help teams peer into the inner workings of their services and discover areas for improvement. They’re also an effective tool for building trust between leadership and the rest of your organization.
That’s why everything from team communication to notes, theories, and activities is documented and recorded. Seeing something funny in logs? Take a screenshot and put it in Slack. Discover a task you should tackle later? Drop it in Confluence or Jira. Capturing all this data creates a rich record of the incident that may be helpful in ways you don’t even realize yet.
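If you want a picture of what capturing everything can look like as data, the sketch below models a simple append-only timeline of timestamped notes that can be filtered later for a postmortem. It is an illustration under assumptions: the `IncidentTimeline` and `TimelineEntry` names and the entry kinds ("observation", "theory", "action", "follow-up") are made up for this example and do not describe how Slack, Confluence, or Jira store anything.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class TimelineEntry:
    at: datetime   # when the note was captured (UTC by default)
    author: str    # who captured it
    kind: str      # hypothetical label, e.g. "observation" or "follow-up"
    text: str      # the note itself


@dataclass
class IncidentTimeline:
    entries: List[TimelineEntry] = field(default_factory=list)

    def note(self, author: str, kind: str, text: str,
             at: Optional[datetime] = None) -> TimelineEntry:
        """Append a note; entries are never edited or removed."""
        entry = TimelineEntry(at or datetime.now(timezone.utc), author, kind, text)
        self.entries.append(entry)
        return entry

    def of_kind(self, kind: str) -> List[TimelineEntry]:
        """Return entries of one kind, in the order they were captured."""
        return [e for e in self.entries if e.kind == kind]


timeline = IncidentTimeline()
timeline.note("ana", "observation", "Error spike in checkout logs")
timeline.note("raj", "theory", "Could be related to the last deploy")
timeline.note("ana", "follow-up", "Add an alert for this error pattern")

for entry in timeline.of_kind("follow-up"):
    print(entry.author, entry.text)
```

Tagging each note with a kind is one design choice among many; it makes it easy to pull out follow-up tasks or theories after the incident without rereading the whole conversation.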
Get your entire team operating from the same playbook
As we say in our incident handbook, a good incident process should be simple enough for people to follow under stress, but broad enough to work for the variety of incident types you will encounter.
We think our handbook is broad enough to work for different kinds of incidents and different kinds of teams, and you’re welcome to use it yourself. Feel free to download it, copy it, adapt it, share it, and make these practices work for your teams too.