Incident management for high-velocity teams
Creating better incident timelines (and why they matter)
As technology gets more complex, so does incident management. And as incident management gets more complex, so do documentation and communication.
That is why more and more companies are embracing the incident timeline, a centralized incident activity feed designed to keep teams on the same page during an incident and to provide a record those same teams can use afterward to identify root causes and improve future performance.
What is an incident timeline?
An incident timeline is a complete real-time record of an incident. It often includes manual entries (such as chat messages); consolidated records of pages, alerts, and acknowledgements; and automatic system updates (for example, a notification that someone has changed the severity level of an incident or marked it as resolved). It is also often synced with a chat tool, such as a Slack channel.
The timeline is there to keep the team on the same page, get new team members up to speed quickly, and simplify the process of incident postmortems.
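As a rough illustration of the three kinds of entries described above, a timeline can be modeled as an ordered list of timestamped records. This is a minimal sketch, not how any particular incident management product stores its data. The names `EntryKind`, `TimelineEntry`, and `IncidentTimeline` are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import List


class EntryKind(Enum):
    MANUAL = "manual"  # notes or chat messages typed by responders
    ALERT = "alert"    # pages, alerts, and acknowledgements
    SYSTEM = "system"  # automatic updates such as severity or status changes


@dataclass(frozen=True)
class TimelineEntry:
    timestamp: datetime
    kind: EntryKind
    author: str
    message: str


@dataclass
class IncidentTimeline:
    incident_id: str
    entries: List[TimelineEntry] = field(default_factory=list)

    def add(self, entry: TimelineEntry) -> None:
        """Add an entry and keep the feed in chronological order."""
        # Assumes all timestamps are timezone-aware (or all naive); mixing
        # the two would make the comparison below raise a TypeError.
        self.entries.append(entry)
        self.entries.sort(key=lambda e: e.timestamp)

    def since(self, cutoff: datetime) -> List[TimelineEntry]:
        """Return entries at or after the cutoff, e.g. to brief a newcomer."""
        return [e for e in self.entries if e.timestamp >= cutoff]


# Example usage with made-up data: entries arrive out of order.
timeline = IncidentTimeline("INC-1")
timeline.add(TimelineEntry(
    datetime(2024, 1, 1, 10, 5, tzinfo=timezone.utc),
    EntryKind.MANUAL, "alice", "Rolling back the latest deploy"))
timeline.add(TimelineEntry(
    datetime(2024, 1, 1, 10, 0, tzinfo=timezone.utc),
    EntryKind.ALERT, "monitoring", "Error rate above threshold"))

for entry in timeline.entries:
    print(entry.timestamp.isoformat(), entry.kind.value, entry.author, entry.message)
```

Keeping every entry in one chronologically sorted list is what lets a newly paged responder read the incident from the top, or from a chosen cutoff, without reconstructing events from several tools.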
“Get me a list of all the changes made in the past, say, three days. Without an accurate timeline, we won’t be able to establish cause and effect, and we’ll probably end up causing another outage.”
— From “The Phoenix Project,”
Gene Kim, Kevin Behr, George Spafford
The value of an incident timeline
A single real-time view
One of the quickest ways for an incident to get out of control is a lack of communication between teams or stakeholders. An incident timeline mitigates this risk by giving everyone the same information in a single view in real time. This means everyone, from the developers working on the incident to the communications team responsible for updating users to C-suite stakeholders, can stay up to speed without complicated games of telephone or multiple disconnected email threads, phone calls, and chats.
The single real-time view also makes it simpler to identify not only the core problem of the incident, but also causes, risks, and potential problems in interconnected systems, since multiple teams have access to the same timeline.
More robust incident postmortems
At Atlassian, incident postmortems are an essential part of our incident and problem management processes. They bring people together to figure out what happened, why it happened, and what we can do to prevent it from happening in the future. To get to the bottom of those questions, it helps to have a detailed record of everything that happened during an incident—from alerts to stakeholder updates to the final fix.
For many companies, the incident timeline acts as that detailed record. It’s not only a tool for real-time collaboration on incidents. It’s also a single view of what happened, when, and sometimes why—information that can save teams hours upon hours during the postmortem review phase.
Digging deeper into KPIs
An incident timeline often helps teams get to the bottom of a single incident, but its usefulness doesn’t stop there. It can also be used alongside timelines for similar incidents to help teams spot patterns and diagnose larger problems with important KPIs.
If an incident took longer than average to resolve, where were the points of failure? How does that match up with other similar incidents? Which parts of the process need a closer look? Is there a pattern that can lead us to a larger issue with process, technology, or team setup? Are alerts going out as needed or do we need to revisit our alerting thresholds? Is the on-call schedule giving incidents sufficient coverage? Are our teams structured the right way?
A timeline can act as a single data point for review or as one of many data points in an investigation into service level agreement (SLA) and service level objective (SLO) issues.
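To make the comparison across similar incidents concrete, the sketch below measures how long each incident took between two timeline events and flags the ones that were slower than the group average. It is an illustration under stated assumptions: each timeline is reduced to `(timestamp, label)` pairs, and the labels `alert_fired`, `acknowledged`, and `resolved` are hypothetical names, not fields from any specific tool.

```python
from datetime import datetime
from statistics import mean


def duration_seconds(events, start_label, end_label):
    """Seconds from the first start_label event to the first end_label
    event at or after it, or None if either event is missing."""
    ordered = sorted(events)  # sort (timestamp, label) pairs chronologically
    start = next((ts for ts, label in ordered if label == start_label), None)
    if start is None:
        return None
    end = next(
        (ts for ts, label in ordered if label == end_label and ts >= start),
        None,
    )
    if end is None:
        return None
    return (end - start).total_seconds()


def slower_than_average(timelines, start_label="alert_fired", end_label="resolved"):
    """Return the IDs of incidents whose duration exceeds the group mean."""
    durations = {}
    for incident_id, events in timelines.items():
        seconds = duration_seconds(events, start_label, end_label)
        if seconds is not None:  # skip incidents missing either event
            durations[incident_id] = seconds
    if not durations:
        return []
    average = mean(durations.values())
    return sorted(i for i, s in durations.items() if s > average)


# Made-up data for three similar incidents.
timelines = {
    "INC-1": [
        (datetime(2024, 1, 1, 10, 0), "alert_fired"),
        (datetime(2024, 1, 1, 10, 4), "acknowledged"),
        (datetime(2024, 1, 1, 10, 30), "resolved"),
    ],
    "INC-2": [
        (datetime(2024, 1, 8, 9, 0), "alert_fired"),
        (datetime(2024, 1, 8, 9, 20), "acknowledged"),
        (datetime(2024, 1, 8, 10, 30), "resolved"),
    ],
    "INC-3": [
        (datetime(2024, 1, 15, 14, 0), "alert_fired"),
        (datetime(2024, 1, 15, 14, 2), "acknowledged"),
        (datetime(2024, 1, 15, 15, 0), "resolved"),
    ],
}

print(slower_than_average(timelines))  # ['INC-2']
print(slower_than_average(timelines, end_label="acknowledged"))  # ['INC-2']
```

In this sample, the same incident is slow to acknowledge and slow to resolve, which would point a review toward alerting or on-call coverage rather than the fix itself. A real analysis would likely need more incidents and a more robust baseline than a simple mean.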
Incident timelines vs. ChatOps
Incident timelines are typically provided by and used within incident management systems like Opsgenie to centralize all incident information.
ChatOps for incident management has the same goal. The main difference is that instead of being housed in an incident management system, ChatOps typically centralizes the timeline in a chat program like Slack, which syncs with and pulls in information from incident management platforms like Opsgenie and any other relevant sources.
The benefits of ChatOps—access to the same information across teams, real-time conversations and updates, less context-switching, no more games of telephone, and a built-in record for postmortems—are the same benefits that an incident timeline promises. The core difference is simply the location and amount of information. For most incident teams, the ChatOps feed typically has a lot of noise surrounding the important information. It’s helpful to pull the rich details into your incident timeline, while retaining the chat log for future reference should you ever need it.