Incident response best practices and tips

The impact of an incident can be measured in tens or hundreds of thousands of dollars lost per minute. With so much at stake, organizations are rapidly evolving their incident response best practices.

If organizations don't constantly iterate on their incident management process, they expose themselves to mismanaged incidents, unnecessary delays, and the costs that come with them.

Here are some of the common, and not so common, best practices and tips.

1. Always pack a jump bag

A “jump bag” for incident responders holds all the critical information teams need to access with the least possible delay. These days it’s more likely a digital document than a physical bag, but having one centralized starting place for incident responders is a big help.

This could include a variety of things:

  • Incident response plans
  • Contact lists
  • On-call schedule(s)
  • Escalation policies
  • Links to conferencing tools
  • Access codes
  • Policy documents
  • Technical documentation & runbooks

2. Don’t run from runbooks

Runbooks offer guidance on what steps to take in a given scenario. They’re especially important for teams working on-call rotations when a system expert may not be immediately available. A well-maintained set of runbooks allows teams to respond faster and build a shared knowledge base of incident response practices.
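Some teams go a step further and codify runbooks as structured data, so steps can be rendered as a checklist or eventually automated. A minimal sketch in Python; the scenario, step descriptions, and commands here are purely illustrative, not a real procedure:

```python
# A runbook represented as structured data: each step has a description
# and an optional (hypothetical) command an on-call responder can run.
RUNBOOK = {
    "title": "Web tier returning 5xx errors",
    "steps": [
        {"desc": "Check load balancer health", "cmd": "lb-status --pool web"},
        {"desc": "Inspect recent deploys", "cmd": "deploys --last 3"},
        {"desc": "Restart unhealthy app servers", "cmd": "app restart --unhealthy"},
        {"desc": "Escalate to the service owner if errors persist", "cmd": None},
    ],
}

def format_runbook(runbook):
    """Render a runbook as a numbered checklist for responders."""
    lines = [runbook["title"]]
    for i, step in enumerate(runbook["steps"], start=1):
        suffix = f"  [{step['cmd']}]" if step["cmd"] else ""
        lines.append(f"{i}. {step['desc']}{suffix}")
    return "\n".join(lines)
```

Keeping runbooks in a structured form like this also makes them easy to version-control alongside the systems they describe.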

3. Embrace chaos, promote stability

Chaos engineering is the practice of experimenting on systems by intentionally injecting failure in order to learn how they can be built more robustly. A well-known example is Chaos Monkey. Originally developed at Netflix, Chaos Monkey tests resiliency by randomly terminating production instances.
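In spirit, the simplest form of such an experiment is picking a random instance and terminating it, then verifying the rest of the fleet absorbs the loss. A toy sketch, where the fleet and the "termination" are simulated in memory rather than calling a real cloud API:

```python
import random

def pick_victim(instances, rng=random):
    """Choose one running instance at random, chaos-monkey style."""
    running = [i for i in instances if i["state"] == "running"]
    if not running:
        return None
    return rng.choice(running)

def run_experiment(instances, rng=random):
    """Terminate one instance and report which one. A real experiment
    would then watch dashboards to confirm the service stayed healthy."""
    victim = pick_victim(instances, rng)
    if victim:
        victim["state"] = "terminated"
    return victim

# Simulated fleet of three web servers.
fleet = [{"name": f"web-{n}", "state": "running"} for n in range(3)]
victim = run_experiment(fleet)
```

The value of the exercise is not the termination itself but the follow-up question: did users notice?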

4. Think outside the NOC

Historically, Network Operations Centers (NOCs) acted as the monitoring and alerting hub for large-scale IT systems. Modern incident management tools streamline this process significantly. By automating alert delivery workflows based on defined alert types, team schedules, and escalation policies, teams can cut out much of the potential for human error and delay.
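The core of such a workflow is a routing rule: given an alert, decide who gets notified and how. A minimal sketch, with hypothetical severity levels and action names standing in for a real tool's configuration:

```python
# Hypothetical escalation policy: map alert severity to notification actions.
ESCALATION_POLICY = {
    "critical": ["page_on_call", "notify_team_channel"],
    "warning": ["notify_team_channel"],
    "info": ["log_only"],
}

def route_alert(alert, policy=ESCALATION_POLICY):
    """Return the notification actions for an alert, falling back to
    logging when the severity is unknown."""
    return policy.get(alert.get("severity"), ["log_only"])
```

Because the policy is data rather than tribal knowledge, changing who gets paged is a one-line edit instead of a process change.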

5. Aggregate, not aggravate

Nothing is worse than receiving a continual barrage of alerts from multiple monitoring tools. By centralizing the flow of alerts through a single tool, teams can filter out the noise and quickly focus on the matters that need attention.
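Deduplication is the heart of that filtering: collapse repeated alerts for the same problem into one entry with a count. A sketch, assuming each alert carries a source and check name that together identify the underlying problem:

```python
def aggregate(alerts):
    """Group duplicate alerts by a (source, check) fingerprint so responders
    see one entry per problem, with a count instead of a barrage."""
    grouped = {}
    for alert in alerts:
        key = (alert["source"], alert["check"])
        entry = grouped.setdefault(key, {**alert, "count": 0})
        entry["count"] += 1
    return list(grouped.values())

# Three raw alerts, but only two distinct problems.
noisy = [
    {"source": "db-1", "check": "disk_full"},
    {"source": "db-1", "check": "disk_full"},
    {"source": "web-2", "check": "latency_high"},
]
```

Real aggregation tools add time windows and smarter fingerprints, but the principle is the same: one problem, one alert.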

6. Remember: knowledge is power

A basic alert conveys that something is wrong, but it doesn’t always express what. That causes unnecessary delays while teams investigate the cause. By coupling each alert with the technical details of why it was triggered, the remediation process can begin faster.
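Concretely, this means enriching the alert payload before it reaches a responder. A sketch with hypothetical context fields (threshold, observed value, runbook link) attached to a bare alert:

```python
def enrich(alert, context):
    """Attach diagnostic context to a bare alert so responders see
    why it fired, not just that it fired. Field names are illustrative."""
    return {
        **alert,
        "threshold": context.get("threshold"),
        "observed": context.get("observed"),
        "runbook_url": context.get("runbook_url"),
    }

bare = {"name": "high_error_rate", "service": "checkout"}
rich = enrich(bare, {
    "threshold": "1%",
    "observed": "7.4%",
    "runbook_url": "https://example.com/runbooks/checkout",
})
```

An alert that arrives with its threshold, the observed value, and a runbook link lets the responder skip straight to remediation.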

7. Have alerts for your alerts

The Latin phrase “quis custodiet ipsos custodes?” (“Who will guard the guards themselves?”) names a universal problem. The monitoring tools IT and development teams rely on are just as vulnerable to incidents and downtime as the systems they are designed to protect. Holistic alerting ensures that both the systems and the tools that monitor them are continually checked for health.
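A common pattern is a heartbeat watchdog: each monitoring tool regularly reports "I'm alive," and an independent check flags any tool that has gone quiet. A sketch, with monitor names and timestamps (in seconds) invented for illustration:

```python
def stale_monitors(heartbeats, now, max_age=120):
    """Return the monitors whose last heartbeat is older than max_age
    seconds. A second, independent channel should alert on these."""
    return [name for name, last_seen in heartbeats.items()
            if now - last_seen > max_age]

# Last-seen heartbeat timestamps for two (hypothetical) monitoring tools.
beats = {"metrics-scraper": 1000, "uptime-checker": 700}
```

The watchdog itself should run somewhere independent of the systems it watches, or the original problem simply recurs one level up.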

8. Stop the bleeding

A triage doctor knows they risk greater harm by getting bogged down trying to fully resolve every case as it arrives. Their focus is on short-term actions that stabilize a patient enough to move them along to more acute care. In tech, containment actions focus on temporary measures (isolating a network, rolling back a build, restarting servers, and so on) that at minimum limit the scope of the incident or, ideally, bring systems back online.
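One cheap, reversible containment tool is a feature-flag kill switch: flip off a failing, non-critical feature to stabilize the system while the real fix is developed. A toy sketch, with made-up feature names:

```python
# Feature flags acting as kill switches. Flipping one off is a fast,
# reversible containment action, not a fix.
FLAGS = {"recommendations": True, "checkout": True}

def contain(flags, failing_feature):
    """Disable a failing feature to stop the bleeding; everything
    else keeps running."""
    if failing_feature in flags:
        flags[failing_feature] = False
    return flags
```

The point is the same as the triage doctor's: buy time and limit damage first, diagnose at leisure afterward.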

9. Don’t go it alone

Hero culture in IT and DevOps teams is a dying philosophy. It's no longer fashionable to be the lone engineer who works endless evening and weekend hours as the only person who can bring systems back online. Instead, teams are working as just that: teams. A chain is only as strong as its weakest link, so nurture the entire team and not just one lone rockstar.

10. Be transparent

If users are met with a service disruption, it's common for the incident to become public in short order. To stay ahead of this, teams should have an incident communication plan in place. The goal is to build trust with customers by publicly acknowledging that a disruption is taking place and assuring them that steps are being taken to resolve it. Tools like Statuspage are great for distributing this information.

11. Learn from failure

Overwhelmingly, IT and DevOps teams say they only take the time to review “major outages.” That's a good start, but it overlooks smaller incidents that may have a lingering impact. A lengthy report may not be necessary for every incident, but a postmortem analysis is always useful.

12. Find the root cause (there is no root cause!)

Or is there? When analyzing an incident, it is rare that a single identifiable “root” cause can be named; systems are often far too complex and interdependent for that. Even when the root cause seems apparent (say, a keystroke error that crashes an application), it's usually worth understanding what external factors allowed the crash, or failed to prevent it. Look for multiple contributing causes for a deeper understanding of your incidents.

13. Be blameless

The goal of every incident postmortem should be to understand what went wrong and what can be done to avoid similar incidents in the future. Importantly, this process should not be used to assign blame. Teams that focus on the “who” rather than the “what” let emotions pull the analysis away from truly understanding what happened.

One more thing

In modern incident management, change is the only constant, which means systems will continually be stressed in new and different ways. Teams that understand this also understand that it's not a matter of if, but when, systems will fail. Preparing for those failures should be recognized as a critical element of ongoing success and woven into the DNA of engineering teams.