Get to know the incident response lifecycle

Hang around security and incident management pros long enough, and you’ll notice a pattern. The smartest people in these industries think in cycles, not straight lines.

Why is that, and what does it mean? It means that no incident or outage is an isolated event with a clean beginning and end, even though it may seem that way. Every incident is a learning opportunity.

Just because a service is “operational” again doesn’t mean your team’s work is over. Post-incident activities should have you adding plans to future roadmaps, changing the way you prepare for future incidents, and discovering new things to build that will prevent more incidents. It’s a never-ending cycle of improvement, and there are a few different ways to think about its stages, depending on which school of thought you subscribe to.

What is an incident response lifecycle?

Incident response is an organization’s process for reacting to IT threats such as cyberattacks, security breaches, and server downtime.

The incident response lifecycle is your organization’s step-by-step framework for identifying and reacting to a service outage or security threat.

Atlassian’s incident response lifecycle

Figure: Atlassian's incident response lifecycle chart

1. Detect the incident

Our incident detection typically starts with monitoring and alerting tools, though sometimes we first learn about an incident from customers or team members.

2. Set up team communication channels

An important first step is to set up the incident team's communication channels. The goal at this point is to focus team communications in well-known places, such as a dedicated Slack channel and video conference bridge.

3. Assess the impact and apply a severity level

Now it’s time to assess the impact of the incident so the team can decide who else to contact and what to communicate with customers and stakeholders.
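As an illustration of this step, here is a minimal Python sketch of how an impact assessment might map to a severity level. The three-level scale, the `Impact` fields, and the thresholds are all hypothetical choices made for this example; they are not Atlassian's actual criteria, and each team would define its own.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    """Hypothetical three-level scale; a lower number is more severe."""
    SEV1 = 1  # critical: core service down or most customers affected
    SEV2 = 2  # major: noticeable share of customers affected, or no workaround
    SEV3 = 3  # minor: limited impact and a workaround exists


@dataclass
class Impact:
    customers_affected_pct: float  # 0.0 to 100.0
    core_service_down: bool
    workaround_available: bool


def assess_severity(impact: Impact) -> Severity:
    # The thresholds below are illustrative assumptions only.
    if impact.core_service_down or impact.customers_affected_pct >= 50.0:
        return Severity.SEV1
    if impact.customers_affected_pct >= 5.0 or not impact.workaround_available:
        return Severity.SEV2
    return Severity.SEV3
```

A responder could call `assess_severity(Impact(10.0, False, True))` and use the result to decide who else to page and how broadly to communicate.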

4. Communicate with customers

We aim to communicate with internal and external stakeholders as soon as possible. Communicating quickly and accurately helps build trust with customers and the rest of the organization.

5. Escalate to the right responders

Initial responders often need to bring other teams into the incident by paging them using an alerting tool like Opsgenie.

6. Delegate incident response roles

As additional team members join the response, the incident manager delegates a role to them.

7. Resolve the incident

An incident is resolved when the current or imminent business impact has ended. At that point, the emergency response process ends and the team moves on to any cleanup tasks and the postmortem.
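The seven steps above can be sketched as a simple incident record that keeps a timeline of what has been done. This is only an illustration: the step names, the `Incident` class, and its methods are hypothetical and do not reflect any Atlassian tool. Because steps can overlap or repeat in a real incident, the sketch records them without enforcing a strict order.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical short names for the seven steps described above.
STEPS = (
    "detect",
    "open_channels",
    "assess",
    "communicate",
    "escalate",
    "delegate",
    "resolve",
)


@dataclass
class Incident:
    title: str
    # Each entry is (step, UTC timestamp, free-text note).
    timeline: list = field(default_factory=list)

    def record(self, step: str, note: str = "") -> None:
        """Append a step to the timeline, rejecting unknown step names."""
        if step not in STEPS:
            raise ValueError(f"unknown step: {step}")
        self.timeline.append((step, datetime.now(timezone.utc), note))

    @property
    def resolved(self) -> bool:
        return any(step == "resolve" for step, _, _ in self.timeline)

    def pending_steps(self) -> list:
        """Return the steps not yet recorded, in their listed order."""
        done = {step for step, _, _ in self.timeline}
        return [s for s in STEPS if s not in done]
```

A timestamped timeline like this also gives the postmortem a starting record of what happened and when.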

The NIST incident response lifecycle

Another industry-standard incident response lifecycle comes from the National Institute of Standards and Technology, or NIST. NIST is a U.S. government agency that sets standards and practices around topics like incident response and cybersecurity, and it proudly proclaims itself “one of the nation’s oldest physical science laboratories”. It works on all things technology, including cybersecurity, where its incident response steps have become one of the two industry-standard go-tos.

Like Atlassian, NIST believes that not every incident can be prevented. So it’s best to be prepared:

“Preventive activities based on the results of risk assessments can lower the number of incidents, but not all incidents can be prevented. An incident response capability is therefore necessary for rapidly detecting incidents, minimizing loss and destruction, mitigating the weaknesses that were exploited, and restoring IT services.” — NIST

The NIST incident response lifecycle breaks incident response down into four main phases: Preparation; Detection and Analysis; Containment, Eradication, and Recovery; and Post-Incident Activity.

Phase 1: Preparation

The Preparation phase covers the work an organization does to get ready for incident response, including establishing the right tools and resources and training the team. This phase includes work done to prevent incidents from happening.

Phase 2: Detection and Analysis

According to NIST, accurately detecting and assessing incidents is the most difficult part of incident response for many organizations.

Phase 3: Containment, Eradication, and Recovery

This phase focuses on keeping the incident impact as small as possible and mitigating service disruptions.

Phase 4: Post-Incident Activity

Learning and improving after an incident is one of the most important parts of incident response, and also the one most often ignored. In this phase, the incident and the response efforts are analyzed. The goals are to limit the chances of the incident happening again and to identify ways of improving future incident response activity.
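To make the cyclical shape of these four phases concrete, here is a simplified Python sketch in which the last phase feeds back into the first. The `Phase` enum and `next_phase` helper are hypothetical names chosen for this example, and the model deliberately ignores the fact that teams often move back and forth between detection and containment during a real incident.

```python
from enum import Enum


class Phase(Enum):
    PREPARATION = "Preparation"
    DETECTION_AND_ANALYSIS = "Detection and Analysis"
    CONTAINMENT_ERADICATION_RECOVERY = "Containment, Eradication, and Recovery"
    # NIST SP 800-61 names the final phase "Post-Incident Activity".
    POST_INCIDENT_ACTIVITY = "Post-Incident Activity"


# Enum members iterate in definition order, which matches the phase order.
_ORDER = list(Phase)


def next_phase(phase: Phase) -> Phase:
    """Return the following phase, wrapping from the last back to the first."""
    i = _ORDER.index(phase)
    return _ORDER[(i + 1) % len(_ORDER)]
```

The wrap-around is the point of the sketch: what the team learns after an incident becomes input to the next round of preparation.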

Incident response for modern DevOps teams

Over the past decade, the DevOps movement has helped teams reshape how they build, deploy, and operate software. Along with that have come innovations in how these teams respond to incidents.

The DevOps approach to managing incidents isn’t dramatically different from the traditional steps to effective incident management. What it adds is an explicit emphasis on involving developer teams from the beginning, including on call, and on assigning work based on expertise, not job titles.

Incident response and continuous improvement

We started the article by talking about cycles vs. straight lines. You’ll notice something all these incident management approaches have in common: they are not linear. Each of them includes the same basic components: ways of defining, detecting, and identifying incidents; ways of quickly responding and taking action to mitigate incidents; and ways of analyzing incidents to improve future detection and response. There is no point in analyzing an incident that has already happened just for the sake of that incident. You can’t go back in time and change what happened. You’re learning from the incident to improve future detection and response. Constant, continuous learning and improvement is how teams close that cycle.
