Service continuity management - planning for disaster-level incident response and recovery

What is IT service continuity management?

IT service continuity management (ITSCM) is a key component of ITIL service delivery. It focuses on planning for incident prevention, prediction, and management with the goal of maintaining service availability and performance at the highest possible levels before, during, and after a disaster-level incident.

The goal of ITSCM is to reduce the downtime, costs, and business impact of incidents by putting effective, standardized processes in place for when those incidents do inevitably occur.

Without a plan, many factors can slow, or even stop, incident recovery. Your on-call expert might be responding bleary-eyed at 3 a.m. They might be out of touch with the code after working on something else for weeks or months. They might panic at the scale of the disaster-level incident. Or they might be the newest member of the disaster recovery team, with less experience resolving issues.

Having a well-documented, clear plan for service continuity management will help minimize any delays caused by learning curves, time away from the code, disaster panic, or midnight alerts.

ITSCM and ITIL 4

In ITIL 4, service continuity management is a practice meant to support business continuity management (BCM). The goal of the practice is to make sure services are back up and running within the agreed-upon business timelines after major service disruptions.

ITSCM vs. incident management

ITIL 4 makes a distinction between incident management—which handles incidents at a variety of impact levels—and ITSCM, which is about planning for large-scale disasters.

So, what exactly constitutes a disaster? The answer may be different for each business, but the Business Continuity Institute defines it as: “A sudden unplanned event that causes great damage or serious loss to an organization. It results in an organization failing to provide critical business functions for some predetermined minimum period of time.”

The scale of what counts as a disaster, the predetermined minimum time, and the definition of critical business functions are three things each business will need to define and document for itself.
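To make those three definitions concrete, here is a minimal Python sketch of how a business might write them down and check an incident against them. Everything in it is an assumption for illustration: the `DisasterPolicy` class, the `is_disaster` function, the field names, and the example thresholds are hypothetical and not part of ITIL or any Atlassian tooling.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import FrozenSet, Set


@dataclass(frozen=True)
class DisasterPolicy:
    """Hypothetical, per-business definition of a disaster-level incident."""
    critical_functions: FrozenSet[str]  # functions the business cannot operate without
    minimum_outage: timedelta           # the "predetermined minimum period of time"
    minimum_affected_share: float       # scale: share of customers affected, 0.0 to 1.0


def is_disaster(
    policy: DisasterPolicy,
    affected_functions: Set[str],
    outage: timedelta,
    affected_share: float,
) -> bool:
    """Return True only when all three documented thresholds are met."""
    hits_critical = bool(policy.critical_functions & affected_functions)
    return (
        hits_critical
        and outage >= policy.minimum_outage
        and affected_share >= policy.minimum_affected_share
    )


# Example values are made up; each business would document its own.
policy = DisasterPolicy(
    critical_functions=frozenset({"checkout", "login"}),
    minimum_outage=timedelta(hours=4),
    minimum_affected_share=0.25,
)
```

With this policy, a five-hour checkout outage affecting half of all customers would count as a disaster, while the same outage on a non-critical function would be handled through regular incident management.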

ITSCM and business continuity management (BCM)

Business continuity management is a process managed outside IT that identifies risks to the business and works to mitigate those risks. Some risks may be IT-related, including disaster-level incidents, and some risks may be outside IT control, such as natural disasters or facility fires.

Since BCM encompasses ITSCM as well as other risk-mitigation processes, it makes sense for IT teams to work closely with the BCM team to create:

  • A business continuity plan (BCP) that includes plans for prevention and recovery from disaster-level IT incidents
  • Business impact analyses (BIA) that identify the potential business impact of an IT disaster

ITSCM objectives

From a business perspective, the goal of ITSCM is to reduce the downtime, costs, and business impact of disaster-level incidents. On a more tactical level, objectives include:

  • Working closely with BCM to protect overall business continuity
  • Creating and managing plans for IT service continuity and recovery in case of disaster
  • Working with vendors to minimize the impact of any downtime in their products and services, as it relates to the business
  • Analyzing risk and impact and revising plans accordingly over time
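One common way to approach the last objective is a simple risk register that scores each risk by likelihood and impact. The sketch below assumes a 1-to-5 scale for both and multiplies them; the `Risk` type, the `rank_risks` function, the scale, and the sample entries are all hypothetical, and your BCM team may well use a different scoring model.

```python
from typing import List, NamedTuple


class Risk(NamedTuple):
    name: str
    likelihood: int  # assumed scale: 1 (rare) to 5 (almost certain)
    impact: int      # assumed scale: 1 (minor) to 5 (severe)


def rank_risks(risks: List[Risk]) -> List[Risk]:
    """Sort risks so the highest likelihood x impact score comes first."""
    return sorted(risks, key=lambda r: r.likelihood * r.impact, reverse=True)


# Sample entries are invented for illustration only.
register = [
    Risk("Vendor API outage", likelihood=4, impact=2),
    Risk("Data center fire", likelihood=1, impact=5),
    Risk("Database corruption", likelihood=2, impact=5),
]
ranked = rank_risks(register)
```

Re-scoring the register on a regular schedule gives the team a prompt to revise plans as likelihood or impact changes.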

The ITSCM process

Here at Atlassian, our own continuity plan is built on the assumption that the process of disaster planning is ongoing, leadership-driven, and thoroughly tested. We are determined to not #@!% our customers. Our process includes planning, clear responsibilities, communication, testing, and continuous improvement.

Planning

The planning process starts with asking high-level questions and then building a plan based on your answers. Starting questions should include:

  • What is our incident response?
  • What are the values we’ll follow?
  • What kinds of disasters do we need to plan for? What are the risks and threats inherent to our business?
  • What systems do we need to support? Which are critical?
  • How will we respond in case of each disaster?
  • Where is the information we’ll need to support and restore critical systems?
  • How can we centralize that information and simplify restoration processes?
  • Are the information and process documentation collaborative and reviewable by the teams who will be managing them?

Once you have answers to these questions, the next step is to use those answers to define:

  • Policies for disaster recovery
  • Scope of IT responsibilities
  • Scope of business impact of each risk
  • Plans and processes for each risk scenario
  • Personnel and documentation requirements

The key to a successful ITSCM planning phase is documenting and templatizing the resulting plan to make it clear and repeatable.
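As one illustration of templatizing, a plan can be captured as a structured record with one field per item in the list above, so that gaps are easy to spot. This is a sketch under assumptions: the `ContinuityPlan` class, its field names, and the sample scenario are hypothetical, and many teams would keep the same template in a wiki page or document rather than in code.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ContinuityPlan:
    """Hypothetical template: one record per risk scenario."""
    risk_scenario: str
    recovery_policy: str = ""
    it_responsibilities: List[str] = field(default_factory=list)
    business_impact: str = ""
    recovery_steps: List[str] = field(default_factory=list)
    personnel: List[str] = field(default_factory=list)
    documentation_links: List[str] = field(default_factory=list)

    def missing_sections(self) -> List[str]:
        """Return the names of sections that are still empty."""
        return [name for name, value in vars(self).items() if not value]


# Sample content is invented for illustration only.
draft = ContinuityPlan(
    risk_scenario="Primary data center outage",
    recovery_policy="Fail over to the secondary region",
    recovery_steps=["Promote replica database", "Redirect traffic"],
)
gaps = draft.missing_sections()
```

Here `gaps` lists the sections the draft still needs before review, which keeps every scenario's plan in the same repeatable shape.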

Clear responsibilities

Who’s responsible in case of disaster? Who’s responsible for maintaining and updating plans, processes, and documentation? ITSCM should always have a clear sense of roles and responsibilities not only for disasters themselves, but for ongoing monitoring and improvement.

At Atlassian, part of our approach is to have regular disaster recovery meetings with our site reliability engineers and our risk and compliance team. They discuss gaps in disaster recovery and identify where additional plans, improvements, assessments, or changes need to be made.

Communication

Openness is a core value at Atlassian and we believe the more informed your organization is about your ITSCM plans, the more effective those plans will be.

Communication keeps stakeholders on board and helps the C-suite stave off panic during a disaster-level incident. It also allows the team to reach out to other teams for help if needed and reduces the risk of friction caused by organizational confusion.

Testing

How do you know if your plans work unless you test them? This is a foundational question for ITSCM and the reason that testing and incident management drills are vital to the success of the practice.

Testing can help you identify weak points in your process, unforeseen issues, and where teams may need re-training or better documentation.

Assess and improve

ITSCM is not a one-and-done process. It requires thoughtful planning up front and ongoing training, assessment, and improvement. That’s why we have regular disaster recovery meetings. It’s why we test system backups and run drills on what happens in case of a data center outage or AWS region failure. And it’s why any ITSCM plan worth its salt is a continually monitored, ever-changing thing.
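A backup test can be as simple as restoring a file and confirming it matches the original. The sketch below compares SHA-256 checksums using Python's standard library; the function names `sha256_of` and `restore_matches_source` are hypothetical, and a real drill would also cover restoring databases and whole services, not just single files.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_matches_source(source: Path, restored: Path) -> bool:
    """Return True when the restored file is byte-for-byte identical to the source."""
    return sha256_of(source) == sha256_of(restored)
```

Running a check like this as part of a scheduled drill turns "we have backups" into "we have restored from backups recently, and the result was verified."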

Most companies represent the ITSCM process as a series of steps, but we think it’s more like a circle. Planning should lead to defined roles and responsibilities. From there, the team should communicate across the organization, test and test again, and then assess, monitor, and improve. Those improvements feed back into the cycle: updating the plan, further defining roles, and continuing to communicate.

ITSCM roles and responsibilities

In order to effectively plan and implement ITSCM practices across the organization, many businesses appoint a Service Continuity Manager and a Service Continuity Recovery Team.

Service Continuity Manager (SCM)

As the name suggests, the Service Continuity Manager is responsible for overseeing service continuity. This person typically owns the process from A to Z, leading plan development, managing ongoing monitoring and assessment activities, and overseeing plans in action in case of disaster.

This person is typically an experienced, senior-level technical support professional, but may be in a management role and not directly involved with the tech day to day.

Service Continuity Recovery Team

Led by the SCM, this team is responsible for running tests and incident drills and continually improving ITSCM. The team typically includes technical staff, QA professionals or users for testing, and representatives from departments across the organization who are responsible for keeping lines of communication open between ITSCM and their teams.

Why does ITSCM matter?

Organizations with clear plans for disaster recovery will recover more quickly and more fully in case of disasters.

ITSCM isn’t about planning for everyday outages. It’s about addressing worst-case scenarios and ensuring that if they happen, they cause minimal disruption to the lives of customers and employees.

Here are three clear benefits of a good ITSCM practice:

  • If disaster strikes, a good ITSCM plan means essential services will be back up and running quickly.
  • The organization is always prepared for a major disaster and can react quickly and appropriately.
  • Everyone across the business understands what will happen in case of disaster and how long they can expect systems to be down.