Incident management for high-velocity teams
The language of incident management
A glossary for incident management teams
Language used across the tech ecosystem is dynamic, to say the least. Nowhere else can you find a mix of technical jargon seamlessly intertwined with references from science fiction, mythology, pop culture, history, and literature. While this makes conversations colorful and engaging, it also often makes them tough to pin down.
When nothing urgent is going on, this works. But when incidents happen and severity levels shoot upward, we need our language to be technically precise and actionable, and to leave no room for misinterpretation.
This means that when it comes to incident management, we need a clear set of definitions to keep people on the same page.
Incident acknowledgement (ack)
After an incident alert is generated, a user can acknowledge (or “ack”) an alert in most on-call alerting tools. This means that the user has taken responsibility for the issue and is working to resolve it.
An actionable alert is an alert that clearly describes an issue and its impact and is routed to the right people at the right time so that the team can immediately take action.
Systems equipped with active monitoring are regularly checked or automatically monitored with software for any performance changes that might lead to incidents.
After-action review (AAR)
An after-action review is a structured review process that takes place after an event. The process typically describes what happened in detail, attempts to identify why it happened, and pinpoints areas for improvement to prevent the same or similar events in the future. After-action reviews are also commonly known as postmortems or post-incident reviews.
Agreed service time (AST)
Agreed service time is the amount of time, usually measured in hours per year, that a service is expected to be available. This agreement is usually outlined in an SLA (service level agreement) between vendor and client. High availability services typically promise 99.99% uptime, which allows for less than an hour of downtime per year.
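The downtime figure above follows from simple arithmetic. As a minimal sketch (the function name and the 365-day year are assumptions made for illustration), the downtime budget implied by an availability target can be computed like this:

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years are ignored in this sketch


def allowed_downtime_minutes(availability_percent: float,
                             period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the downtime budget, in minutes, for an availability target."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * period_hours * 60


for target in (99.0, 99.9, 99.99):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% uptime allows about {minutes:.1f} minutes of downtime per year")
```

At 99.99% the budget works out to roughly 52.6 minutes per year, which is consistent with "less than an hour."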
An alarm or warning generated when monitoring tools identify changes, high-risk actions, or failures in the IT environment.
Alert noise occurs when an overwhelming number of alerts are created in a short time, making it difficult for responders to accurately identify which services are affected and how to prioritize their work. Alert noise can be a contributing factor in alert fatigue.
Alert fatigue occurs when incident responders become overwhelmed by the volume or frequency of alerts. Alert fatigue often leads to slow responses—or no response—as responders tend to normalize the constant alerts.
A service that is expected to run continuously.
Components of any system or network that have business value. Asset management is when a worker or team takes stock of those components to understand the impact of an update or the removal of a system.
A formal examination of the availability and use of a system or process, as well as whether policies, guidelines, and best practices are being followed.
When a product or system is available and functioning as expected. Also known as system uptime.
The practice of restoring a service to a previous reliable state or baseline. This is typically a quick fix applied when an update or release breaks something essential in a system.
A stored copy of data or a redundant system available for use in case the original is compromised or lost.
A reference point for expected behavior. Baselines help teams measure changes and improvements.
A reference point that functions like a baseline to measure progress or compare results. For example, if the standard in our industry is 99.99% uptime, that may be a benchmark we use to measure ourselves against the competition and customer expectations.
An unintentional problem in software, code, programs, etc. that may cause abnormal behavior or failure.
Business impact analysis (BIA)
A business impact analysis is the systematic evaluation of the potential impact of service disruptions and downtime due to a major incident. The BIA’s goal is to understand the effect each service has on the business and define requirements for recovery in case of an incident.
The maximum amount of information that can be transferred between networks or delivered via a service. Exceeding capacity is a common indicator for incidents.
Any alteration made to an IT service, configuration, network, or process. Often tracked in a practice known as change management.
A comprehensive record of changes made to an IT service, configuration, network, or process from the beginning of its lifecycle to current state.
An IT practice focused on minimizing disruptions during changes/updates to critical systems and services. For some teams, this practice encompasses all aspects of change—from the technical to the people and process side of things. For other teams—based on the ITIL 4 guidelines—change management focuses on managing the human or cultural aspects of change, while another practice called change control focuses on risk assessment, schedules, and change authorization.
The practice of using chat and collaboration tools for incident management. As Atlassian’s Sean Regan explains:
“ChatOps is a collaboration model that connects people, tools, process, and automation into a transparent workflow. This flow connects the work needed, the work happening, and the work done in a persistent location staffed by the people, bots, and related tools.”
An incident is in a closed state when all necessary actions have been taken and an issue is closed.
Cold standby (gradual recovery)
A cold standby is used when a system acts as a backup for another system. If the primary system fails, the cold standby replaces the primary system while it is being fixed. This is a particularly useful strategy if the primary system failure requires a gradual recovery (a recovery that may take weeks) in the event that computing hardware needs to be replaced and set up.
A cold start occurs when an application that isn’t running takes longer to start up than an application that’s “warm” or already running.
The team member in charge of communication during an incident.
Alignment with regulations. Often, monitoring systems will be programmed to monitor for compliance issues and trigger alerts if a system falls out of compliance.
Component failure impact analysis (CFIA)
The process of determining the impact on a service if one component or configuration stops working as expected.
The measure of how many of the same actions are happening simultaneously within a system. For example: How many users are accessing the same operation or performing the same transaction?
Procedures and policies that manage risk, ensure a product or service operates as expected, and protect compliance.
A service that serves a central function for users/customers.
A specific reactive action taken to protect a system or restore operations.
Services that customers use and interact with.
A decision-making construct that has been adapted to incident management processes to help managers organize the most effective response. The framework divides situations into five categories based on the complexity of an incident, and each category has its own (different) set of next steps.
A single-screen visualization of systems, alerts, and incidents designed to organize the presentation of information from a variety of tools with contextual information provided in a clean, precise format.
The relationship between two services, processes, or configurations that rely on one another to function.
When a feature or tool is taken out of service, is no longer in use, or is no longer being updated.
The process and result of understanding an incident and its root cause.
The symptoms or signs that lead to an incident diagnosis.
Time when a service is not performing or available as expected.
An update or patch deployed rapidly, usually as a part of incident resolution. Emergency changes often skip change approval processes because the risk of waiting for approvals is greater than the risk of deploying the change.
A service necessary for a core service to work, but that is not offered to customers outright itself.
The infrastructure where a service, feature, process, configuration item, etc. is tested for expected functionality. This environment is controlled closely to mirror production.
The infrastructure where a service is delivered to a customer. The deliverables in this environment are live, and it is sometimes also referred to as the live environment.
A mistake that causes the failure of a configuration item or service. This can be a mistake in design, processing, or human error.
The process of moving an incident management assignment to a team or individual with more relevant skills or experience. Functional escalation is when an alert or incident is transferred to an individual or team with more expertise. Hierarchical escalation is when said alert or incident is transferred from a junior person to a senior person.
A notable system or service situation. Events are usually caused by either user action or an incident.
A report generated when key performance indicators (KPIs) are exceeding their thresholds or not meeting expectations.
Fault tolerance describes a service’s ability to continue operating even if a configuration item or individual part fails.
Fault tree analysis
A technique used to determine the events that led to an incident and predict what events might lead to incidents in the future. It’s often used to find the root cause of a major incident.
The responder expected to react first to an incident. This is typically the person on call.
An action or method of repair.
A fixed asset is a physical, valued, long-term part of the business, such as an office, computer, or license.
A method of customer support or incident management that rotates on-call responsibilities across time zones to provide 24/7 coverage without requiring teams to be on call in the middle of the night.
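One way to picture this rotation is as a lookup from the current UTC hour to the region whose working day covers it. The sketch below is illustrative only: the three regions and their eight-hour windows are hypothetical, and real schedules usually include handoff overlap.

```python
# Hypothetical shift windows in UTC: (start_hour, end_hour, region).
SHIFTS = [
    (0, 8, "APAC"),
    (8, 16, "EMEA"),
    (16, 24, "AMER"),
]


def on_call_region(utc_hour: int) -> str:
    """Return the region that is on call for a given UTC hour (0-23)."""
    for start, end, region in SHIFTS:
        if start <= utc_hour < end:
            return region
    raise ValueError(f"hour out of range: {utc_hour}")


print(on_call_region(3))   # APAC
print(on_call_region(12))  # EMEA
print(on_call_region(22))  # AMER
```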
A scientific, evidence-based investigation into a computer system for the purpose of identifying the cause of an incident.
A service is described as functional when it is able to perform as expected.
A gradual recovery is a recovery process that takes longer than usual (weeks, not hours). When this happens, typically a cold standby (backup system) will be brought online to take the place of the affected system.
A hot standby is a recovery option where redundant assets run simultaneously to support an IT service in case of failure. If the active system fails, the hot standby is already running and ready to take its place with no action required by the team and no downtime. Also known as immediate recovery.
An update applied to software to solve a problem or fix a bug. This is often used to fix a customer-reported issue.
The measurement of cost—of money, time, reputation—that a service disruption, incident, or change causes. Also known as the cost of downtime.
An alert that does not empower a responder to take action. This often means the alert lacks contextual information, has been routed to the wrong person, or has an unclear scope. Inactionable alerts can contribute to alert fatigue.
An event that causes disruption to or a reduction in the quality of a service that requires an emergency response. Teams who follow ITIL or ITSM practices may use the term major incident for this instead.
How teams react to an incident. Usually, incident response is a pre-set process with rules, roles, and best practices defined before an incident arises.
The process used by DevOps and IT Operations teams to respond to an unplanned event or service interruption and restore the service to its operational state.
The incident commander is a member of the IT or DevOps teams who is responsible for managing incident response. The commander is the head of the incident management team and has ultimate control and final say on all incident decisions. This role is also often called incident manager.
The life of an incident from creation and detection to resolution.
A collection of metrics that measure input and output. Common metrics in this category include IO Wait (the time a CPU waits for an IO request) and IOPS (the number of IO requests per second).
Incident response orchestration
An Opsgenie feature that lets teams quickly and effectively identify problems, notify the right people, facilitate communication across business units, and collaborate across teams for incident management.
A record of the details of and processes used during a specific incident.
Individuals and/or teams responsible for the investigation and resolution of an incident.
Individuals who need to be kept in the loop on an incident because it impacts their job/ability to do their job. These individuals may or may not influence incident resolution, but they are not active responders.
Also known as warm standby, this type of recovery typically takes 24 to 72 hours. Data restoration and/or hardware and software configuration are usually the reason for the relatively long recovery time.
Information Technology Infrastructure Library (ITIL)
A documented set of widely accepted best practices for IT services.
Information Technology Service Management (ITSM)
All aspects of the processes and procedures required to deliver an IT service to customers. This includes all aspects of the service lifecycle – from design to delivery to incident management.
Kepner Tregoe method (KT method)
A root cause analysis and decision-making method where problems are evaluated separately from the final decision about an issue.
Key performance indicators (KPIs)
Measurements of success for systems or products. KPIs are decided in advance, tracked regularly, and often generate alerts if they deviate from their expected thresholds. For example, if your mean time between failures (MTBF) starts getting shorter and shorter, an alert may be generated so that your team can identify and look into the problem.
A preexisting issue that already has a workaround.
A delay experienced during the transfer of data.
The records of all events related to a service or application. This includes data transferred, times and dates, incidents, changes, errors, etc.
The measure of how easily changes can be applied successfully to a service or feature.
A solution implemented manually (as opposed to automatically).
Mean Time Between Failures (MTBF)
The average time between repairable failures of a technology product. This is also known as mean time between service incidents (MTBSI).
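As a rough sketch, MTBF is often calculated as total operating time divided by the number of failures in an observation window. That convention, and the function and parameter names below, are assumptions for illustration; some teams measure the gap between failure start times instead.

```python
from datetime import timedelta


def mean_time_between_failures(observed: timedelta,
                               downtimes: list[timedelta]) -> timedelta:
    """Operating time in the window divided by the number of failures."""
    if not downtimes:
        raise ValueError("no failures recorded in the observation window")
    uptime = observed - sum(downtimes, timedelta())
    return uptime / len(downtimes)


# A 30-day window with two outages lasting 2 hours and 4 hours.
mtbf = mean_time_between_failures(
    timedelta(days=30),
    [timedelta(hours=2), timedelta(hours=4)],
)
print(mtbf)  # 14 days, 21:00:00
```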
Mean Time to Acknowledge (MTTA)
The average time it takes from when an alert is triggered to when work begins on the issue.
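A minimal sketch of the calculation, assuming each alert is recorded as a pair of timestamps (the `triggered`/`acked` names are illustrative, not taken from any particular tool):

```python
from datetime import datetime, timedelta


def mean_time_to_acknowledge(
        alerts: list[tuple[datetime, datetime]]) -> timedelta:
    """Average delay between an alert firing and its acknowledgement.

    Each pair is (triggered_at, acknowledged_at).
    """
    if not alerts:
        raise ValueError("no alerts to average")
    delays = [acked - triggered for triggered, acked in alerts]
    return sum(delays, timedelta()) / len(delays)


sample = [
    (datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 9, 4)),
    (datetime(2024, 1, 2, 14, 0), datetime(2024, 1, 2, 14, 10)),
]
print(mean_time_to_acknowledge(sample))  # 0:07:00
```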
Mean Time to Failure (MTTF)
The average time between non-repairable failures of a technology product.
Mean Time to Repair (MTTR)
The average time it takes to repair a system (usually technical or mechanical). This includes both the repair time and any testing time.
Mean Time to Recovery (MTTR)
The average time it takes to recover from a product or system failure. This includes the full time of the outage—from the time the system or product fails to the time that it becomes fully operational again.
Mean Time to Resolve (MTTR)
The average time it takes to fully resolve a failure, including time spent ensuring the failure won’t happen again.
Mean Time to Respond (MTTR)
The average time it takes to recover from a product or system failure from the time when you are first alerted to that failure. This does not include any lag time in your alert system.
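Because the four MTTR metrics share an acronym, it can help to see which timestamps each one spans. The sketch below assumes an incident record with five hypothetical timestamp fields. The field names, and the choice to treat `restored_at` as the end of repair and testing, are illustrative assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    failed_at: datetime          # the system or product fails
    alerted_at: datetime         # the team is first alerted
    repair_started_at: datetime  # hands-on repair begins
    restored_at: datetime        # repaired, tested, and operational again
    resolved_at: datetime        # follow-up to prevent recurrence is done


def _mean(durations) -> timedelta:
    durations = list(durations)
    return sum(durations, timedelta()) / len(durations)


def mttr_variants(incidents: list[Incident]) -> dict[str, timedelta]:
    """Compute the four MTTR metrics over a non-empty list of incidents."""
    return {
        "repair": _mean(i.restored_at - i.repair_started_at for i in incidents),
        "recovery": _mean(i.restored_at - i.failed_at for i in incidents),
        "resolve": _mean(i.resolved_at - i.failed_at for i in incidents),
        "respond": _mean(i.restored_at - i.alerted_at for i in incidents),
    }


incident = Incident(
    failed_at=datetime(2024, 1, 1, 10, 0),
    alerted_at=datetime(2024, 1, 1, 10, 5),
    repair_started_at=datetime(2024, 1, 1, 10, 15),
    restored_at=datetime(2024, 1, 1, 11, 0),
    resolved_at=datetime(2024, 1, 1, 13, 0),
)
for name, value in mttr_variants([incident]).items():
    print(f"mean time to {name}: {value}")
```

With this single incident, the four values are 45 minutes (repair), 1 hour (recovery), 3 hours (resolve), and 55 minutes (respond), which shows why it matters to say which MTTR is meant.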
A representation of an actual system, service, application, etc.
The repeated process of checking a service or process to make sure it is functioning as expected.
A non-emergency change that doesn’t have a defined, pre-approved process.
A schedule that ensures that the right person is always available, day or night, to quickly respond to incidents and outages. On-call schedules are common in both medicine and tech.
The physical location where IT service monitoring takes place.
The person responsible for overseeing daily operations. In some cases, this person may also be the incident manager (or incident commander), responsible for leading incident resolution.
The result of an IT-related event, process, or change. Teams often talk about both anticipated outcomes and actual outcomes.
Pain value analysis
An analysis used to identify the business impact of an incident. This usually factors in the cost of downtime, duration of an incident, impact on users, and number of users affected.
When service functionality is automatically monitored (rather than actively or manually monitored).
When services and operations are functioning as expected without any disruption.
A measure of how much the performance of a system has decreased due to an event or incident.
The period of time when an IT service is intentionally unavailable for the purpose of maintenance or updates.
A collection of “plays” or specific actions a team can take to address a specific problem, incident, or goal.
Postmortem/post-incident analysis/post-incident review
The process of understanding an incident after it has been resolved. The goal of a postmortem is to improve response processes, prevent future incidents, and understand the cause of the most recent incident.
The order in which incidents should be addressed. High priority items require more urgency than lower-priority items. Priority is determined by urgency, severity, and potential impact on the business.
A problem record is a document that covers every aspect of an issue – from detection to resolution.
Projected service outage
A document outlining how future maintenance or testing will impact normal service levels.
The process of testing to ensure standards are met for anything IT-related – from new features to how-to guides.
Quality management system
The framework or systems in place to provide quality assurance.
Monitoring that is done in reaction to an event or incident.
The process of returning a service to baseline functionality and health.
Recovery point objective
The maximum data loss allowed during recovery.
Recovery time objective
The maximum time tolerated for a service interruption.
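Both objectives are commonly expressed as durations, which makes them easy to check after an incident. The sketch below assumes that convention. It measures data loss as the time since the last good backup, and all function and parameter names are hypothetical.

```python
from datetime import datetime, timedelta


def meets_rpo(last_backup_at: datetime, failed_at: datetime,
              rpo: timedelta) -> bool:
    """True if the data lost since the last backup is within the RPO."""
    return failed_at - last_backup_at <= rpo


def meets_rto(failed_at: datetime, restored_at: datetime,
              rto: timedelta) -> bool:
    """True if the service interruption lasted no longer than the RTO."""
    return restored_at - failed_at <= rto


failed = datetime(2024, 3, 1, 12, 0)
print(meets_rpo(datetime(2024, 3, 1, 11, 30), failed, timedelta(hours=1)))  # True
print(meets_rto(failed, datetime(2024, 3, 1, 17, 0), timedelta(hours=4)))   # False
```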
A change deployed to users.
The planning, design, testing, scheduling, troubleshooting, and deployment of changes.
A system’s ability to resist failure and recover quickly in the event of an incident.
The amount of time it takes from when an alert is generated to when an initial action is taken by the team.
The process of identifying an asset’s risk by assessing its value, potential threats, and the potential impact of those threats.
The process of handling threats by identifying and controlling them.
The root cause is typically thought of as the singular reason a service or application failed. However, there are often many interconnected factors that contribute to failures, so teams are starting to debate whether this term is helpful in incident management. Many have switched to the plural form: root causes.
Runbooks provide detailed procedures for incident management. These are typically maintained by a system administrator or network operations center (NOC) team. Runbooks can be digital or printed.
The extent of a problem, solution, project, capability, etc.
Second line support
People with additional capabilities—time, experience, knowledge, resources—to solve issues that may be beyond the ability of first responders.
Updates, fixes, deprecation, or other changes made to a service.
A team that takes customer support requests and serves as a point of contact between customers and IT.
Service failure analysis
Service failure analysis is the process of inspecting a service disruption to identify its cause.
Service Level Agreement (SLA)
An agreement between provider and client about measurable metrics like uptime, responsiveness, and responsibilities.
Service Level Agreement Monitoring (SLAM) chart
A document that records progress and data on service level targets.
Service Level Objectives (SLO)
An agreement within an SLA about a specific metric like uptime.
Severity (SEV) levels
The degree to which a service is affected by an incident. Typically, teams use a 3- to 5-tiered SEV level structure with 1 being the highest severity and 3 to 5 indicating lower severity issues that don’t require as much urgency.
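A three-tier structure can be modeled as an ordered enumeration in which the lowest number is the most severe. This is only an illustrative sketch: the tier names, the sample incidents, and the `triage_order` helper are assumptions, not a prescribed scheme.

```python
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1  # highest severity
    SEV2 = 2
    SEV3 = 3  # lowest severity in this three-tier example


def triage_order(incidents: list[tuple[str, Severity]]) -> list[str]:
    """Return incident names sorted from most to least severe."""
    return [name for name, _ in sorted(incidents, key=lambda item: item[1])]


open_incidents = [
    ("slow dashboard", Severity.SEV3),
    ("checkout outage", Severity.SEV1),
    ("delayed emails", Severity.SEV2),
]
print(triage_order(open_incidents))
# ['checkout outage', 'delayed emails', 'slow dashboard']
```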
Single point of failure
One variable that a system depends on in order to function. For example: an essential configuration item.
A formal record of requirements for an IT-related configuration.
Site reliability engineer (SRE)
A software engineer tasked with operations. SREs are typically responsible for automating manual tasks, managing SLOs, and managing incidents.
Low-risk, commonly repeated, pre-approved changes, such as adding memory or storage.
Inactive resources available to support incident management.
The current condition of a service.
A dedicated home for communicating the current condition of a service, with regular status updates on incidents.
Subject matter expert (SME)
An individual with specific knowledge on a particular issue, service, etc.
The programming languages, software, and components that make up an application. There are two sides to a tech stack: front-end (customer-facing) and back-end (developer-facing).
Data that, when one set or point is changed, negatively impacts other data points.
A pre-defined level or number that, when exceeded, generates an alert. For example, the threshold for a sign-in page to load might be three seconds. If the page starts taking longer to load, an alert will be generated.
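The sign-in page example can be sketched as a simple comparison against the threshold. The function name, message format, and sample measurements are hypothetical. Real monitoring tools typically also require the condition to hold for some time before alerting.

```python
LOAD_TIME_THRESHOLD_SECONDS = 3.0  # the three-second threshold from the example


def threshold_alerts(load_times: list[float],
                     threshold: float = LOAD_TIME_THRESHOLD_SECONDS) -> list[str]:
    """Return one alert message per measurement that exceeds the threshold."""
    return [
        f"ALERT: sign-in page took {seconds:.1f}s (threshold {threshold:.1f}s)"
        for seconds in load_times
        if seconds > threshold
    ]


for message in threshold_alerts([1.2, 3.0, 4.5]):
    print(message)
# ALERT: sign-in page took 4.5s (threshold 3.0s)
```

A measurement exactly at the threshold does not alert here, because the definition says the threshold must be exceeded.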
A comprehensive list of events, changes, fixes, outcomes, and when each happened during an incident.
An investigation into time-related patterns. Trend analysis assumes that past patterns can predict future patterns in the data. This makes it a valuable practice for incident prevention.
A successful way of implementing a quick fix that gets system functionality back up and running even if the underlying incident is not yet resolved.
The resources—both human and machine—needed to deliver an IT service.