Incident management for high-velocity teams

What can incident management teams learn from aviation?

Flying is widely regarded as the safest way to travel, and aviation has been aggressively improving its incident management for decades. In 1959, there were 40 fatal accidents for every million flights. A decade later, that number had dropped to two. Today, it’s 0.1.

Speaking generally, the stakes may be higher in aviation than in software (we’re probably less likely to die from an e-commerce outage than a plane equipment malfunction), but the day-to-day practice of incident prevention and management isn’t that different. Both industries manage risk, issue alerts and have to combat alert fatigue. Both industries need schedules that handle urgent round-the-clock needs. Both industries have incidents with varying severity levels. Both track KPIs religiously. And both are held accountable by the public and their customers.

Which is why tech can probably learn a thing or two from aviation’s uncompromising approach to improving its incident management and prevention. Here are five practices your team can steal from top aviation companies:

Design and launch with incident management in mind

In both aviation and tech, designing with incidents in mind can have a big impact on those incidents’ ultimate costs down the line. 

In aviation, the introduction of 16G seats in 1988 added protection against head and chest injuries and reduced the risk of passengers being trapped by seats deforming during a crash. The estimated benefit of these seats, in lives saved and injuries averted, totaled $78.9 million over 25 years. And all because of design that factors in the possibility of incidents.

In the tech world, we get a similar benefit from the rise of “you built it, you run it”—which merges the responsibilities of development and incident management. One of the positive outcomes of this approach is that the teams tasked with building the technology are more aware of incident risks and more likely to work to prevent them and minimize their impact.

Automate to reduce the chance of introducing error

Pilot error is listed as the most common cause of aviation disasters. For software and IT incidents, humans are frequently the target of blame. Automation can help in both camps, and has been proven—across many industries—to significantly reduce errors. And so it makes perfect sense that aviation is moving toward more automation every year. Already, autopilot does about 90% of the flying and fully automated options are being tested. 

The prevalence of human error is also why one of the big questions we at Atlassian ask in our postmortems is: Is there anything we can automate to prevent this from happening again? Because often an issue can be avoided with a simple technical fix.

One good example of this happened here at Atlassian a couple of years ago:

“An engineer made a big mistake with the syntax of a config file for a piece of critical equipment, and it took down the entire company for 45 minutes. If you quantify it, we’re talking hundreds of thousands of dollars... Humans make errors. There’s no getting around that. The question is how do we make it less possible for human error to happen?

“In the end, the simple, permanent fix was putting an automated ‘will it start’ check on the config file before loading, and eventually removing all human interaction with the system's configuration. The issue that caused the outage is now prevented by a quick technical fix.”
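A “will it start” check like the one described above could look something like the following. This is only a minimal sketch, not Atlassian’s actual fix: it assumes the config is JSON, and the `REQUIRED_KEYS` schema, the `will_it_start` and `deploy_config` function names, and the file path are all hypothetical.

```python
import json
from pathlib import Path

# Hypothetical schema: the keys and types this imaginary service needs to boot.
REQUIRED_KEYS = {"listen_address": str, "upstreams": list}


def will_it_start(config_text: str) -> list[str]:
    """Return the problems found; an empty list means the config looks loadable."""
    try:
        config = json.loads(config_text)
    except json.JSONDecodeError as exc:
        # A syntax mistake is caught here, before it ever reaches production.
        return [f"syntax error on line {exc.lineno}: {exc.msg}"]
    if not isinstance(config, dict):
        return ["top-level value must be a JSON object"]

    problems = []
    for key, expected_type in REQUIRED_KEYS.items():
        if key not in config:
            problems.append(f"missing required key: {key}")
        elif not isinstance(config[key], expected_type):
            problems.append(f"{key} must be of type {expected_type.__name__}")
    return problems


def deploy_config(candidate_text: str, live_path: Path) -> bool:
    """Replace the live config only if the candidate passes the check."""
    problems = will_it_start(candidate_text)
    if problems:
        for problem in problems:
            print(f"rejected: {problem}")
        return False  # the live file is left untouched
    live_path.write_text(candidate_text)
    return True
```

The design choice worth noting is that the check runs before the live file is written, so a bad candidate never replaces a working config.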

Clearly define priorities—and design alerts around them

If there’s one thing the aviation industry excels at, it’s ruthlessly narrowing priorities. Because the truth is, even in an emergency situation, some issues are more urgent than others. And when a plane is at risk of going down, you want your pilot to know, very clearly, which emergencies require their attention and in what order.

This is why, though the computer is tracking over 10,000 data points in a plane at any given time, only 10% of all flights have even a single alert go out to the pilot. Does the pilot need to know about that window de-icer changing from a high to a medium setting? Do they need to know one hydraulic pump failed and another has taken over, with no impact to the plane or its flight path? The answers, according to aviation experts, are no and no. 

When alerts are needed—in the case of an engine failure or cabin pressure issue—and do show up in the cockpit, their priority levels are very clear, indicated not only via visual cues like text and red lights, but also by audio and physical cues such as a shaking steering mechanism or voice warning.

The highest alert level, as you might expect, has the most cues. If your plane is about to do a nosedive, the pilot is going to get a red text message, red lights, a voice warning, and a shaking steering mechanism.

The next level down has everything listed above except the shaking stick. The next level down from that generates lights and a text message in yellow. And the level below that, which doesn’t require any pilot action, is a simple yellow text message on the screen. The result is a rigorous hierarchy that makes it simple for pilots to know what to pay attention to.
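To make the hierarchy concrete, here is a small sketch that encodes the four levels and their cues as described above. The level names (`LEVEL_1` through `LEVEL_4`) and the function names are placeholders chosen for illustration; only the cues themselves come from the description.

```python
from enum import IntEnum


class AlertLevel(IntEnum):
    # Placeholder names: the text describes the cues, not what each level is called.
    LEVEL_1 = 1  # most urgent
    LEVEL_2 = 2
    LEVEL_3 = 3
    LEVEL_4 = 4  # no pilot action required


# Each level maps to the cues the pilot receives.
CUES = {
    AlertLevel.LEVEL_1: ("red text", "red lights", "voice warning", "shaking stick"),
    AlertLevel.LEVEL_2: ("red text", "red lights", "voice warning"),
    AlertLevel.LEVEL_3: ("yellow text", "yellow lights"),
    AlertLevel.LEVEL_4: ("yellow text",),
}


def cues_for(level: AlertLevel) -> tuple[str, ...]:
    """Look up the cues that accompany an alert at the given level."""
    return CUES[level]


def hierarchy_is_strict() -> bool:
    """True if every level uses more cues than the level below it."""
    counts = [len(CUES[level]) for level in sorted(AlertLevel)]
    return all(higher > lower for higher, lower in zip(counts, counts[1:]))
```

A software team could, as one possible adaptation, swap the cockpit cues for its own channels (say, a phone call, a page, a chat message, and a dashboard entry) and keep a check like `hierarchy_is_strict()` in its tests so the loudest signals stay reserved for the most urgent level.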

Set your alert thresholds high

In addition to clearly indicating priority in their alerts, the aviation industry is very good at understanding what needs to be an alert—and what absolutely doesn’t. 

The top priority level is reserved for only the worst of emergencies—the kind of emergency where if the pilot doesn’t take immediate and definitive action, the plane is going down. 

The second tier of issues, known as warnings, also requires immediate pilot action, but these issues aren’t taking the plane down at precisely that moment. This tier includes things like loss of cabin pressure or a traffic conflict that puts a plane in danger of collision.

The third tier is a caution, which requires pilot awareness but not an instant reaction. And this is where aviation’s ruthless tier-setting becomes apparent. Because even an engine fire or single engine failure may only merit a caution.

This uncompromising approach to prioritization has helped aviation combat alert fatigue and keep passengers safer.
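Translated to software, the same tiering might be sketched as a classifier that decides whether an event deserves an alert at all. This is an illustration under stated assumptions, not a standard: the `Event` fields, the `classify` function, and the "emergency" tier name are hypothetical, while "warning" and "caution" follow the tiers described above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Event:
    name: str
    outage_imminent: bool = False         # fails without immediate, definitive action
    needs_immediate_action: bool = False  # urgent, but not failing right now
    needs_awareness: bool = False         # worth knowing, no instant reaction needed


def classify(event: Event) -> Optional[str]:
    """Return the alert tier for an event, or None if nobody should be alerted."""
    if event.outage_imminent:
        return "emergency"
    if event.needs_immediate_action:
        return "warning"
    if event.needs_awareness:
        return "caution"
    return None  # log it, but do not interrupt anyone


events = [
    Event("primary database unreachable", outage_imminent=True),
    Event("error rate climbing", needs_immediate_action=True),
    Event("one of three replicas restarted", needs_awareness=True),
    Event("standby node took over with no user impact"),
]
alerts = [(e.name, classify(e)) for e in events if classify(e) is not None]
```

The last event mirrors the hydraulic pump example: a redundant component took over, nothing was affected, and so no alert goes out.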

Have playbooks and checklists at the ready

When an alert sounds and the pilot learns that the air conditioning unit has gone down (which can lead to a drop in cabin pressure) or one of the engines is in jeopardy, the aviation industry does not rely on that pilot’s training alone to resolve the incident.

Because while the pilot’s training will come into play, it’s safer (not to mention quicker) to communicate next steps directly. This is why cockpit alerts come with a checklist of next steps, designed to match up with the specific alert. While not exactly automation, this approach has a similar benefit. Instead of relying entirely on someone’s training, the system spells out what’s most likely to fix an issue.
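In an on-call setting, pairing each alert with its checklist could be sketched as a simple lookup. Everything here is made up for illustration: the alert names, the checklist steps, and the `build_notification` function are assumptions rather than part of any particular tool.

```python
# Fallback steps for alerts that have no dedicated checklist yet (hypothetical).
GENERIC_CHECKLIST = ["Acknowledge the alert", "Escalate to the service owner"]

# Hypothetical alert names mapped to the steps most likely to resolve them.
CHECKLISTS = {
    "disk_nearly_full": [
        "Confirm which volume is affected",
        "Remove expired log archives",
        "Expand the volume if usage is still above the threshold",
    ],
    "certificate_expiring": [
        "Check the expiry date on the affected host",
        "Renew the certificate",
        "Reload the service and verify the new expiry date",
    ],
}


def build_notification(alert_name: str) -> str:
    """Return the alert text with its numbered checklist attached."""
    steps = CHECKLISTS.get(alert_name, GENERIC_CHECKLIST)
    lines = [f"ALERT: {alert_name}"]
    lines += [f"  {number}. {step}" for number, step in enumerate(steps, start=1)]
    return "\n".join(lines)
```

Calling `build_notification("disk_nearly_full")` produces the alert line followed by its three numbered steps, so the responder sees what to do next in the same message that woke them up.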

Aviation’s dedication to optimizing incident management practices sheds light on how other fields, including tech, can continuously refine their incident response and management.

Learn more about how Jira Service Management can help teams respond, resolve, and continuously improve after an incident occurs.