Incident management for high-velocity teams
The future of IT incident management, response, and prevention
In the past, the team tasked with responding to technology incidents was almost always IT. Often a team sitting in a network operations center, or NOC, monitored systems and responded to outages. A vendor might have built the software, but deploying and operating was the responsibility of the customer's IT Ops team. Today, with the proliferation of cloud services, the vendor builds the software and does the deploying and operating.
Yet incident management remains a core ITSM practice. And IT has a long history of developing guidelines, managing budgets, and carrying the full burden of diagnosing, fixing, documenting, and preventing major incidents.
Of course, as with anything in tech, the past is not necessarily a predictor of the future—and currently the practice of incident management is shifting. DevOps, SecOps, and architecture teams are getting more involved. New technologies and interconnected products have changed how we manage incidents. And mindsets, practices, and team structures are changing in order to keep up.
So, how is incident management shifting and what does that mean for the future of our roles, products, processes, and teams?
A move toward decentralization
Rewind five years and ask an IT team who was responsible for incident management. The answer you’d pretty much always get was “us.”
Ask the same question now and you’re likely to hear about not only IT, but also DevOps, SecOps, and architecture teams. Many organizations are slowly shifting toward the idea of “you built it, you run it.”
The clear benefits of this approach are that it takes pressure off the IT teams and speeds up response times by shifting responsibility to the people most familiar with the code. This minimizes downtime and maximizes team productivity. It also incentivizes good code. (If you’re the one waking up at 3 a.m. to resolve a bug, chances are you’ll be double and triple checking the code the next time it goes live to keep that 3 a.m. call from happening again.)
The challenge of this approach is that organizations still need some centralization. Leadership needs access to reports and documentation. Business stakeholders want updates. They want to see incident metrics like mean time to resolve and mean time to acknowledge. They expect clear incident updates, incident postmortem reports, and remediation work.
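As a rough illustration of the metrics mentioned above, here is a minimal sketch of how mean time to acknowledge and mean time to resolve could be calculated from incident timestamps. The `Incident` record, its field names, and the sample data are hypothetical and chosen only for illustration; real incident tooling will expose its own fields.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    # Hypothetical record shape; field names are illustrative only.
    opened: datetime        # when the alert fired
    acknowledged: datetime  # when a responder picked it up
    resolved: datetime      # when the incident was closed out


def mean_minutes(deltas):
    """Average a list of timedeltas and return the result in minutes."""
    return mean(d.total_seconds() for d in deltas) / 60


def mtta(incidents):
    """Mean time to acknowledge: alert fired -> responder acknowledged."""
    return mean_minutes([i.acknowledged - i.opened for i in incidents])


def mttr(incidents):
    """Mean time to resolve: alert fired -> incident resolved."""
    return mean_minutes([i.resolved - i.opened for i in incidents])


# Made-up sample data: two incidents that both start at 3 a.m.
start = datetime(2024, 1, 1, 3, 0)
incidents = [
    Incident(start, start + timedelta(minutes=4), start + timedelta(minutes=60)),
    Incident(start, start + timedelta(minutes=6), start + timedelta(minutes=120)),
]

print(mtta(incidents))  # 5.0 (minutes)
print(mttr(incidents))  # 90.0 (minutes)
```

Keeping the calculation in one shared place is one way decentralized teams could still report the same numbers to leadership, whichever team owns the incident.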
For many companies moving toward decentralization and doing it well, the answer to this challenge is technology that supports decentralization and team autonomy, which keeps incident resolution nimble, while still centralizing information to keep the business in the loop.
The slow road to decentralization
As with any big change that could disrupt workflows and surface unforeseen consequences, it makes sense that many organizations are taking on decentralization in baby steps.
They start by identifying a team that is a good cultural fit for a change like this and is managing a low-risk application or product. Then they move incident management for that team’s specific application or product to that team. They train them, implement an on-call schedule, and track their progress over time, asking questions like:
- Have they improved recovery times?
- What cultural barriers have they run up against?
- What tools did the IT team need to put in place?
- What processes did they need to communicate?
- Are better system updates coming out of that team?
- Has the number of incidents dropped?
- If we decide to roll this decentralization out to other teams, what can we take away from this initial test run?
These test cases work to provide a foundation for deciding whether to implement a “you built it, you support it” framework across the company and, if so, how to roll it out effectively across teams.
Decentralization means cross-team collaboration
This move toward decentralization also necessitates a move toward cross-team collaboration. If DevOps is involved in incident management, DevOps needs a seat at the table in IT incident management process meetings. If IT is still helping guide incident management practices, they need to be involved in postmortem reviews by other teams.
Each team brings their own strengths to the incident management table. IT teams are good at developing practices and documentation and following guidelines. DevOps teams are good at change and learning. SecOps can lend a security perspective.
To foster more collaboration across teams, companies doing this well are sharing information openly, fostering empathy across teams, getting rid of cross-team blame games, using chat to keep teams connected during incidents, and prioritizing incident reviews where everyone’s given a seat at the table.
The shift from reactive to proactive
In ITIL guidelines, typically incident management is seen as a separate practice from incident prevention. Both are important pieces of the ITSM puzzle, but they don’t often happen in tandem.
The problem with this approach is that it keeps incident management in a reactive state. On-call employees are tasked with putting out fires, and as soon as the fire is out, they move on to the next one. The only goal in mind is recovery—getting the system back up and running.
But recovery isn’t the whole picture. And more IT teams are realizing and embracing this over time, folding prevention into the process of incident management and using metrics like mean time to resolve instead of mean time to recovery to judge their performance.
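To make the difference between the two metrics concrete, here is a small sketch. It assumes that "recovery" ends when service is restored and that "resolve" ends only once the follow-up work to prevent a repeat is done; the log layout and the timestamps are invented for illustration.

```python
from datetime import datetime, timedelta
from statistics import mean

# Hypothetical incident log: (detected, service_restored, fully_resolved).
# "fully_resolved" is assumed to include the preventive follow-up work.
t0 = datetime(2024, 1, 1, 3, 0)
log = [
    (t0, t0 + timedelta(minutes=30), t0 + timedelta(hours=5)),
    (t0, t0 + timedelta(minutes=90), t0 + timedelta(hours=3)),
]


def mean_hours(pairs):
    """Average the (start, end) gaps and return the result in hours."""
    return mean((end - start).total_seconds() for start, end in pairs) / 3600


mean_time_to_recovery = mean_hours((d, restored) for d, restored, _ in log)
mean_time_to_resolve = mean_hours((d, resolved) for d, _, resolved in log)

print(mean_time_to_recovery)  # 1.0 hour until service is back
print(mean_time_to_resolve)   # 4.0 hours until the incident is truly closed
```

In this made-up data the service is back in an hour on average, but the incident is not truly done for four. A team that tracks only the first number never sees the prevention work.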
This approach is often called problem management, and its goal is to bring processes closer together: to make sure teams aren’t just responding to one fire and moving on to another, but that they respond, recover, and learn from the incident, applying those learnings to both the problem at hand and the larger product and service systems they’re managing.
Many enterprise IT organizations will have a dedicated practice for Problem Management. They typically treat it as a separate process for a separate team. At Atlassian we advocate taking this one step further with a blended approach in which IT Ops and developer teams fold the problem management practice into their incident practices. This provides better visibility across the incident and ensures incident analysis doesn’t happen long after the incident occurred.
Because, in the long term, there’s more value in preventing incidents than in responding to them quickly.
Staying the course with process and documentation
One of the challenges inherent in this shift to cross-team collaboration on incident management is that some teams are more relaxed than others about process and documentation.
This is one of the places where IT can provide oversight and significant value even as other teams take on management of their own products. Because nobody wants to take on a major incident bleary-eyed at 3 a.m. without a solid plan.
When folding teams into the incident management process, IT can help them answer the core questions that will determine that plan. For example:
- What is your incident response?
- What are the values you’ll follow?
- How will you respond in case of an incident?
- Where is the information you need for the critical systems you support? If it’s in multiple systems, how can you bring that information together and make it easily accessible to on-call experts?
- Are your process and documentation collaborative and reviewable by the team?
Is your company culture ready for change?
This shift toward decentralization, collaboration, and a blending of incident and problem management requires more than simply re-distributing responsibilities and scheduling an IT pro to sit in on a DevOps postmortem. The key to success here isn’t in the technology or even the processes. It’s in creating an internal culture that supports those changes.
This is the part too many companies try to skip and it’s the foundation for a successful transition. So, what does a culture that supports decentralized, collaborative, future-focused incident management look like?
At Atlassian, we think the core components are:
Openness and information sharing
If teams don’t know and can’t access what other teams are doing, we lose opportunities for aha moments that lead to better communication, processes, and products.
When we ask questions like “what’s really best for the customer?” sometimes the answers we come up with don’t jibe with our current practices. It takes an intentional customer focus to move us toward the kind of communication, process, and structural efficiencies that ultimately make our products better for customers.
Regular health checks
How is each team doing? How are individual team members feeling about things? What can the team improve on? What are they knocking out of the park? At Atlassian, we have a team playbook that helps us check the health of our teams and introduce them to new ways of working.
If DevOps is pointing the finger at IT and IT is rolling its proverbial eyes at the more relaxed approach of DevOps, that’s not a recipe for collaboration. Fostering empathy and connections across teams is essential if we want them to communicate, innovate, and work together well.
Teams should be empowered to fix problems quickly and make decisions independently whenever possible. Individuals within those teams should feel empowered to speak up if they have a question, suggestion, or concern—no matter their position on the team or their years of experience.
When junior developers feel like they can raise a hand in meetings and flag an issue, even when someone more senior was responsible for that code, the result is innovative new ideas, improved processes, and bugs caught before they go live.