
Incident management for high-velocity teams

The 7 stages of effective incident response

In the midst of daily operations, an IT leader suddenly receives a barrage of alerts: a service outage threatens to disrupt their systems. However, the seasoned incident management team has faced similar challenges before and swiftly springs into action. By following a well-rehearsed plan and incident response best practices, they coordinate to mitigate the issue, limit damage, and restore operations, averting customer impact.

Incident response should not be improvised in the moment. It should be a well-defined series of practices and processes that you put into action when unforeseen events occur. A structured incident response lifecycle gives companies a strategic framework for swiftly identifying, reacting to, and neutralizing disruptions or security threats, so that they can return to normal operations promptly.

This guide will cover the incident response lifecycle and its phases, the types of security incidents, and essential tools for effective incident management. Additionally, it will address key team members, potential challenges, and insights to streamline and fortify incident response strategies.

What is incident response?

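Incident response is an organization’s process for reacting to IT threats such as cyberattacks, security breaches, and server downtime.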

Other IT operations and DevOps teams may refer to this practice as “major incident management” or simply “incident management.”

The incident response process

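The following sections describe the incident response process based on material from our own incident handbook, covering what to do from the moment a service outage is discovered until the service is up and running again.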

In this article, we cover the seven key stages of incident response:

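  1. Detect the incident
  2. Set up team communication channels
  3. Assess the impact and apply a severity level
  4. Communicate with customers
  5. Escalate to the right responders
  6. Delegate incident response roles
  7. Resolve the incident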

Figure: Incident response workflow

Detect the incident

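Ideally, monitoring and alerting tools detect an incident and notify your team before customers even notice. Sometimes, though, you first learn about an incident from Twitter or customer support tickets.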

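However the incident is detected, your first step should be to record that a new incident has been opened in your incident tracking tool. In an incident management solution such as Jira Service Management, alerting and communication capabilities are integrated with your tracking tool.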
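
As a rough illustration of what recording a new incident can involve, the sketch below models a minimal incident record in Python. The field names, the `INC-` key format, and the `open_incident` helper are all assumptions made for this example; they are not the schema or API of Jira Service Management or any other tracking tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from itertools import count

# Hypothetical in-memory counter standing in for a tracker's ID generator.
_next_id = count(1)


@dataclass
class Incident:
    key: str
    summary: str
    detected_via: str  # e.g. "monitoring", "social media", "support ticket"
    opened_at: datetime
    status: str = "open"
    timeline: list = field(default_factory=list)

    def log(self, note: str) -> None:
        """Append a timestamped note to the incident timeline."""
        self.timeline.append((datetime.now(timezone.utc), note))


def open_incident(summary: str, detected_via: str) -> Incident:
    """Create a new incident record, however the incident was detected."""
    incident = Incident(
        key=f"INC-{next(_next_id)}",
        summary=summary,
        detected_via=detected_via,
        opened_at=datetime.now(timezone.utc),
    )
    incident.log(f"Incident opened (detected via {detected_via})")
    return incident
```

Calling `open_incident("Checkout errors", "support ticket")` would return an open record whose timeline already notes how the incident was first noticed, which is the kind of detail that is easy to lose later.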

Set up team communication channels

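When an incident manager (IM) comes on board, the first thing they do is set up the incident team’s communication channels. The goal at this point is to establish and focus all incident team communications in well-known places, such as: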

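  • A chat room in Slack or another messaging service.
  • A video chat in a conferencing app such as Zoom (or, if everyone is in the same place, a physical room where the team can gather).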

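We prefer to use both video chat and text chat tools during an incident because each excels at different things. Video chat is well suited to quickly building a shared mental picture of the incident through group discussion. Slack, meanwhile, helps generate a timestamped record of the incident, along with a central collection of links to screenshots, URLs, and dashboards.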

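Slack and most other chat tools let users set a room topic. The incident manager should use this field to provide information about the incident and useful links.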

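Finally, the IM sets their personal chat status to the issue key of the incident they are managing. This lets colleagues know that they are busy managing an incident.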
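
One possible way to keep room topics consistent across incidents is to build them from the same few pieces of information each time. The helper below is a hypothetical sketch: the `room_topic` name, the separator, and the example URL are assumptions, and it only formats a string rather than calling any chat tool’s API.

```python
def room_topic(issue_key: str, summary: str, links: dict[str, str]) -> str:
    """Format a one-line room topic from the incident summary and useful links."""
    parts = [f"{issue_key}: {summary}"]
    # Each link is rendered as "Name: URL" so responders can find it quickly.
    parts.extend(f"{name}: {url}" for name, url in links.items())
    return " | ".join(parts)


topic = room_topic(
    "INC-1",
    "API returning errors",
    {"Dashboard": "https://example.com/dashboard"},  # placeholder URL
)
print(topic)
```

The same issue key used in the topic can then double as the IM’s personal chat status.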

Preparation

Preparation is the core of an incident response plan and determines a company’s responsiveness to an attack. A well-documented pre-incident process facilitates smooth navigation through intense, high-stress scenarios.

A robust incident response process, such as one based on the Atlassian Incident Handbook, helps make a company more resilient.

Identification

This phase involves detecting and verifying incidents through error messages, log files, and monitoring tools. Incidents might also surface through social media or customer support tickets, in which case the response team needs to record the incident manually in an incident-tracking tool.

Tools like Jira Service Management centralize all alerts and incoming signals from your monitoring, service desk, and logging applications, making it easy to categorize and prioritize issues.

Containment

Once you detect an incident, containment helps prevent further damage. During containment, the response team aims to minimize the scope and effects of an incident.

Eradication

Following containment, the primary focus shifts to removing threats from the company’s network or system. This phase involves a meticulous cleansing of all systems, removing any lingering malicious content to minimize the risk of potential reinfection.

Once a comprehensive investigation is complete and the threats have been eliminated, companies can begin restoring normal operations.

Recovery

After eradicating the threats, the team focuses on restoring the affected systems to their pre-incident state. Data recovery and system restoration are vital for minimizing further losses and ensuring smooth operations.

Lessons learned

Incident debriefings are crucial to refining incident response strategies. The team reviews documentation, evaluates performance, and implements changes to enhance incident handling efficiency. Every incident is a learning opportunity for the incident response team.

Tools for effective incident response

Teams need specialized tools, such as security information and event management (SIEM) systems, intrusion detection systems (IDS), forensic tools, and communication platforms, for streamlined incident response processes.

Tools like Jira Service Management play a critical role in reducing resolution time and negative impacts. They automatically limit noise and surface the most crucial issues to the right team using powerful routing rules and multiple communication channels. 
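
To make the idea of limiting noise and routing alerts more concrete, here is a minimal Python sketch. It is not how Jira Service Management implements routing; the `Alert` fields, the priority scale, the team names, and the rules themselves are all hypothetical and chosen only to illustrate deduplication followed by first-match routing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    service: str
    check: str
    priority: int  # 1 = most urgent (hypothetical scale)


# Hypothetical routing rules: the first matching rule wins.
ROUTING_RULES = [
    (lambda a: a.service == "payments", "payments-oncall"),
    (lambda a: a.priority <= 2, "platform-oncall"),
]
DEFAULT_TEAM = "triage-queue"


def deduplicate(alerts):
    """Collapse repeated alerts for the same service and check, keeping the most urgent."""
    best = {}
    for alert in alerts:
        key = (alert.service, alert.check)
        if key not in best or alert.priority < best[key].priority:
            best[key] = alert
    return list(best.values())


def route(alert):
    """Return the team for the first rule that matches the alert."""
    for matches, team in ROUTING_RULES:
        if matches(alert):
            return team
    return DEFAULT_TEAM


def triage(alerts):
    """Deduplicate, sort by urgency, and pair each alert with its target team."""
    unique = sorted(deduplicate(alerts), key=lambda a: a.priority)
    return [(alert, route(alert)) for alert in unique]
```

Given two `payments` latency alerts with different priorities plus one low-priority alert from another service, `triage` would keep only the more urgent payments alert, send it to the payments team, and leave the low-priority alert in the default queue.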

Assess the impact and apply a severity level

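Once the incident team’s communication channels are set up, the next step is to assess the incident so that the team can decide what to tell people and who is needed to fix the problem.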

The IM asks the team the following set of questions:

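  • What is the impact on customers (internal or external)?
  • What are customers seeing?
  • How many customers are affected (some, all)?
  • When did it start?
  • How many support cases have customers opened?
  • Are there other factors, such as Twitter, security, or data loss?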

The next step is usually to assign a severity level.
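
The answers to the questions above can feed a simple, repeatable severity decision. The following Python sketch shows one way that might look. The three-level scale, the field names, and every threshold are illustrative assumptions; each team should substitute its own severity definitions.

```python
from dataclasses import dataclass


@dataclass
class ImpactAssessment:
    customers_affected: str      # "none", "some", or "all"
    customer_facing: bool        # can customers see the problem?
    support_cases: int           # support cases opened so far
    data_loss_or_security: bool  # data loss or a security concern is involved


def severity_level(a: ImpactAssessment) -> int:
    """Map an assessment to a severity level, where 1 is the most severe.

    The levels and thresholds are hypothetical examples, not a standard.
    """
    if a.data_loss_or_security or (a.customer_facing and a.customers_affected == "all"):
        return 1
    if a.customer_facing and (a.customers_affected == "some" or a.support_cases >= 10):
        return 2
    return 3
```

Encoding the rules this way keeps the decision consistent under pressure, while the IM can still override the result when the numbers do not tell the whole story.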

Incident response: Frequently asked questions

Why is incident response important?

A well-structured incident response plan minimizes incident impacts, enabling businesses to act swiftly and efficiently against threats. It reduces recovery time, financial loss, and reputational damage.

Who should be on an incident response team?

The incident response team should cover a diverse set of roles and responsibilities, including the incident commander, technical leads, communications managers, customer support leads, subject matter experts, social media leads, and problem managers. Executives and leaders across multiple domains within the company should coordinate the team.

What are some challenges of incident response?

Incident response teams often face an array of challenges, from resource constraints to issues with context, prioritization, communication, collaboration, stakeholder visibility, and the occasional human error. Preparedness is crucial to anticipate and tackle these challenges effectively. For example, involving the legal team in the preparation stage can mitigate potential legal or regulatory hurdles.
