Our Approach to Handling Security Incidents
Atlassian has a comprehensive set of security measures in place to ensure we protect customer information and offer the most reliable and secure services we can. However, we also recognize that security incidents can (and do) still happen, and so it's just as important to have effective methods for handling them should they arise.
As a result, we have a clearly defined approach for responding to security incidents affecting our services or infrastructure. It includes comprehensive logging and monitoring of our products and infrastructure so that we detect potential incidents quickly, along with carefully defined processes that make clear what we need to do at every stage of an incident. This is supported by a team of highly qualified on-call incident managers with significant experience in coordinating an effective response, and we have access to a range of external experts who can help us investigate and respond as effectively as possible. We have structured our incident management approach on guidance from NIST Special Publication 800-61, the Computer Security Incident Handling Guide, and we catalog our incidents according to the Verizon VERIS framework.
More on our philosophy and approach
We consider a security incident to be any instance where there is an existing or impending negative impact to the confidentiality, integrity or availability of our customers' data, Atlassian's data, or Atlassian's services.
We previously qualified this impact with the word 'intentional'. We have since removed that qualifier so that accidental incidents, such as unintentional data leaks, are also covered.
Core to the way we respond to security incidents is ensuring that we uphold our values, in particular making sure we "Don't #@!% the Customer (DFTC)". We're focused on putting the best processes in place so that we always handle security incidents in the best interests of our customers and they continue to have an outstanding experience using our products. To that end, we've developed a robust incident response process that incorporates the features discussed below.
Several avenues to detect potential incidents quickly
We have several monitoring mechanisms in place to detect failures or anomalies in our products and infrastructure that may indicate a potential security incident. These systems alert us immediately when they detect activity that requires further investigation. Our aggregated log capture and analytics platform collates logs in a single location so our analysts can investigate quickly and thoroughly, and our Site Reliability Engineers monitor the platform to make sure it is always available. We also create alerts in our security information and event management (SIEM) application that notify our teams proactively.
We also maintain external reporting channels through which we may become aware of vulnerabilities or incidents, including our Bug Bounty program, our customer support portal, and defined security email inboxes and phone numbers.
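As a rough illustration of the kind of alert rule a SIEM might evaluate over aggregated logs, the sketch below flags any source that produces too many failed logins within a sliding time window. The event fields, the threshold of five failures in 60 seconds, and the function name `failed_login_alerts` are all hypothetical; they are not taken from Atlassian's actual detection rules.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class LogEvent:
    timestamp: float  # seconds since epoch
    source_ip: str
    action: str       # hypothetical schema, e.g. "login_failed" or "login_ok"


def failed_login_alerts(events, threshold=5, window_seconds=60.0):
    """Return (source_ip, timestamp) pairs at which a source crossed the threshold.

    Hypothetical rule for illustration; assumes events are sorted by timestamp.
    """
    recent = defaultdict(deque)  # source_ip -> timestamps of recent failures
    alerted = set()
    alerts = []
    for event in events:
        if event.action != "login_failed":
            continue
        window = recent[event.source_ip]
        window.append(event.timestamp)
        # Evict failures that have fallen out of the sliding window.
        while window and event.timestamp - window[0] > window_seconds:
            window.popleft()
        if len(window) >= threshold and event.source_ip not in alerted:
            alerts.append((event.source_ip, event.timestamp))
            alerted.add(event.source_ip)  # raise at most one alert per source
    return alerts
```

A per-source `deque` keeps eviction cheap as new events arrive. Five failures ten seconds apart from one address would trigger an alert on the fifth failure, while the same five failures spread thirty seconds apart would not, because only three of them ever fall inside a single 60-second window.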
An established framework for managing security incidents
To ensure our incident response process is consistent, repeatable and efficient, we have a clearly defined internal framework that covers the steps we need to take at each phase of the process. We maintain continually updated playbooks that define in detail how to respond effectively to different incident types. At a high level, our response framework covers:
Incident detection and analysis – the steps we take following initial notifications we receive about a potential incident, including how we confirm whether a security incident has occurred (so that we minimize false positives), through to understanding the attack vectors, scope of compromise, and the impact to Atlassian and its customers.
Incident severity categorization – Once we understand what has happened through appropriate analysis, we use this information to determine the severity of the incident. We assign one of four severity levels to each incident:
|Severity|Description|
|---|---|
|0|Crisis incident with maximum impact|
|1|Critical incident with very high impact|
|2|Major incident with significant impact|
|3|Minor incident with low impact|
The indicators we use to determine severity vary depending on the product involved, but they include whether there is a total service outage (and how many customers are affected), whether core functionality is broken, and whether any data has been lost.
Containment, eradication and recovery – Considering the incident severity, we then determine and implement the steps necessary to contain the incident, eradicate the underlying causes and start our recovery processes to ensure we return to business-as-usual as quickly as possible. Naturally, the steps we take in this phase will vary significantly depending on the nature of the incident. Whenever it will benefit our customers (or as required by our legal or contractual obligations), Atlassian will also communicate with its customers about the incident and its potential impacts for them during this phase of the incident response process.
Notification – We aim to notify affected customers without undue delay if their data is involved in a confirmed incident. Initial notifications may be light on detail, but we will share further details as they become available.
A robust post-incident review process – After every incident is resolved, we review what happened and identify lessons that can inform new technical solutions, process improvements and additional best practices, so that we continue to provide the best experience for our customers and make the job of malicious actors even harder next time.
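The exact severity criteria are not published and vary by product, so the following is only a hypothetical sketch of how the indicators named in the severity categorization step could be combined into one of the four levels. The `LARGE_CUSTOMER_IMPACT` cut-off, the rule ordering, and all names are assumptions for illustration.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    # Lower number means higher severity, matching the 0-3 scale above.
    CRISIS = 0    # maximum impact
    CRITICAL = 1  # very high impact
    MAJOR = 2     # significant impact
    MINOR = 3     # low impact


@dataclass(frozen=True)
class ImpactIndicators:
    total_outage: bool
    customers_affected: int
    core_functionality_broken: bool
    data_loss: bool


# Hypothetical cut-off for a "large" customer impact; real criteria vary by product.
LARGE_CUSTOMER_IMPACT = 1000


def categorize(ind: ImpactIndicators) -> Severity:
    """Map impact indicators to a severity level (illustrative rules only)."""
    widespread = ind.customers_affected >= LARGE_CUSTOMER_IMPACT
    if ind.data_loss and (ind.total_outage or widespread):
        return Severity.CRISIS
    if ind.data_loss or (ind.total_outage and widespread):
        return Severity.CRITICAL
    if ind.total_outage or ind.core_functionality_broken:
        return Severity.MAJOR
    return Severity.MINOR
```

For example, `categorize(ImpactIndicators(total_outage=False, customers_affected=50, core_functionality_broken=True, data_loss=False))` returns `Severity.MAJOR`. Using an `IntEnum` keeps the levels comparable, so a check such as `severity <= Severity.CRITICAL` reads as "critical or worse".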
Clearly defined roles and responsibilities
Every incident we experience is managed by one of our highly qualified and experienced Major Incident Managers (MIMs). MIMs typically make security-related decisions, oversee the response process and allocate tasks internally. They are supported by incident analysts, who lead the investigation and analysis of incidents, as well as a range of other roles that assist with the response. When an incident has impact across more than one locale, we often assign two MIMs so that someone is always accountable for keeping the response moving forward, and containment or recovery activities are not held up by time differences.
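To make the two-MIM idea concrete, the sketch below picks a primary MIM who is on shift now (or soonest) and, for multi-locale incidents, a secondary MIM in another time zone who is available soonest after the primary's shift ends. The roster, the 09:00 to 17:00 shift, the whole-hour UTC offsets (daylight saving is ignored), and every name are hypothetical; this is not a description of how MIMs are actually rostered.

```python
from dataclasses import dataclass

SHIFT_START, SHIFT_END = 9, 17  # hypothetical local business hours


@dataclass(frozen=True)
class Mim:
    name: str
    utc_offset: int  # whole hours from UTC; daylight saving ignored for simplicity


def hours_until_on_shift(mim: Mim, utc_hour: int) -> int:
    """0 if the MIM is on shift at utc_hour, else hours until their shift starts."""
    local = (utc_hour + mim.utc_offset) % 24
    if SHIFT_START <= local < SHIFT_END:
        return 0
    return (SHIFT_START - local) % 24


def assign_mims(roster: list[Mim], utc_hour: int, multi_locale: bool) -> list[Mim]:
    # Primary: whoever is on shift now, or comes on shift soonest.
    primary = min(roster, key=lambda m: hours_until_on_shift(m, utc_hour))
    if not multi_locale:
        return [primary]
    # UTC hour at which the primary's local shift ends (the hand-off point).
    handoff_utc = (SHIFT_END - primary.utc_offset) % 24
    others = [m for m in roster if m.utc_offset != primary.utc_offset]
    if not others:
        return [primary]
    # Secondary: in another time zone, available soonest after the hand-off.
    secondary = min(others, key=lambda m: hours_until_on_shift(m, handoff_utc))
    return [primary, secondary]
```

With a hypothetical roster of MIMs at UTC+10, UTC+0 and UTC-8, an incident raised at 02:00 UTC would go to the UTC+10 MIM first, with the UTC+0 MIM as the second MIM, because their day starts shortly after the first shift ends.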
In very large-scale incidents, a MIM from another team (normally Site Reliability Engineering) may be called in to help manage the response process. You can read more detail about the roles and responsibilities that we assign when it comes to security incidents.
Access to external experts where required
Sometimes, we may need a helping hand from an external expert to assist us with investigating an incident. We retain the services of specialist cyber security consultants and forensic experts for cases where we may require in-depth forensic analysis or forensic holds for e-discovery in support of litigation.
How we use our own tools to manage security incidents
We use specially configured versions of many of our own products to help ensure we're able to be as methodical, consistent and dynamic with handling incidents as possible. These include:
Confluence – We use Confluence to collaboratively create, document and update our incident response processes in a central location, making sure they are disseminated to all staff and can be updated quickly in response to lessons learned from past incidents. We also use Confluence to document our plays and hunts.
Jira – We use Jira to create tickets both for the initial investigation of suspected incidents and, if that investigation confirms an incident has taken place, for facilitating and tracking our response. These tickets help us aggregate information about an incident, develop resolutions, and perform other logistical work, such as delegating tasks as part of the response and reaching out to other teams within the company where necessary. We also use Jira to track which hunts we execute and whether each hunt succeeds or fails.
Bitbucket – We use Bitbucket as our source code control tool when we develop code-based solutions to unique edge-case problems that arise with certain types of incidents. These solutions can be collaborated on and tested internally while remaining private, which supports rapid iteration as necessary. We also use Bitbucket with a Continuous Integration/Continuous Delivery (CI/CD) plan to roll out code that helps mitigate the cause of an incident or aids in detecting or preventing future incidents.
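Purely as an illustration of how an incident ticket like those described for Jira could be opened programmatically, the sketch below builds a request for Jira's REST API v2 create-issue endpoint (`POST /rest/api/2/issue`). The site URL, the `SEC` project key, the `Task` issue type and the label names are hypothetical; nothing here describes how Atlassian's security team actually creates its tickets. Authentication is omitted, and the request is built but never sent.

```python
import json
import urllib.request

JIRA_BASE_URL = "https://your-site.atlassian.net"  # hypothetical Jira site
INCIDENT_PROJECT_KEY = "SEC"                       # hypothetical project key


def incident_issue_payload(summary: str, description: str, severity: int) -> dict:
    """Build a create-issue body for Jira's REST API v2."""
    return {
        "fields": {
            "project": {"key": INCIDENT_PROJECT_KEY},
            # A real configuration might use a custom "Incident" issue type instead.
            "issuetype": {"name": "Task"},
            "summary": summary,
            # API v2 accepts plain-text descriptions; v3 expects Atlassian Document Format.
            "description": description,
            "labels": ["security-incident", f"sev-{severity}"],
        }
    }


def build_create_request(payload: dict) -> urllib.request.Request:
    # Authentication (for example an API token) is deliberately omitted here.
    return urllib.request.Request(
        f"{JIRA_BASE_URL}/rest/api/2/issue",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Keeping payload construction separate from the HTTP request makes the ticket format easy to test without network access, and the severity label lets tickets be filtered by the 0-3 scale described earlier.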
Ultimately, these tools help us establish a response framework in which every incident, regardless of type, has a consistent level of structure and familiarity, so that we can move as quickly as possible toward a resolution.
Atlassian employs a robust and comprehensive approach to handling security incidents, centered around the use of the same tools we make available to our customers. This enables us to respond to incidents with a high degree of consistency, predictability and effectiveness and minimize the potential for damage to our customers, our partners, and Atlassian itself.
Want to dig deeper?
We have published a number of other resources you can access to learn about our approach to handling security incidents, and our general approach to security.