Atlassian’s approach to resilience
Keeping your cloud products and the underlying systems and services they use available and able to withstand the impact of negative or unplanned events is as crucial to us as it is to you. To make sure that your products are there when you need them, we’ve implemented technology, people, and programs to provide business resiliency.
Building resilient products
Atlassian operates our cloud products under a shared responsibility model - meaning that achieving reliability is a partnership between you and Atlassian. Under this model, we’re responsible for ensuring the high availability, reliability, and recoverability of our infrastructure, products, and services. It’s your responsibility to implement a disaster recovery program and business continuity plan that ensures you’re able to operate your business during an unplanned event.
We use Amazon Web Services (AWS) as a cloud service provider and rely on its highly available data center facilities in multiple regions worldwide. Each AWS region is a separate geographical location with multiple, isolated, and physically separated groups of data centers known as Availability Zones (AZs).
Each availability zone is designed to be isolated from failures in the other zones and to provide inexpensive, low-latency network connectivity to other AZs in the same region. This multi-zone high availability is the first line of defense for geographic and environmental risks and means that services running in multi-AZ deployments should be able to withstand an AZ failure.
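As a rough illustration of what withstanding an AZ failure means for a multi-AZ deployment, the sketch below checks whether enough replicas would remain if any single zone were lost. The zone names, replica layout, and the `min_healthy` parameter are hypothetical and do not describe Atlassian’s actual deployment.

```python
from collections import Counter


def survives_single_az_failure(replica_zones, min_healthy):
    """Return True if at least `min_healthy` replicas remain after losing any one zone.

    `replica_zones` lists the zone of each replica (hypothetical names).
    """
    per_zone = Counter(replica_zones)
    # Worst case: the zone that holds the most replicas is the one that fails.
    worst_loss = max(per_zone.values(), default=0)
    return len(replica_zones) - worst_loss >= min_healthy


# Three replicas spread over three zones tolerate the loss of any one zone.
spread = survives_single_az_failure(["az-a", "az-b", "az-c"], min_healthy=2)

# Two of three replicas in the same zone do not.
skewed = survives_single_az_failure(["az-a", "az-a", "az-b"], min_healthy=2)
```

The check only looks at replica counts; a real assessment would also have to consider capacity, data replication, and dependencies in each zone.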
To learn more, read the architecture and operational practices page.
Atlassian is dedicated to making sure that all of our teams provide reliable services and products. To do this effectively, our disaster recovery (DR) program is focused on implementing processes, policies, and technologies that ensure critical IT systems and services are available, reliable, and can quickly be restored in the event of an outage.
In addition to the capabilities noted above, we’ve implemented monitoring and alerting and run disaster recovery tests.
Monitoring and alerting
We continuously monitor a wide range of metrics with the aim of detecting potential issues early. Based on those metrics, alerts are configured to notify site reliability engineers (SREs) or the relevant product engineering teams when thresholds are breached so that prompt action can be taken through our incident response process.
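Threshold-based alerting of this kind can be sketched in a few lines. The example below is a minimal illustration only: the metric names, limits, and team names are hypothetical, and it does not represent the monitoring tooling Atlassian actually uses.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    metric: str   # hypothetical metric name
    limit: float  # alert when the sampled value exceeds this limit
    notify: str   # hypothetical team to notify


def breached(samples: dict[str, float], thresholds: list[Threshold]) -> list[str]:
    """Return one alert message for each threshold whose metric exceeds its limit."""
    alerts = []
    for t in thresholds:
        value = samples.get(t.metric)
        # Metrics with no sample are skipped rather than treated as breaches.
        if value is not None and value > t.limit:
            alerts.append(f"{t.notify}: {t.metric}={value} exceeds {t.limit}")
    return alerts


thresholds = [
    Threshold("error_rate", 0.05, "sre-on-call"),
    Threshold("p99_latency_ms", 800, "product-eng"),
]
alerts = breached({"error_rate": 0.08, "p99_latency_ms": 450}, thresholds)
```

Here only the error rate is above its limit, so a single alert is raised for the hypothetical `sre-on-call` team. In practice the alert would feed into an incident response process rather than a list of strings.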
SREs also play a key role in the DR program by working with our risk and compliance team to align with compliance frameworks. Each of our teams also includes a DR champion to oversee and help manage disaster recovery aspects related to that team.
Disaster recovery (DR) tests
Our DR tests cover process and technology aspects, including relevant process documentation and failover tests on our systems. These tests range from standard tabletop simulation exercises to full-scope availability zone or regional failover tests. Regardless of the complexity of the test, we are diligent in capturing and documenting test results, analyzing and identifying possible improvements, and then driving them to closure with the help of Jira tickets to ensure continuous improvement of the overall process.
Ensuring reliable services
We prove our commitment to reliability through our service level agreements (SLAs), which define the amount of uptime we need to guarantee to our customers each month.
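To make the idea of a monthly uptime commitment concrete, the sketch below converts an uptime percentage into the downtime it leaves room for. The 99.9% figure and the 30-day month are assumptions chosen for illustration; they are not taken from Atlassian’s SLA.

```python
def allowed_downtime_minutes(uptime_percent: float, days_in_month: int = 30) -> float:
    """Minutes of downtime that a monthly uptime target leaves room for."""
    total_minutes = days_in_month * 24 * 60
    return total_minutes * (1 - uptime_percent / 100)


def meets_uptime_target(downtime_minutes: float, uptime_percent: float,
                        days_in_month: int = 30) -> bool:
    """True if the observed downtime stays within the budget for the target."""
    return downtime_minutes <= allowed_downtime_minutes(uptime_percent, days_in_month)


# Hypothetical target, not a published SLA figure.
budget = allowed_downtime_minutes(99.9)
```

With these assumed numbers the budget works out to roughly 43.2 minutes per 30-day month, so 30 minutes of downtime would meet the target and 60 minutes would not.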
We also use other measurements, such as recovery time objectives (RTOs) and recovery point objectives (RPOs). If an unplanned event impacts the reliability of Atlassian’s cloud products, Atlassian will aim to restore normal operations to its cloud products in accordance with the following RPO and RTO:
To view the availability of our products and services, visit our Statuspage.
Our highly available (HA) architecture allows us to restore service in the case of most disruptions that could impact the availability of our cloud products. Some scenarios, however, such as data corruption or deletion within our infrastructure, require us to use more traditional data backup and recovery mechanisms.
To address these scenarios, we operate a comprehensive backup program at Atlassian. This program includes our internal systems and our cloud products, where our backup measures are designed in line with system recovery requirements. We have processes and tools in place that continuously test backups.
However, these backups are not used to revert customer-initiated destructive changes, such as fields overwritten using scripts, or deleted issues, projects, or sites. To avoid data loss, we recommend that you take regular backups. Learn more about creating backups in our documentation.
Minimizing the impact of unplanned events
Atlassian’s Business Resilience team works to ensure that our own essential functions remain operable during and after a business disruption through sound Business Continuity (BC) practices.
The BC program is designed to work together with our DR program and our activities are based upon an annual lifecycle that’s aligned to industry standards. As part of our approach, we conduct our business impact analysis (BIA) process, at least annually, which is the foundation of building effective continuity strategies necessary to protect our people, processes, and technology. The output of these BIAs directly assists in driving the strategy for DR and BC efforts. As a result, our critical business services are able to holistically develop effective DR and BC plans that assist in both the recovery of our essential technology as well as the people and processes behind it.
Our approach to Business Continuity assurance
We continually seek to build capability and assurance of our Business Resilience and recovery strategies through three complementary approaches:
- Exercises: Review existing plans and can be tabletop, functional, or full-scale. They give everyone who plays a part in the plan the opportunity to practice their responsibilities in case of a business disruption, and allow stakeholders to review relevant continuity plans in detail and follow the procedures as they would in a real crisis.
- War-games: Allow us to stress test our response to an existing or possible threat. While we use an all-hazards approach to planning, war-games allow us to pressure-test our approach to specific highly probable or impactful scenarios to ensure our response and recovery strategies are robust.
- Tests: Are pass/fail and allow us to objectively measure whether our plans are effective. This is our predominant approach for testing our disaster recovery strategies so that our effectiveness can be measured and managed.
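
The pass/fail nature of a disaster recovery test can be illustrated by comparing what was measured during the test against recovery time and recovery point targets. This is a minimal sketch under stated assumptions: the `RecoveryTargets` type, the function name, and the target values are hypothetical and are not Atlassian’s published objectives.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryTargets:
    rto: timedelta  # maximum acceptable time to restore service
    rpo: timedelta  # maximum acceptable window of data loss


def dr_test_passes(time_to_restore: timedelta,
                   data_loss_window: timedelta,
                   targets: RecoveryTargets) -> bool:
    """A test passes only if both measured values are within their targets."""
    return time_to_restore <= targets.rto and data_loss_window <= targets.rpo


# Hypothetical targets chosen for illustration only.
targets = RecoveryTargets(rto=timedelta(hours=4), rpo=timedelta(hours=1))

within_targets = dr_test_passes(timedelta(hours=3), timedelta(minutes=30), targets)
too_slow = dr_test_passes(timedelta(hours=5), timedelta(minutes=30), targets)
```

Because the result is a plain boolean, outcomes from repeated tests can be recorded and compared over time, which is what makes effectiveness measurable.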