How Atlassian Manages Customer Data
How Atlassian does Resilience
Our products run on a platform as a service (PaaS) environment that is split into two main sets of infrastructure that we refer to as Micros and non-Micros. Jira, Confluence, Statuspage, Access, and Bitbucket run on the Micros platform, while Opsgenie and Trello run on the non-Micros platform.
We work to minimise customer impact in the event of any disruptions. We leverage multiple geographically diverse data centers, have a comprehensive backup program, and gain assurance by regularly testing our disaster recovery and business continuity plans.
This page provides an overview of how we manage customer data throughout its lifecycle, including how we use native capabilities in Amazon Web Services (AWS) for backups and service availability, how we regularly test our disaster recovery plans, and how we continuously improve our disaster recovery and business continuity plans.
How we manage backups
First things first: Infrastructure and databases
Broadly speaking, Atlassian is split into two main sets of infrastructure where our products run: a platform as a service (PaaS) environment known internally as Micros, and non-Micros. Products running on our Micros platform include Jira, Confluence, Statuspage and Atlassian Access, and products running on non-Micros environments include Bitbucket, Opsgenie and Trello. To keep things simple, this paper will largely focus on our largest products: Jira, Confluence and Bitbucket.
Jira and Confluence Cloud are hosted in multiple AWS regions (specifically US-East, US-West, Ireland, Frankfurt, Singapore and Sydney, with plans to expand to other regions as necessary), using the AWS infrastructure as a service (IaaS) offering. Jira and Confluence Cloud both use logically separate relational databases for each product instance, while attachments in Jira or Confluence Cloud are held in our document storage platform (“Media Platform”), which ultimately stores them in Amazon S3.
Atlassian realizes that whatever your business does it creates data, and without your data you don’t have a business. In line with our “Don’t #$%! The Customer” value, we care deeply about protecting your data from loss and have an extensive backup program.
For Jira and Confluence Cloud, Atlassian utilises the snapshot feature of Amazon RDS (Amazon Relational Database Service) to create automated daily backups of each RDS instance. Amazon RDS snapshots are retained for 30 days with support for point-in-time recovery and are encrypted using AES-256 encryption.
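To make the retention rule above concrete, here is a minimal sketch in Python. It only models the stated 30-day retention of daily snapshots; it does not call AWS, and the function names and the choice to treat a snapshot as expired once it is exactly 30 days old are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone

# Retention period stated in the text above.
RETENTION = timedelta(days=30)


def within_retention(snapshot_time: datetime, now: datetime) -> bool:
    """Return True if a snapshot taken at snapshot_time is still retained at now.

    Assumption: a snapshot expires once its age reaches the retention period.
    """
    age = now - snapshot_time
    return timedelta(0) <= age < RETENTION


def retained_snapshots(snapshot_times, now):
    """Filter a list of snapshot timestamps down to those still retained."""
    return [t for t in snapshot_times if within_retention(t, now)]


def earliest_restorable(now: datetime) -> datetime:
    """Oldest moment still covered by the retention window (hypothetical helper)."""
    return now - RETENTION


if __name__ == "__main__":
    now = datetime(2024, 3, 31, tzinfo=timezone.utc)
    # Hypothetical history: one snapshot per day for the last 45 days.
    history = [now - timedelta(days=i) for i in range(45)]
    print(len(retained_snapshots(history, now)), "snapshots retained")
    print("earliest restorable point:", earliest_restorable(now).isoformat())
```

Under these assumptions, daily snapshots plus a 30-day window bound how far back a restore can reach; the real behaviour is governed by the Amazon RDS configuration, not by this sketch.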
For Bitbucket, data is replicated to a different AWS region and independent backups are taken daily within each region.
Atlassian tests backups for restoration on a quarterly basis. Any issues identified from these tests are raised as Jira tickets and tracked until remedied.
For more information, see our Data Storage FAQ.
How we utilise multiple data centers and availability zones for high availability
With hurricanes, earthquakes, and tsunamis all remote, but non-zero, risks, it is imperative that data be backed up (and replicated) to a different geographical location so that data can be recovered, no matter what happens.
Atlassian does this by utilising AWS’ highly available data center facilities in multiple regions world-wide. Each AWS region is a separate geographical location, which has multiple, isolated locations known as Availability Zones (AZs). For example, us-west-1 (Northern California, on the West Coast of the United States) is a region, within which there are AZs such as us-west-1a and us-west-1b. Both are in the same overall region, but are physically isolated from each other.
Each AZ is designed to be isolated from failures in other AZs, and to provide inexpensive, low-latency network connectivity to other AZs in the same region. This multi-zone high availability is the first line of defence and means that services running in multi-AZ deployments should be able to withstand AZ failure.
Jira and Confluence utilise the multi-AZ deployment mode for Amazon RDS. In a multi-AZ deployment, Amazon RDS provisions and maintains a synchronous standby replica in a different AZ of the same region to provide redundancy and failover capability. The AZ failover is automated and typically takes 60-120 seconds, so database operations can resume as quickly as possible without administrative intervention. These region, AZ and replication concepts are highlighted in the diagrams below. Opsgenie, Statuspage, Trello and Jira Align use similar deployment strategies, with small variances in replica timing and failover timing.
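A failover window of 60-120 seconds means a client of the database sees a short period of failed calls and then recovery. The sketch below shows one generic way an application could ride through such a window by retrying. It is an illustration only, not a description of how Atlassian services are built: `TransientDatabaseError`, `run_with_failover_retry`, and the 180-second budget are all hypothetical.

```python
import time


class TransientDatabaseError(Exception):
    """Hypothetical error raised while the database endpoint is failing over."""


def run_with_failover_retry(operation, max_wait=180.0, delay=5.0,
                            sleep=time.sleep, clock=time.monotonic):
    """Call operation() until it succeeds or the retry budget is exhausted.

    max_wait is an assumed budget chosen to exceed the 60-120 second
    failover window mentioned in the text; delay is the pause between tries.
    sleep and clock are injectable so the logic can be tested without waiting.
    """
    deadline = clock() + max_wait
    while True:
        try:
            return operation()
        except TransientDatabaseError:
            # Give up if another pause would take us past the deadline.
            if clock() + delay > deadline:
                raise
            sleep(delay)


if __name__ == "__main__":
    # Simulated clock: the "database" is unavailable for the first 90 seconds.
    now = [0.0]

    def fake_sleep(seconds):
        now[0] += seconds

    def query():
        if now[0] < 90:
            raise TransientDatabaseError("standby not yet promoted")
        return "ok"

    result = run_with_failover_retry(query, sleep=fake_sleep, clock=lambda: now[0])
    print(result, "after", now[0], "simulated seconds")
```

Fixed-delay retry is used here for brevity; exponential backoff with jitter is a common alternative when many clients retry at once.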
How we determine recovery time and recovery point objectives
In an ideal world, we would never lose any vital business data. In practice though, a system with zero risk of data loss is either unattainable or prohibitively expensive. While culturally at Atlassian, the expectation has been set to aim for this zero data loss scenario and the ability to automatically survive an availability zone failure, in business continuity planning it is necessary to set “recovery time objectives” and “recovery point objectives” (RTOs, and RPOs, respectively) that seek to find the right balance between cost, benefit and risk.
The RTO is the period of time after an incident within which the business process (or system) should be recovered and running again. The RPO is effectively the amount of data the organisation accepts it may lose in a recovery operation. As a simple example, if you take backups daily and an incident occurs at end-of-day, recovering from the most recent backup (taken the previous day) means losing up to 1 day of data. That 1 day is the RPO.
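The daily-backup example above is simple arithmetic: the data lost is the time between the last good backup and the incident. A minimal Python sketch of that calculation follows; the function names and the timestamps are hypothetical, and treating a loss exactly equal to the RPO as acceptable is an assumption.

```python
from datetime import datetime, timedelta


def data_loss_window(last_backup: datetime, incident: datetime) -> timedelta:
    """Time span of data written after the last backup and before the incident."""
    if incident < last_backup:
        raise ValueError("incident must not precede the last backup")
    return incident - last_backup


def meets_rpo(last_backup: datetime, incident: datetime, rpo: timedelta) -> bool:
    """True if restoring from last_backup loses no more data than the RPO allows."""
    return data_loss_window(last_backup, incident) <= rpo


if __name__ == "__main__":
    # Backup taken at 18:00 yesterday, incident at 18:00 today.
    backup = datetime(2024, 1, 1, 18, 0)
    incident = datetime(2024, 1, 2, 18, 0)
    print("data lost:", data_loss_window(backup, incident))
    print("meets a 1-day RPO:", meets_rpo(backup, incident, timedelta(days=1)))
    print("meets a 1-hour RPO:", meets_rpo(backup, incident, timedelta(hours=1)))
```

The same calculation shows why a tighter RPO requires more frequent backups or continuous replication: the worst-case loss is bounded by the interval between recovery points.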
Our business impact and risk assessments assist our teams in setting custom RTO and RPO targets based on client user requirements and the potential impact of a disruption.
More specifically, we split our services up into easily understandable buckets which we call tiers. Three tiers are defined for products and customer facing services, Atlassian business systems and internal tools (Tiers 1, 2 and 3), and an underlying tier (Tier-0) provides an even higher availability standard for the critical components that everything relies upon.
For each tier, we’ve defined mandatory targets by reviewing, amongst other things, business impact assessments and typical usage scenarios for the services we build. Our service tiers help determine availability, reliability, RTO and RPO targets as set out in the table below.
| ||Tier 0||Tier 1||Tier 2||Tier 3|
|Critical infrastructure and service components||Our Tier 0 services are those that form the basis of all other services and are critical to delivery of our products.||Our Tier 1 services generally are our products, or directly related to delivery of our products.||Tier 2 services are either non-critical or internal facing.||Tier 3 services are either non-critical or internal facing.|
|Example Services:||AWS Platform, Micros Server, Networking Core||Jira and Confluence Cloud||Image Effects||Receiving analytics and/or BI data|
|RPO*||<1 hour||<1 hour||<8 hours||<24 hours|
|RTO**||<4 hours||<6 hours||<24 hours||<72 hours|
*RPO – Recovery Point Objective – data loss in event of disaster
**RTO – Recovery Time Objective – services restoration in the event of a disaster
At Atlassian, we designate responsibility to Service Owners for ensuring that the relevant service meets its RPO and RTO target.
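The tier targets in the table lend themselves to a simple lookup. The Python sketch below encodes the RPO and RTO values from the table and checks a measured recovery against them. The names `TierTargets`, `TIER_TARGETS` and `meets_targets` are hypothetical, and this is an illustration of the idea rather than Atlassian's internal tooling.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class TierTargets:
    rpo: timedelta  # maximum acceptable data loss
    rto: timedelta  # maximum acceptable time to restore service


# Values copied from the table above (all targets are "less than").
TIER_TARGETS = {
    0: TierTargets(rpo=timedelta(hours=1), rto=timedelta(hours=4)),
    1: TierTargets(rpo=timedelta(hours=1), rto=timedelta(hours=6)),
    2: TierTargets(rpo=timedelta(hours=8), rto=timedelta(hours=24)),
    3: TierTargets(rpo=timedelta(hours=24), rto=timedelta(hours=72)),
}


def meets_targets(tier: int, data_loss: timedelta, downtime: timedelta) -> bool:
    """True if a measured recovery is strictly within the tier's RPO and RTO."""
    targets = TIER_TARGETS[tier]
    return data_loss < targets.rpo and downtime < targets.rto


if __name__ == "__main__":
    # Hypothetical recovery test: 30 minutes of data lost, 5 hours of downtime.
    loss, downtime = timedelta(minutes=30), timedelta(hours=5)
    for tier in sorted(TIER_TARGETS):
        print(f"Tier {tier}: within targets = {meets_targets(tier, loss, downtime)}")
```

A service owner could run a check like this against the results of each recovery test to see whether the service still meets the targets for its tier.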
How we do disaster recovery testing
Atlassian conducts regular disaster recovery testing and strives for continual improvement as part of our Disaster Recovery (DR) Program. This seeks to ensure that customer data and services are reliable and resilient. We conduct both scheduled and ad hoc testing, including the following elements:
Documentation - For the critical/customer facing services (including Tier 0 and Tier 1), backup documentation is reviewed quarterly for accuracy, completeness and currency. Any identified issues are documented and raised as an internal Jira ticket so that each issue is tracked until it is remediated.
Process - Quarterly tests of actual technical backup/recovery processes are also completed for critical/customer facing services (including Tier 0 and Tier 1), to determine whether RTO and RPO objectives are met (based on service tier classification). Any issues identified in these tests are raised as a Jira ticket so that the issue is tracked until it is remediated.
Resilience and Failover – Periodic and ad hoc tests for levels of resilience across AZs are undertaken to ensure Atlassian can handle an AZ failure with minimal downtime. While we understand a complete region failure is highly unlikely, we also periodically test region failover and continue to mature our regional resiliency.
Systems - The Site Reliability Engineering (SRE) teams and product engineering teams continuously monitor a wide variety of metrics across the services to help ensure users have excellent experiences. Automated alerts are configured to notify members of the SRE team when certain thresholds for service metrics are crossed, so that immediate action can be taken within our incident response processes.
Disaster Recovery Dashboard - A DR dashboard is maintained internally so that for the critical/customer facing services (including Tier 0 and Tier 1), Jira tickets relating to oversight, maintenance and testing can be tracked centrally to ensure that reviews of documentation and backup/recovery processes are completed on time.
DR Tests and Simulations – DR tests are performed on an annual and ad hoc basis. As part of our DR tests, we perform table top exercises to help the DR teams walk through various scenarios of potential incidents. Table top exercises test different scenarios and identify gaps in our recovery processes. Scenarios for table top exercises include earthquake, fire, natural disaster, recovery drills and tests. After DR tests are performed, outputs of the tests are captured, analysed and discussed to determine the scope of the next steps for continuous improvement. The improvement efforts are captured within a Jira ticket and tracked until remedied.
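Several of the elements above depend on quarterly reviews and tests being completed on time. As a rough illustration of the kind of check a tracking dashboard might run, the Python sketch below flags services whose last recovery test is older than one quarter. The service names, the `overdue_services` function, and the use of 92 days as "one quarter" are all assumptions.

```python
from datetime import date, timedelta

# Assumption: a "quarter" is approximated as 92 days.
QUARTER = timedelta(days=92)


def overdue_services(last_tested: dict[str, date], today: date) -> list[str]:
    """Return, in alphabetical order, services whose last test is over a quarter old."""
    return sorted(
        name for name, tested in last_tested.items()
        if today - tested > QUARTER
    )


if __name__ == "__main__":
    # Hypothetical record of the most recent backup/recovery test per service.
    last_tested = {
        "service-a": date(2024, 1, 1),
        "service-b": date(2024, 5, 1),
    }
    for name in overdue_services(last_tested, today=date(2024, 6, 1)):
        print(f"{name}: quarterly recovery test is overdue")
```

In practice each overdue item would correspond to a tracked ticket, so that the gap stays visible until the review or test has been completed.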
Atlassian realises that whilst our testing and processes are technically rigorous, we still rely on exceptional people to bring it all together. Accordingly, Atlassian includes the following people elements in our DR Program:
Site Reliability Engineers (“SREs”) – SREs are committed to ongoing periodic DR meetings and represent their critical services. They identify DR gaps with our risk and compliance team, focusing on remediation as necessary.
Disaster Recovery Champions - DR champs are appointed within each product/service team (including underlying services) to oversee and help manage the implementation of DR within that product/service to ensure it meets service tier requirements.
Leadership - we maintain the involvement and ongoing engagement of executive and senior management in our DR processes. With leadership involved, Atlassian has both business and technical drivers accounted for in its strategy for resilience.
Other broader business continuity measures and plans
Atlassian strives to maintain strong Business Continuity (“BC”) and DR capabilities to ensure that the effect on our customers is minimised in the event of any disruptions to our operations. The key principles guiding our BC and DR program include:
Continuous improvement – Atlassian strives to ensure improvements to resilience grow through operational efficiencies, automation, new technologies and proven practices.
Assurance through testing – Atlassian understands that through regularly scheduled testing and the application of continual improvements, we are able to achieve optimal resiliency.
Dedicated resources – Atlassian has dedicated people and teams to ensure our customer-facing products get the attention they need to make BC and DR possible. Atlassian has the right level of resources on the ground to support our steering committee, risk assessments, business impact analysis, testing, and of course real world incidents.
Atlassian combines best-in-class technologies with ongoing testing and validation to ensure our customer data is highly available, reliable and resilient. We operate multiple geographically diverse data centres, have an extensive backup program, and gain assurance through regularly testing disaster recovery and business continuity plans. To top it all off, we have exceptional people and dedicated resources bringing our processes together.