In April 2022, Atlassian experienced an outage and published a post-incident review (PIR) detailing what happened, our response, and what we’re doing to prevent future incidents from occurring. Since then, we’ve made significant progress with the appropriate funding and prioritization, and we’re beginning to see incremental improvements to our processes.
Here is our progress since our last update in the four key areas of investment we committed to:
1. Establish universal soft deletes across all systems
Upcoming new control: Site Delayed Deletion
We’re moving into the next stage in our plan to strengthen controls and enhance safeguards around delete operations. As we shared last time, we’ve reduced the number of services that can execute deletions to only those that require deletion for business operations. Since then, we’ve been working on implementing the next control: Site Delayed Deletion. This feature will suspend a site and its underlying products for 14 days before the site is automatically deleted. During the suspension period, the site and its underlying products are inaccessible to users, and a status indicator in the Admin Hub will alert site admins of any affected products. Site admins will still be able to request a restoration within the suspension period if the deletion was accidental.
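To make the suspend-then-delete behavior concrete, here is a minimal sketch of the lifecycle described above. It is an illustration only, not Atlassian's implementation: the `Site` class, the `SiteState` values, and the method names are all hypothetical, and the only detail taken from this update is the 14-day suspension window.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum
from typing import Optional

# The 14-day window comes from the update; everything else here is hypothetical.
SUSPENSION_PERIOD = timedelta(days=14)


class SiteState(Enum):
    ACTIVE = "active"
    SUSPENDED = "suspended"  # inaccessible to users, still restorable
    DELETED = "deleted"


@dataclass
class Site:
    name: str
    state: SiteState = SiteState.ACTIVE
    suspended_at: Optional[datetime] = None

    def request_deletion(self, now: datetime) -> None:
        """Suspend the site instead of deleting it immediately."""
        if self.state is not SiteState.ACTIVE:
            raise ValueError(f"cannot delete a site in state {self.state.value}")
        self.state = SiteState.SUSPENDED
        self.suspended_at = now

    def restore(self, now: datetime) -> None:
        """Undo an accidental deletion while the suspension window is open."""
        if self.state is not SiteState.SUSPENDED or self.suspended_at is None:
            raise ValueError("only a suspended site can be restored")
        if now >= self.suspended_at + SUSPENSION_PERIOD:
            raise ValueError("suspension period has expired")
        self.state = SiteState.ACTIVE
        self.suspended_at = None

    def purge_if_due(self, now: datetime) -> bool:
        """Delete the site only once the full suspension period has elapsed."""
        if (
            self.state is SiteState.SUSPENDED
            and self.suspended_at is not None
            and now >= self.suspended_at + SUSPENSION_PERIOD
        ):
            self.state = SiteState.DELETED
            return True
        return False


if __name__ == "__main__":
    t0 = datetime(2022, 11, 1, tzinfo=timezone.utc)
    site = Site("example-site")
    site.request_deletion(t0)
    site.restore(t0 + timedelta(days=3))  # accidental deletion caught in time
    print(site.state.value)
```

The point of the design is that a delete request never destroys data directly; only a separate, time-gated step can do that, which leaves a window for an admin to notice the status indicator and ask for a restoration.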
2. Accelerate our Disaster Recovery (DR) program
Disaster Recovery exercises are underway
We’ve continued our commitment to meet both the recovery point objective (RPO) and the recovery time objective (RTO) standards in our policy. We plan to achieve this through enhanced automation, accelerated multi-product, multi-site restorations, and more frequent Disaster Recovery exercises to optimize our recovery plans.
Teams across the organization participated in our first tabletop exercise focused exclusively on SEV0 conditions and wargamed various DR scenarios while refining roles, responsibilities, and processes as needed. Next quarter, the teams will run a technical exercise that simulates a multi-site, multi-product deletion event using synthetic (fake) sites, allowing our teams to perform a hands-on restoration. These exercises do not affect any production customer sites. We’ll continue to run DR exercises on a quarterly basis, each increasing in scale, to exceed our RPO and RTO standards.
3. Revise our incident management process for large incidents
New playbook defined, tested, and moving to finalization
We committed to revamping our incident management (IM) process across teams, tools, and documentation. Since our last update, we have:
- created a large-scale incident management playbook and completed 90% of our internal tooling improvements to strengthen our response;
- completed a tabletop exercise, using the playbook across teams to enhance scenario planning and preparedness;
- scheduled an IM simulation exercise in the coming weeks to validate and refine our playbook.
Through these continued iterations, our ultimate goal is to strengthen our ability to respond quickly and cross-functionally to large-scale incidents and to deepen the usage of the playbook within Atlassian.
4. Enhance our incident communications playbook
Implemented Severe Incident Contact Form and other escalation tooling improvements
During the April incident, we were unable to communicate with customers promptly due to the loss of key contact information through site deletions. As shared in our last update, we solved this critical communication error by establishing a new recovery process for accessing key customer contacts in the event of temporary data loss or deletion.
Since then, we’ve been focused on long-term improvements that will inform our large-scale incident management process. To achieve this, we analyzed our current crisis customer communication process and identified gaps in both processes and tools for resolution. In the last quarter, we:
- retooled the Severe Incident Contact Form, which now allows customers to contact technical support without authentication in the event their information has been temporarily suspended;
- unified our customer escalations tooling, which now provides domain-level workflows, refined ticket hierarchies, more advanced tracking capabilities, and faster reporting;
- ran several crisis communications exercises to refine and validate our crisis communications playbook.
These completed actions enhance our ability to communicate with customers quickly and to restore customer information in the event of another incident.
This past quarter, the majority of our PIR commitments successfully transitioned from ideation to execution. We expect this progress to continue into 2023 and look forward to sharing more in the new year. Thank you for your trust and partnership.