Atlassian’s Five-Alarm Fire: How We Evolved Our Cloud Reliability Practices

Over the last five years, Atlassian has gone through an incredible transformation. We began as a company that built on-premise software and grew to a cloud-first business with 15 million monthly active users hosted on our cloud products. Our cloud customers include enterprises with increasingly high expectations for reliability. Our tools have helped NASA teams design the Mars Rover and Cochlear teams develop hearing implants – a testament to how our tools help build tomorrow’s innovations.

However, as we transformed to an enterprise-scale cloud company, an important step in our growth involved a figurative product update drought and subsequent fire. This is the story of our "Five-Alarm Fire"* and how it accelerated our journey towards cloud reliability. I'll share the crucial lessons that we learned along the way and how they might be relevant to your company.

*A "Five-Alarm Fire" is a phrase used in the US for a large, intense fire that requires an enormous response.


Project Vertigo: Migrating and rearchitecting our Jira and Confluence infrastructure on AWS

During the three years leading up to our "Five-Alarm Fire" in 2019, the majority of our engineers were focused on rearchitecting our largest products, what we called Project Vertigo internally. At the time, this was the largest technical migration challenge that we took on as a company.

Project Vertigo involved moving all of our Jira and Confluence customers from a single-tenant to a multi-tenant architecture built on microservices infrastructure running on Amazon Web Services (AWS). It also involved major changes to the foundations of Jira and Confluence, which are built from millions of lines of code. In order to complete the Vertigo project within our two-year goal and with minimal customer friction, we had to pause almost all of the new feature development outlined on our roadmap.

With Project Vertigo complete, feature demand exploded

As Project Vertigo progressed, we understood that there was pent-up demand for new features, from both our customers and our internal product management and marketing teams.

Once we completed Project Vertigo, we reinvested in product development. We spent the next year deeply focused on building new features requested during Project Vertigo.

Entropy and growth can be challenging

Following Project Vertigo, my engineering leaders and I began to notice challenges at our quarterly engineering operations reviews that were contributing to a small but noticeable trend of problems. These included:

Unintended complexity

When we moved from a small number of monolithic codebases to many more distributed systems and services powering our products, unintended complexity began to creep in. We struggled to add new capabilities with the same velocity and confidence as we had done in the past.

Ramping up new engineers

As we onboarded a large number of new engineers, we saw certain challenges related to quality and reliability. Most of our new hires hadn't yet mastered our entire codebase, which caused short-term challenges with inadvertent side effects.

We realized that we hadn't equipped our engineers with the proper training, tools, and fail-safe processes that help build confidence when moving quickly. As we transitioned from building server software to cloud and distributed systems, we ramped up hiring for engineers with cloud expertise and began training current engineers on how to build cloud systems. In the short term, this meant that we didn't yet have engineers with cloud expertise spread evenly across all of our teams.

Technical debt from our feature drought

By immediately addressing the pent-up demand for new features, we didn't allocate enough time on our roadmaps to address the technical debt that had accumulated during Vertigo. This included the need to reduce technical complexity, improve observability, and address root causes when problems occurred.

Entropy that also affected rituals and processes

We believed we had set up systems and practices to identify problems before they occurred. However, our systems and rituals had suffered from entropy and hadn't been adapted to keep up with our rapid pace of growth and transformation.

High-fidelity observability that became table stakes

Some of our service monitoring and reporting practices were not yet as in-depth as we wanted them to be. For example, we made the mistake of looking only at averages of key metrics rather than also looking at the more revealing 90th and 99th percentiles. We hadn't yet mastered the critical skills of cloud observability. Put together, these problems caused us to raise the alarm.
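To illustrate why averages alone can mislead, here is a minimal Python sketch. The latency values are invented for illustration, and the `percentile` helper is a simple nearest-rank implementation chosen for clarity; neither describes Atlassian's actual data or tooling.

```python
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct (1-100).

    Returns the smallest sample with at least pct% of values at or below it.
    """
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100 avoids floating-point rounding surprises.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds (illustrative, not real data):
# 88 fast requests, 10 slow ones, and 2 very slow ones.
latencies_ms = [100] * 88 + [800] * 10 + [6000] * 2

mean_ms = statistics.mean(latencies_ms)
p90_ms = percentile(latencies_ms, 90)
p99_ms = percentile(latencies_ms, 99)

print(f"mean={mean_ms} ms, p90={p90_ms} ms, p99={p99_ms} ms")
```

With this made-up data, the mean is 288 ms, which looks tolerable, while the 90th percentile is 800 ms and the 99th percentile is 6000 ms. The customers in that tail are invisible if the average is the only number on the dashboard.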

We declared a "Five-Alarm Fire" to rally our organization

As CTO, a big part of my role is to act as a change agent and executive sponsor for large, transformational initiatives. Given the circumstances, I decided it was time for action.

At our next engineering Town Hall meeting, I announced that we were in a "Five-Alarm Fire" due to poor reliability and cloud operations. We were not in alignment with our company value – Don't #@!% the Customer – and I knew it was pivotal to rally our team. Our customers needed to trust that we could provide the next level of reliability, security, and operational maturity to support our business transition to the cloud in the coming years.

Scaling a company involves a clear process. One of our four engineering philosophies – creating radically autonomous, aligned teams – was at risk. We needed to move towards an automated, well-operationalized approach in order to meet our goals for customers.

While we had just celebrated an incredibly successful cloud migration, this was a wake-up call for our teams. Up until that moment, we didn't have a galvanizing event to prompt action, so the "Five-Alarm Fire" marked the beginning of a new chapter for our organization.

We formed an expert team and aligned the organization

When we start a new initiative at Atlassian, we begin by creating an expert team. I chose to bring together three of our organization's reliability experts, Mike Tria, Zak Islam, and Patrick Hill, to lead the effort. Their first step was to run a project kickoff meeting with a small group and then expand it to all of our engineering managers.

This was a pivotal moment to help everyone understand the context and goals, share our sense of urgency, and align on the next steps. They created a "war room" to develop a shared understanding of the problem and establish our common goals.

Mike, Zak, and Patrick took charge and shifted our organization's focus to get everyone on the same page. They created clear communication channels across all levels of the engineering organization via Town Hall meetings, company-wide internal blogs, department-level meetings, and team meetings. Everyone was aware of our acute focus, which was critical in aligning the organization to our new mission – cloud reliability. They also got buy-in from Atlassian leaders across the company, so everyone was on the same page about our top priority until the Five-Alarm Fire ended.

In order to raise the visibility for our goals, signals, and measures, they introduced a new company-wide reliability dashboard. It tracked information, including:

Real-world reliability metrics

Previously, we relied on synthetic availability, measuring whether a basic health check was successful. During this time, we saw our customers identify too many of our incidents before our monitoring tools triggered alarms. We understood that we needed to update our practices and switch to a more accurate measurement of reliability, so we adjusted to evaluate data that reflected how our customers were using our software. To address this, we created a new top-level metric to score our ability to detect incidents before our customers did. This drove significant progress towards improved monitoring.
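A detection metric of this kind could be computed along the following lines. This is only a sketch under stated assumptions: the `Incident` record, its field names, and the rule that an incident counts as "detected first" when the alert timestamp precedes the first customer report are all hypothetical, not the definition used on Atlassian's dashboard.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Incident:
    # Hypothetical record shape, for illustration only.
    name: str
    alerted_at: Optional[datetime]            # when monitoring fired; None if it never did
    customer_reported_at: Optional[datetime]  # first customer report; None if there was none


def detected_first(incident):
    """True when monitoring raised the alarm before any customer did."""
    if incident.alerted_at is None:
        return False
    if incident.customer_reported_at is None:
        return True
    return incident.alerted_at < incident.customer_reported_at


def detection_score(incidents):
    """Fraction of incidents caught by monitoring first, or None if there are no incidents."""
    if not incidents:
        return None
    return sum(detected_first(i) for i in incidents) / len(incidents)


t = datetime(2020, 1, 6, 9, 0)  # arbitrary example timestamp
incidents = [
    Incident("alert-first", t, t + timedelta(minutes=10)),
    Incident("customer-first", t + timedelta(minutes=30), t),
    Incident("alert-only", t, None),
    Incident("alert-well-ahead", t, t + timedelta(hours=1)),
]
score = detection_score(incidents)
print(f"Detected before customers: {score:.0%}")
```

In this invented sample, monitoring wins in three of four incidents, so the score is 75%. Tracking one number like this makes it obvious when customers are acting as the alerting system.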

Time to restore a service

We also raised the priority of a key measure that quantified how quickly we could restore service following an incident. Our aim was to focus on how quickly our teams could detect and recover from problems by using better tools and automation. During this time, teams built custom tools and scripts that would enable certain problems to be detected and corrected automatically, before our customers were impacted.

Incidents are unavoidable, so we choose not to measure teams directly based on the number of incidents they experience. Instead, we celebrate teams with the best detection and recovery mechanisms.

Time to resolution

After every incident, we identify which actions are needed to fix the root causes of problems in order to prevent future occurrences. Fixing these root causes wasn't always prioritized above day-to-day feature development. To correct this issue, we began more clearly measuring the average time until all post-incident actions were completed and communicating where we could shorten it. Our goal was to create incentives for teams to prioritize this work.

Incident recurrence rate

Our SRE team categorizes incidents based on whether the cause has contributed to other incidents in the past or not. Our incident recurrence rate is a strong signal of whether we’re truly addressing root causes, so we decided to add this metric to our company-wide dashboard permanently.
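As a rough illustration of a recurrence rate, the sketch below counts the share of incidents whose cause category has already appeared earlier in the list. The cause labels and the exact counting rule are assumptions made for this example; the source does not specify how the SRE team computes the figure.

```python
def recurrence_rate(root_causes):
    """Share of incidents whose root-cause label was already seen earlier in the list."""
    if not root_causes:
        return 0.0
    seen = set()
    repeats = 0
    for cause in root_causes:
        if cause in seen:
            repeats += 1
        seen.add(cause)
    return repeats / len(root_causes)


# Hypothetical cause labels in chronological order (illustrative only).
causes = ["bad-deploy", "expired-cert", "bad-deploy", "db-failover", "bad-deploy"]
rate = recurrence_rate(causes)
print(f"Incident recurrence rate: {rate:.0%}")
```

Here two of the five incidents repeat an earlier cause, giving a rate of 40%. A rate that stays high suggests post-incident actions are treating symptoms rather than root causes.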

Sharpening our focus on security and reliability

One of the most important things that we introduced at this time was a set of rules to align priorities across our engineering organization. We agreed to sharpen our focus on security and reliability across engineering, product management, marketing, customer support, our executive team, and all other organizations. This agreement had a huge impact across hundreds of teams as they prioritized various projects each day. It provided them with a common framework to help prioritize decisions, while allowing them to remain autonomous. This enabled us to return to functioning as loosely coupled, but highly aligned teams. We initially started with three prioritization rules:

  1. Security
  2. Reliability
  3. Everything else

In the subsequent years, we added further clarification. Our current guidelines for prioritization, excerpted from our handbook, are as follows:

  1. Incidents: Resolve and prevent future incidents, because our products and services are critical to our customers' ability to work.
    (a) If we experience multiple high-priority incidents and we need to distribute resources to address them, we will prioritize them under the following guidance:
      i. Security incidents
      ii. Reliability incidents
      iii. Performance incidents (which we now consider part of reliability)
      iv. Functionality incidents
      v. All other categories of incidents
  2. Build/release failures: Resolve and prevent build/release failures, because our ability to ship code and deploy updates is critical. If we’re unable to build and/or release new software, we’ll be unable to do anything else.
  3. SLO regressions: Triage, mitigate, then resolve out-of-bounds conditions with existing service level objectives (SLOs), because without meeting our SLOs we’re losing the trust of our customers.
    (a) If multiple SLO regressions are competing for attention, security regressions are always given priority.
    (b) The prioritization for SLO regressions includes (in priority order): security, reliability, performance, and bug SLOs.
  4. Functionality regressions: Fix regressions in functionality that have been reviewed and approved by your product manager.
  5. High-priority bugs and support escalations: Resolve bugs in agreement with your product manager and resolve customer escalations, because they impact customer happiness.
  6. All other projects: Everything else, including roadmap deliverables, new objectives, and OKRs.
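One way to picture how a team might apply the ordering above is as a sort key over a backlog. The sketch below is an interpretation for illustration, not a tool described in the handbook: the `Priority` enum, the tuple layout, and the backlog items are all hypothetical.

```python
from enum import IntEnum


class Priority(IntEnum):
    # Mirrors the six top-level guidelines; lower value means handle first.
    INCIDENT = 1
    BUILD_RELEASE_FAILURE = 2
    SLO_REGRESSION = 3
    FUNCTIONALITY_REGRESSION = 4
    BUG_OR_ESCALATION = 5
    OTHER = 6


# Tie-break orders taken from guidelines 1(a) and 3(b).
INCIDENT_KINDS = ["security", "reliability", "performance", "functionality", "other"]
SLO_KINDS = ["security", "reliability", "performance", "bug"]


def sort_key(item):
    """Order by top-level category, then by kind where the guidelines define one."""
    category, kind, _title = item
    if category is Priority.INCIDENT:
        return (category, INCIDENT_KINDS.index(kind))
    if category is Priority.SLO_REGRESSION:
        return (category, SLO_KINDS.index(kind))
    return (category, 0)


# Hypothetical backlog entries: (category, kind, title).
backlog = [
    (Priority.OTHER, None, "Roadmap feature"),
    (Priority.SLO_REGRESSION, "performance", "Latency SLO breach"),
    (Priority.INCIDENT, "reliability", "Service outage"),
    (Priority.BUILD_RELEASE_FAILURE, None, "Broken release pipeline"),
    (Priority.INCIDENT, "security", "Leaked credential"),
    (Priority.SLO_REGRESSION, "security", "Overdue vulnerability fix"),
]

ordered = [title for _, _, title in sorted(backlog, key=sort_key)]
for position, title in enumerate(ordered, start=1):
    print(position, title)
```

The value of encoding the rules this way is that two teams looking at the same backlog arrive at the same order without needing a meeting to negotiate it.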

Developing this clear set of prioritization guidelines represented a major shift for our company. It set us up to evolve to a cloud-first business with enterprise-grade security and reliability at the helm of our products. As a result, our customers know they can rely on our products to get work done.

A roadmap to improved reliability

Mike, Zak, and Patrick led the development of a roadmap to reliability. It included many smaller projects for achieving and sustaining our reliability goals. Then, the team operationalized these plans, established new engineering culture norms, and established tools and systems to create long-lasting change. These are some of the notable projects that contributed to our goals.

Five-Alarm Fire to-do list

Immediately after the Five-Alarm Fire began, the team released a "Five-Alarm Fire to-do list" that was used by every engineering team to help expose and correct weaknesses. All teams were asked to postpone other work to dedicate time to work through this checklist. Examples of tasks in the checklist were:

  • Complete any open follow-up actions from any previous incidents.
  • Perform an immediate review of how your team deploys and rolls back changes to ensure the process has no manual steps and that the team has total confidence in the success of both deploys and rollbacks.
  • Enable our internal chaos tools on all services in production to ensure fault tolerance.
  • Bolster "real user monitoring" for core product capabilities like logging in, viewing a Jira issue, creating a Bitbucket pull request, publishing a Confluence page, and all other core capabilities of the product.
  • Review the change-related incidents you had in the last 6 months, fix the root causes, and implement automated tests to prevent future occurrences.
  • Schedule weekly operational health check meetings to review metrics, address anomalies, and serve as a handoff point between outgoing and incoming on-call personnel.

Architecture and operational reviews

Peer reviews are one of the best tools to improve the quality of engineering output. Having an experienced developer or architect review your work is always a good idea because we all get better by learning from each other, and our standards go up when our work is scrutinized.

Before the Five-Alarm Fire, we were confident because these rituals were in place. But after further scrutiny, we realized that the bar wasn't as high as we thought it was. We noticed some architectural and operational reviews were not as thorough as we believed they were.

As a result, senior engineers and managers became responsible for uncovering entropy in existing processes. They were incentivized to ensure the original intent wasn't lost and the bar wasn't slipping over time. They also have an ongoing responsibility to stress-test plans to ensure that we are seeing around corners and anticipating the unexpected.

As part of our roadmap to reliability, we added additional peer reviews to our normal engineering processes, including:

  • Architecture reviews for any new components or services to ensure our fault-tolerance patterns were being applied.
  • Pre-production operational reviews with our SRE and security teams for any new services to stress-test data integrity & recovery, monitoring, alerting, logging, on-call plans, security, deploys, rollbacks, and other critical aspects of operating reliable services.

Shared incident values

We doubled down on our shared values for blameless incident management and leveraged this document to both train new employees and refresh existing employees. We strive to create as much team and individual autonomy as possible, and having a set of values empowers your people to act when faced with novel situations. Our philosophy is to hire great people from diverse backgrounds and empower them with shared values to guide their decisions and actions.

Automation, tools, dashboards, and reporting

As part of our roadmap to reliability, Mike, Zak, and Patrick realized that we needed better tools to eliminate toil, increase observability and fidelity, and adopt more standardization across our teams and products. We built mechanisms and automation to ensure that we didn't have to rely solely on the vigilance of our people regularly looking at charts and dashboards, since that wasn't sustainable.

Instead, we invested in tools and bots to alert us when our KPIs breached a threshold or if a team was accumulating a backlog of reliability and security tasks, due to underinvestment in those areas. In a growing company like ours, standardization is necessary in order to work efficiently at scale. So, we standardized a consistent data pipeline for observability metrics, reports, and alerts. This also enabled us to apply new levels of intelligence to the data to gain deeper insights.
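The core of such a bot can be very small. The following sketch checks two conditions per team, a KPI over its threshold and a growing reliability backlog, and returns alert messages. The team records, field names, and threshold values are hypothetical placeholders, not Atlassian's actual limits or data model.

```python
def find_alerts(teams, max_restore_minutes=60, max_open_reliability_tasks=10):
    """Return alert messages for teams that breach either (assumed) limit."""
    alerts = []
    for team in teams:
        if team["restore_minutes"] > max_restore_minutes:
            alerts.append(
                f"{team['name']}: time to restore is {team['restore_minutes']} min "
                f"(limit {max_restore_minutes})"
            )
        if team["open_reliability_tasks"] > max_open_reliability_tasks:
            alerts.append(
                f"{team['name']}: {team['open_reliability_tasks']} open reliability tasks "
                f"(limit {max_open_reliability_tasks})"
            )
    return alerts


# Hypothetical per-team snapshot (illustrative values only).
teams = [
    {"name": "team-a", "restore_minutes": 25, "open_reliability_tasks": 3},
    {"name": "team-b", "restore_minutes": 95, "open_reliability_tasks": 4},
    {"name": "team-c", "restore_minutes": 30, "open_reliability_tasks": 17},
]

alerts = find_alerts(teams)
for message in alerts:
    print(message)
```

In practice the snapshot would come from a shared metrics pipeline and the messages would be posted to a chat channel or ticketing system, but the principle is the same: the check runs on a schedule so nobody has to remember to look at a chart.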

Identifying mission-critical metrics to unlock organizational change

An important lesson I'd like to share with other organizations navigating similar transformations is recognizing that if you want something done, you have to develop an automated report and/or alert for it and shine a light on those key metrics. The corollary is also true – you should only build alerts and dashboards for the things you plan on investing in. Too many dashboards, metrics, and alerts can negatively impact your signal-to-noise ratio and create a loss of enthusiasm from teams who become overwhelmed.

We also learned that high-fidelity metrics can really improve your team's understanding of the situation. For example, on Bitbucket, we see a phenomenon whereby the vast majority of API clients, mostly CI systems, poll our APIs almost exactly in unison, because their cron jobs fire on clocks kept in step by network time synchronization. The impact of these spikes would be impossible to visualize on reports that show data averaged over 5-minute intervals. Therefore, it's important to employ higher-resolution monitoring to get an accurate picture of how your systems are functioning. Anecdotes like these remind me of how important high-quality observability tools are.
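A small simulation shows how much a 5-minute average can hide. The traffic numbers below are invented: a steady 10 requests per second, with a burst of 1000 requests in the first second of each minute to mimic synchronized cron-driven clients. They are not Bitbucket measurements.

```python
WINDOW_SECONDS = 300  # one 5-minute reporting interval

# Hypothetical per-second request counts: a burst at the top of every minute.
per_second = [1000 if second % 60 == 0 else 10 for second in range(WINDOW_SECONDS)]

five_minute_average = sum(per_second) / WINDOW_SECONDS
peak_per_second = max(per_second)

print(f"5-minute average: {five_minute_average} requests/s")
print(f"Per-second peak:  {peak_per_second} requests/s")
```

With this synthetic traffic, the 5-minute average is 26.5 requests per second, while the true peak is 1000. Capacity planned against the averaged figure would be short by a wide margin during every burst.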

Recurring operational reviews

One of the hallmarks of agility is iterating through a "build, learn, adapt" loop. We took that and applied it to our site reliability and service operations. The concept is to develop an institutional muscle where your teams regularly review anomalies and react to the signals coming from your metrics. This helps stave off future problems at both weekly and quarterly intervals.

Our engineering leaders stay close to our service operations, looking for trends that may point to problems with underinvestment, staff burnout, and other factors that impact reliability. These reviews also help spread responsibility for reliability from a small number of individuals to every level of the organization. Our recurring operational reviews include weekly operations health checks and quarterly operational reviews. You can read about them in more depth in our handbook.

Outcomes of our “Five-Alarm Fire”

Looking back on the Five-Alarm Fire, it's easy to recognize what a transformational moment it was for us. We saw teams gain a more acute awareness of the need to prioritize security and reliability across the company.

An improved culture of reliability

Our engineers now regularly post internal blogs celebrating their victories and new achievements in improving our reliability metrics. Within the software industry, the engineers who gain the most recognition are often those who own new, exciting feature developments. These days at Atlassian, equal (if not greater) accolades are given to those who make a positive impact on increasing reliability. Our promotion criteria have also shifted in line with our updated engineering priorities, where security and reliability are placed above all other work.

The "X alarm fire" framework: A mechanism for managing regressions

Our Five-Alarm Fire started off as an ad hoc initiative that eventually took on a life of its own. In order to sustain the achievements we made during that time, our team developed a systematic way of dealing with new challenges that emerge. We refer to this as our "X alarm fire" framework.

The framework enables teams to sound the alarm, putting the onus on every engineer. They are able to pause some or all of their other work, depending on the severity, and spend the necessary time focused on returning to the acceptable ranges for all of their key metrics. The "X" in "X alarm fire" refers to the severity from 1 to 5. It determines the scale of the regressions, where a 1-alarm fire is small and localized to a team, and a 5-alarm fire is an event that occupies the majority of Atlassian.

This framework helps to empower teams, equipping them with more autonomy and accountability. It allows our organization to self-correct when signals are trending in the wrong direction, without top-down management.

Looking back, what was the impact?

This chapter in our evolution yielded results that make all Atlassians proud. We've launched Atlassian Cloud Enterprise, which comes with our highest uptime guarantee, backed by a 99.95% SLA.
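To put 99.95% in perspective, the arithmetic below converts it into an allowed-downtime budget. The 30-day and 365-day windows are assumptions for illustration; the measurement window and terms of the actual SLA are not stated here.

```python
from fractions import Fraction

# 99.95% availability leaves 0.05% of the window as the downtime budget.
AVAILABILITY = Fraction(9995, 10000)
DOWNTIME_SHARE = 1 - AVAILABILITY


def downtime_budget_minutes(days):
    """Allowed downtime in minutes over a window of the given length in days."""
    return float(DOWNTIME_SHARE * days * 24 * 60)


monthly_budget = downtime_budget_minutes(30)   # assumed 30-day month
yearly_budget = downtime_budget_minutes(365)   # assumed 365-day year

print(f"30 days:  {monthly_budget} minutes")
print(f"365 days: {yearly_budget} minutes")
```

Under those assumptions, the budget is 21.6 minutes per 30 days, or 262.8 minutes (about 4.4 hours) per year. That leaves little room for slow detection or manual recovery.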

Initially, we saw a spike in the number of incidents we were experiencing because more critical bugs and vulnerabilities were being recognized as incidents, triggering a better response and more thorough post-incident review. This spike was positive since it meant we were collectively raising our bar for reliability. Over time, this leveled out and now we're seeing fewer high-severity incidents, faster detection, more automated recovery, and happier customers.

To this day, we continue to use our "X alarm fire" framework to course-correct, empowering every engineer to sound the alarm and identify how to address new challenges that arise. We know the work doesn't stop here, and we're proud of the progress we've made thus far.

Insights that you can apply to your organization

  • Company transformation requires executive prioritization: Company transformations require a senior executive sponsor who is accountable for the results and is willing to enforce the necessary tradeoffs. This person also needs to create empowerment and autonomy to enable their organization to invest in new tools, systems, and processes to make improvements permanent.
  • Documentation is pivotal: Engineering priorities need to be documented and accepted across all levels of an organization.
  • Improving reliability requires a systemic overhaul: Reliability requires systems, tools, processes, and a willingness to invest in a systematic approach to raising standards and keeping them high.
  • Always "trust and verify": Use peer reviews beyond code reviews, and leverage experienced people to uncover potential problems in architecture and operational maturity.
  • Stress-test all systems: Require that senior engineers and leaders take responsibility to stress-test plans and uncover entropy in tools, processes, and systems, as well as require that they be accountable for regressions in reliability. Don't be lulled into a false feeling of safety just because you have systems and rituals in place; they are only as good as their last review.
  • Good intentions don't work at large scale: Scaling a company involves a shift in mindset from relying solely on the good intentions of our employees towards an automated, well-operationalized approach in order to get the difficult things done.
  • Invest in observability: Invest in tools that enable high-resolution insights into your key operational metrics, and don't overdo it with too many metrics that can overwhelm your teams.

Want to learn more and contribute to our TEAM?

If you’re interested in joining our engineering rocketship, visit our Careers page to find out more about our open roles and how you can contribute to our TEAM.
