Taming Test Flakiness: How We Built a Scalable Tool to Detect and Manage Flaky Tests

Flaky tests erode trust in CI and waste thousands of engineering hours. This article explains how Atlassian built Flakinator, a scalable, stack‑agnostic platform that automatically detects, quarantines, and manages flaky tests across millions of daily executions.

Introduction

A year ago, we began addressing flaky tests to improve the Continuous Integration (CI) experience within our monorepo. A manually managed, file-based system was in use; however, it presented several challenges, including a complex workflow, no customisation options, no actionability, a single point of failure, and difficulty scaling. As we progressed in improving our CI ecosystem, it became essential to build a system that is effective, scalable, and configurable, easy to adopt, and designed to minimise friction in developers’ workflows. In response, we developed a platformised, tech-stack-agnostic tool designed to detect, manage, and mitigate flaky tests across all of our codebases: Flakinator.

Before we delve into our solution, it is crucial to grasp the problem’s intricacies and underlying significance.

What are Flaky Tests?

Flaky tests are the bane of any software development team. They fail sporadically without any changes to the underlying code, leading to mistrust in test results, wasted debugging efforts, and disruptions to CI/CD pipelines.

Figure 1: Example of a flaky test

The Hidden Cost of Flaky Tests

Non-deterministic behaviour that leads to random failures creates inefficiencies, forcing developers to repeatedly run builds. This not only consumes valuable engineering hours spent troubleshooting tests that should ideally yield consistent results, but it also diminishes developer satisfaction.

Why is it a Big Deal?

Flaky tests are a well-documented problem in the software development lifecycle (SDLC), and several studies and industry insights highlight the severity of their impact.

Introducing Flakinator

Flakinator serves as an essential offering for our Atlassian products, enabling teams to focus on delivering features and improvements rather than being bogged down by the unpredictability of flaky tests.

Relentless pursuit of flaky tests and their ultimate elimination

Flakinator Capabilities

Efficient Identification

Utilises advanced algorithms and machine learning techniques to efficiently identify flaky tests

Quarantine Mechanisms

Provides an ecosystem to isolate flaky tests from CI pipelines, while still tracking them for review and resolution

Trend Analysis and Reporting

Dashboards which offer insights into trends and patterns over time

Root Cause Analysis

Diagnostics that flag flaky tests and provide in-depth analysis and actionable insights

Custom Settings

Enable teams to set thresholds and priorities based on their own product requirements for identifying flaky tests.

Scalable and Performant

Ensures the tool can adapt to the evolving needs of products, making it a long-term investment in quality and efficiency

User-Friendly Experience

Intuitive dashboards and easy setup, designed to be accessible to teams of all sizes

Collaboration and Communication

Seamlessly integrates with other tools like Jira and Slack, enabling prompt notifications and fostering a collaborative environment for tracking and resolving flaky tests.

Smooth Integrations

Fits easily into existing CI/CD ecosystems, ensuring the transition doesn’t disrupt current workflows.

Design Overview

Flakinator sits in our CI infrastructure, expecting the test run data to be ingested through CI. The ingested records undergo transformation, with raw test data being stored for future use. Various detection mechanisms are implemented for different products to identify flaky tests within the system. Multiple consumers utilise this information, tailoring it to meet their specific needs and visualisations.

Figure 2: Flakinator Ecosystem

How the Ecosystem Works

A test is flaky when it produces inconsistent results across runs with the same code, passing sometimes and failing other times without code changes. Flakinator uses several algorithms to detect flaky tests. After detecting a flaky test, the code ownership system identifies its owners, creates Jira tickets with deadlines to resolve them, and sends Slack notifications if configured.

We collect signals for quarantined tests by running them in branch builds, scheduled jobs, or quarantine pipelines to gather results against the latest code changes. Test health is calculated from the collected results. If a test remains healthy for a configured period, we remove it from quarantine and reintroduce it into the system. Monitoring and dashboards track quarantined and unquarantined test volumes. Once teams have resolved the underlying issues, the flaky test tickets are closed and the tests are reintegrated into the system.
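The health check that gates unquarantining can be sketched roughly as follows. This is a minimal sketch: the window length, the pass-rate threshold, and the shape of the result records are illustrative assumptions, not Flakinator's actual configuration.

```python
from dataclasses import dataclass

@dataclass
class QuarantineConfig:
    healthy_days: int = 14      # consecutive healthy days required to unquarantine (assumed value)
    min_pass_rate: float = 1.0  # pass rate required for a day to count as healthy (assumed value)

def should_unquarantine(daily_results: list[list[bool]], cfg: QuarantineConfig) -> bool:
    """daily_results[i] holds the pass/fail outcomes collected on day i
    (most recent day last) from branch builds or quarantine pipelines."""
    if len(daily_results) < cfg.healthy_days:
        return False  # not enough signal collected yet
    for day in daily_results[-cfg.healthy_days:]:
        if not day:
            return False  # a day without runs yields no signal
        if sum(day) / len(day) < cfg.min_pass_rate:
            return False  # an unhealthy day resets the clock
    return True
```

A stricter policy could also require a minimum number of runs per day before a day counts as healthy.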

How It All Comes Together

Flakinator is built on a scalable, distributed architecture to handle the large volume of test data across multiple Atlassian products. Here’s an overview of the key components:

Figure 3: Flakinator Architecture Diagram

Detection Algorithms

RETRY Detection Mechanism

Rerun the failing test in the same build and use that data to find flakes. The Flakinator CLI, integrated into the pipelines, checks whether failing test cases are already designated as flaky. If a test is not on the flaky list, an implicit retry mechanism is employed to collect flaky signals, with the circuit breaking at the first occurrence of a flip signal. The number of retries is configurable and varies depending on the test type. When flip signals are received, newly identified flaky tests are logged in the database to improve the efficiency of future builds. This approach has enabled us to achieve an 81% detection rate for certain products.

Figure 4: Rerun detection workflow
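The retry loop above can be sketched as follows. The function name, the retry budget, and the callable interface are assumptions for illustration; the real Flakinator CLI hooks into the build pipeline rather than invoking tests directly.

```python
def detect_flake_by_retry(run_test, max_retries: int = 3) -> bool:
    """Rerun a failing test up to max_retries times, circuit-breaking at the
    first flip signal (a pass following the initial failure on identical code).
    run_test is a zero-argument callable returning True on pass."""
    if run_test():
        return False  # initial run passed: nothing to detect
    for _ in range(max_retries):
        if run_test():
            return True  # flip signal: fail -> pass without a code change
    return False  # consistently failing: a genuine failure, not a flake
```

A detected flake would then be logged to the database so subsequent builds can skip the retry and quarantine the test immediately.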

For a test case history like the one below, the yellow entries are flakes. This information is the signal we use to quarantine a test.

Figure 5: Sample test run results

Bayesian Inference for Flakiness Detection

Bayes’ theorem is a theorem in statistics that provides a formula for the probability of an event A occurring given that event B has already occurred. In other words, it is used to update the probability of a hypothesis based on new evidence.

Figure 6: Equation
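In standard notation, the theorem reads:

```latex
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
```

Here P(A) is the prior probability of A, P(B | A) is the likelihood of the evidence B under A, and P(A | B) is the posterior probability of A after observing B.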

Conditional probability is the likelihood of an outcome occurring based on a previous outcome in similar circumstances. Bayes’ theorem relies on using prior probability distributions in order to generate posterior probabilities.

Bayesian inference

In Bayesian statistical inference, prior probability is the probability of an event occurring before new data is collected. Posterior probability is the revised probability of an event occurring after considering the new information.

For the use case of creating a flakiness score for a test case, we use the prior probability distribution of a test case’s historic runs and derive the posterior probability from it. The analysis/inference component consists of three modules.

Figure 7: Flakiness score detection workflow in action
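A minimal sketch of this kind of posterior update, assuming a Beta prior on a test's flip rate and treating each pass/fail flip between adjacent runs of the same code as evidence of flakiness. The prior parameters and the flip-counting rule are illustrative assumptions; Flakinator's actual inference modules are more involved.

```python
def flakiness_posterior(history: list[bool], alpha: float = 1.0, beta: float = 1.0) -> float:
    """history: chronological pass (True) / fail (False) outcomes for one test.
    Each adjacent flip counts as evidence for flakiness, each stable pair as
    evidence against it. Returns the posterior mean flip rate under a
    Beta(alpha, beta) prior, usable as a flakiness score in [0, 1]."""
    flips = sum(1 for a, b in zip(history, history[1:]) if a != b)
    stable = len(history) - 1 - flips
    # Beta-Binomial conjugate update: posterior mean of the flip probability
    return (alpha + flips) / (alpha + beta + flips + stable)
```

A stable history yields a score near 0, while a history that alternates between pass and fail pushes the score towards 1; new evidence from each build simply extends the history and shifts the posterior.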

An example of a low-quality test case, where the test shows non-deterministic results in CI across multiple commits:

Figure 8: Low-quality test with a bad quality score

Results and Impact

Since deploying Flakinator, we’ve seen significant improvements in CI build stabilisation across our engineering products. This tool is currently utilised by over 12 products within Atlassian.

Flakinator has significantly impacted various Atlassian repositories by proactively detecting flaky tests and quarantining them to prevent build failures. This approach saves build minutes and reduces cost. An alerting and notification mechanism notifies the team when a flaky test is detected and creates actions for the team to own and fix the test so it can be restored to the system. As of the last quarter, Flakinator successfully recovered more than 22,000 builds and identified 7,000 unique flaky tests, leading to considerable cost savings.

This tool enhances build reliability, conserves development hours, and reduces CI resource consumption by minimising the need for test reruns, ultimately accelerating time to market.

Figure 9: Snapshot of Builds recovered by Flakinator

Metrics give teams visibility into their quality and performance by tracking key indicators. For example, tracking the flaky test rate at the team level highlights which teams are contributing the most to pipeline instability, motivating them to prioritise fixing flaky tests. Data-driven insights also help teams and leadership forecast the effort and time required for specific tasks, such as reducing test flakiness in the packages they own.
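The team-level flaky rate above is a simple aggregation; a sketch follows, where the record shape (test name, owning team, flaky flag) is an assumption for illustration.

```python
from collections import defaultdict

def flaky_rate_per_team(tests: list[tuple[str, str, bool]]) -> dict[str, float]:
    """tests: (test_name, owning_team, is_flaky) triples from the detection
    database. Returns each team's share of owned tests currently flagged flaky."""
    total = defaultdict(int)
    flaky = defaultdict(int)
    for _name, team, is_flaky in tests:
        total[team] += 1
        flaky[team] += is_flaky  # bools sum as 0/1
    return {team: flaky[team] / total[team] for team in total}
```

Tracking this ratio over time, rather than raw counts, keeps the metric comparable across teams that own very different numbers of tests.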

Figure 10: Flaky tests per team


Figure 11: Flaky test types per team


Figure 12: Flaky tests per team

Lessons Learned

Building a flaky test management system wasn’t without its challenges. Here are some of the key lessons we learned:

  1. Data Quality Matters: Inconsistent or missing test metadata can lead to inaccurate flakiness detection, so it’s critical to invest in reliable data collection.
  2. Iterate on Algorithms: No single algorithm works universally. Combining heuristics, statistical methods, and machine learning provided the most accurate results.
  3. Prioritise Developer Experience: A tool is only effective if developers use it. We focused heavily on building an intuitive UI and smooth integrations with existing workflows.

Future Plans

We are continuously improving our flaky test management tool. We plan to use machine learning algorithms to improve its ability to predict outcomes, identify patterns, and forecast issues. Our goal is for the tool to automatically fix flaky tests by addressing common problems such as timeouts, mocking failures, and environmental dependencies.


Flaky tests are an inevitable challenge in large-scale software development, but they don’t have to derail your CI/CD pipelines. By building a robust flaky test management system, we’ve improved build reliability, streamlined developer workflows, and saved resources across Atlassian.

We hope this blog inspires you to tackle test flakiness in your own organisation. Thanks!
