Work Life by Atlassian

Automating Mutation Coverage with AI: Our Journey and Key Learnings

How it all started

Early last year (2025), I spent a few hours manually improving mutation coverage for one class.

Fast-forward to the team’s innovation week (Sept 2025), where AI was the focus. I began experimenting with Rovo Dev CLI, asking a simple question:

“Can an AI assistant read a mutation report and actually write the right tests for me?”

What began as a small experiment quickly turned into a reusable Mutation Coverage AI Assistant that multiple teams now use to reliably push projects towards 80%+ mutation coverage—without drowning in manual analysis and boilerplate tests.

This blog is the story of that journey:
why we built it, how it works, what impact it had, and what we learned along the way.


Why we needed an AI solution

Improving mutation coverage is critical for test quality—but painful at scale.

Why mutation coverage is hard (especially at project level)

Why LLMs are a great fit here

We realized this problem is a sweet spot for LLMs:

So we asked: What if, instead of prompting an LLM multiple times, we gave it a complete end-to-end workflow to follow?


Meet the Mutation Coverage AI Assistant

The Mutation Coverage AI Assistant is a set of workflow instructions that guide an LLM through the entire process of closing mutation coverage gaps for a project.

Instead of you doing everything manually, the assistant:

Mutation Coverage AI Assistant launched via Rovo Dev CLI

You can think of it as a Dev-in-the-loop AI copilot specifically tuned for improving mutation coverage.


How is this different from just using other AI tools or manual prompts?

Many of us have already tried:

The Mutation Coverage AI Assistant is designed to go beyond that.

Comparison of approaches

| Solution | Manual prompts to LLMs | Test tools in AI plugins (e.g., /tests) | Mutation Coverage AI Assistant (this solution) |
| --- | --- | --- | --- |
| Description | You manually craft prompts for each step to improve mutation coverage across a project. | /tests (or similar) auto-generates tests for a given class. | Load the AI assistant using Rovo Dev CLI and then provide inputs as requested. |
| How it addresses the mutation coverage gap for a project | Easy to start, but every step requires new prompts. LLM behavior is unpredictable with no guided workflow, so time to close gaps can be high. | Generates tests, but has no awareness of mutation reports before/after. Tests may not meaningfully improve mutation coverage. | Follows a guided automation approach: generates mutation reports, recommends which class/package to target, and supports overrides. Designed to reduce total time to close project-wide gaps. |

Why it works across different projects

Adaptive across different projects

The assistant is project-aware:

This means:

Test Patterns identified for a sample project

Dev-in-the-loop AI Assistant (Guided Automation)

We deliberately chose Dev-in-the-loop over full autonomy.

Improving mutation coverage across projects owned by multiple teams requires human judgment at key steps:

If we had let the AI run fully autonomously across “x” classes:

With Dev-in-the-loop:


Auto-validation of tests

One of the most important lessons:
test generation is not enough—you must check if coverage actually improves.

The workflow explicitly instructs the LLM to:

  1. Implement tests.
  2. Rerun mutation tests.
  3. Verify whether mutation coverage % improved.
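As a sketch of what this validation step amounts to, the check below computes mutation coverage from PIT-style mutant statuses (KILLED, TIMED_OUT, SURVIVED, NO_COVERAGE) and compares two runs. The class and method names are illustrative, not part of the assistant itself, and real PIT reports would be parsed from mutations.xml rather than passed in as lists.

```java
import java.util.List;

// Minimal sketch of the auto-validation gate: compute mutation coverage
// from mutant statuses and only accept generated tests if coverage
// actually improved. Names here are illustrative.
class MutationGate {

    // PIT counts KILLED and TIMED_OUT mutants as detected.
    static boolean isDetected(String status) {
        return status.equals("KILLED") || status.equals("TIMED_OUT");
    }

    // Mutation coverage % = detected mutants / total mutants.
    static double coverage(List<String> statuses) {
        if (statuses.isEmpty()) return 0.0;
        long detected = statuses.stream().filter(MutationGate::isDetected).count();
        return 100.0 * detected / statuses.size();
    }

    // The workflow keeps the new tests only when the "after" run beats
    // the "before" run; otherwise the changes are revisited.
    static boolean improved(List<String> before, List<String> after) {
        return coverage(after) > coverage(before);
    }
}
```

Making this comparison an explicit step is what prevents the assistant from declaring success on tests that compile and pass but leave every mutant alive.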

This auto-validation step:


From experiment to real impact across teams

Mutation Coverage AI Assistant Adoption Process

Early Adopters

We first used this assistant on a few selected projects to get early feedback on its usage.

Positives

Improvement areas

Despite those rough edges, the impact was clear:

Mutation coverage improvement data

| Project | Before (%) | After (%) |
| --- | --- | --- |
| jira-project-a | 56 | 80 |
| jira-project-b | 70 | 88 |
| jira-project-c | 83 | 96 |
| jira-project-d | 71 | 80 |
| jira-project-e | 84 | 90 |

Expanding to more projects

As confidence grew, we opened it up to more teams:

For a migration project in Jira, which had hundreds of classes, ~1500 mutants were killed using this solution with significantly less time spent. This became a strong proof point that the solution could handle large projects.


Lessons Learned

1. Larger projects: High mutation report generation time

Problem

On large projects with many classes/packages, generating a mutation report for the entire project could take >10 minutes.

What we changed:

Enhanced the workflow to support package-level mutation analysis:

This shift significantly improved adoption on large projects.

We also enabled multi-threading for faster mutation report generation.
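Package-level analysis also changes what the assistant recommends: instead of ranking individual classes, it can rank packages by remaining gaps. The sketch below shows one way that recommendation could work; the `Mutant` shape and method names are illustrative, not PIT's actual report model.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch of package-level targeting: given mutants from a report, count
// the surviving/uncovered mutants per package and recommend the package
// with the largest gap as the next target. Names are illustrative.
class PackageTargeting {

    record Mutant(String pkg, String status) {}

    static String recommendPackage(List<Mutant> mutants) {
        Map<String, Long> gapsByPackage = mutants.stream()
                .filter(m -> m.status().equals("SURVIVED") || m.status().equals("NO_COVERAGE"))
                .collect(Collectors.groupingBy(Mutant::pkg, Collectors.counting()));
        return gapsByPackage.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElse(null); // no surviving mutants: nothing to recommend
    }
}
```

Scoping each run to one recommended package keeps report-generation time bounded even on projects with hundreds of classes.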

2. LLM limitations: when instructions weren’t strict enough

Problem

Without crystal-clear instructions for the test implementation step, the assistant sometimes generated problematic changes:

What we did: tighten the workflow instructions

We updated the workflow to include explicit guidelines to avoid these mistakes.

Result:

3. Generic “kill mutants” instructions led to unwanted, redundant tests

Problem

With generic instructions like “Write tests to kill mutants to improve mutation coverage,” the assistant sometimes generated excess tests to kill the remaining mutants.

This caused:

What we did: Added detailed, mutant-specific test-writing guidelines

To address this, we:

Result:
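To make the difference concrete, here is a hypothetical example of a mutant-specific guideline in action. PIT’s conditionals-boundary mutator rewrites `>=` into `>`, and only an input at the exact boundary distinguishes the original from the mutant, so one focused assertion kills it where a pile of generic tests would not. The `DiscountPolicy` class and its threshold are invented for illustration.

```java
// Hypothetical production code with a boundary that PIT's
// CONDITIONALS_BOUNDARY mutator would rewrite from ">=" to ">".
class DiscountPolicy {

    static boolean qualifiesForDiscount(int orderTotal) {
        return orderTotal >= 100; // mutant version: orderTotal > 100
    }

    public static void main(String[] args) {
        // Mutant-specific test: at exactly 100 the original returns true
        // and the mutant returns false, so this single assertion kills it.
        // Inputs far from the boundary (e.g. 500) pass on both versions
        // and leave the mutant alive, no matter how many tests you add.
        if (!qualifiesForDiscount(100)) throw new AssertionError("boundary mutant survived");
    }
}
```

Guidelines like this, written per mutant type, are what steer the assistant toward one precise test per surviving mutant instead of redundant boilerplate.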

4. Bloated classes: AI can’t fix design issues

Challenge

For classes with too many responsibilities:

Learning

Observation:

Learning:

5. Enhancements around the clock: prompts as evolving artifacts

Challenge

The first version of the prompt instructions was far from perfect.

What we did: continuous improvement

This collaborative model turned the assistant into a constantly improving system that keeps getting better with real-world use.

6. Team collaboration

This became a great example of cross-team AI collaboration:


Current Challenges

Generating more tests than required

We addressed this by adding specific guidelines for killing each mutant type. Initial results are promising but require more usage to confirm.

LLM model selection

Model choice can affect test quality. Developers should know which model is in use and override the default when needed to match their project’s needs.

Pragmatic usage of AI tools

Developers must recognize when to stop if the AI session is unproductive. Due to LLM limitations, the AI tool may sometimes stop writing tests after miscalculating the mutation coverage %, or get stuck in an endless loop, writing tests without improving coverage. Restarting the AI session resets the LLM memory context and can help. If the hiccups persist, the workflow instructions themselves need proper fixing.


What’s next?

As adoption grows and the prompt instructions mature, this could naturally extend to more autonomous AI agent use cases. For now, we’re intentionally keeping Dev-in-the-loop to ensure that:


Credits
