How We Turned Feature Flag Cleanup Into a Mostly‑Hands‑Off AI Workflow

Problem

At Atlassian, we ship every new feature and code path behind a feature flag to stage rollouts safely and catch problems early.

As products mature, feature flags accumulate. Each successful rollout leaves behind dead code paths, stale configuration, and follow‑up tickets to “clean up the flag later”. With company‑wide requirements for feature flag coverage, this work is important, but it’s also highly repetitive and easy to postpone.

One of my services has managed 160 feature flags since its inception, 40 of them created in the past 60 days, which shows how feature flags have become part of developers’ daily feature work. And in the era of AI, this rate is only increasing.

Each feature flag has a Work Item for its cleanup. Engineers must track these Work Items to remove dead code and unnecessary complexity, which causes context switching and interrupts new feature development.

Between interruptions and Sprint planning, our cleanup rate lagged behind the introduction of new feature flags. Consequently, we sometimes had to prioritize cleanup over feature tasks in our Sprints.

So, with the time we allocate to developer productivity, we decided to do something about it, with the help of Rovo Dev!

The first iteration

As with every new flow or experiment, we wanted to start from a basic flow and improve it in later iterations. The obvious first step, which some of my teammates were already using, was to run the cleanup directly through the Rovo Dev CLI.


Why “generic AI prompts” weren’t enough

Most AI coding tools ship with a generic “feature flag cleanup” capability. In practice, that wasn’t good enough for us.

Our first experiments with default prompts were largely ineffective: the AI often missed critical call paths, broke tests, or left configuration inconsistent. Developers quickly reached the point where fixing the AI’s output was more work than doing the cleanup manually. In other words, the tooling discouraged automation.

The main problem: the prompts didn’t understand our reality:

So instead of giving up on AI, we decided to specialize it.


Step 1: A repo‑tailored cleanup command

We started by creating a dedicated saved prompt for our service, based on one of our manual feature flag cleanup commits.

Pro tip! To create a saved prompt in the Rovo Dev CLI, use /prompts → Create a new saved prompt within your current folder.

We created two commands: one for cleaning up a single flag on your current branch; and another where Rovo Dev creates separate branches and changes for each of the feature flags you specify, reusing the single‑flag cleanup prompt for each. Here is an example of such a command created for a demo project:

1. Cleanup feature flag demo



2. Cleanup feature flag demo



3. Cleanup feature flag demo

4. Cleanup feature flag demo – code examples

Batch feature flag cleanup prompt – an example of referencing one prompt from within another prompt
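Since the prompts above appear only as screenshots, here is a minimal, hypothetical sketch of what a single‑flag cleanup saved prompt file could look like. The helper name `featureGates.isEnabled` and the `$FLAG_KEY` placeholder are illustrative assumptions, not our actual prompt:

```markdown
# clean-feature-gate (hypothetical sketch)

Remove the feature flag `$FLAG_KEY` from this repository.

1. Find every usage of `featureGates.isEnabled("$FLAG_KEY")`.
2. Keep the "flag enabled" code path; delete the fallback branch.
3. Remove the flag from config files and test fixtures.
4. Run the test suite and fix any failures your changes introduced.
5. Do NOT touch unrelated flags or refactor surrounding code.
```

The batch command is then just another saved prompt that iterates over a list of flag keys, creating a branch per flag and invoking this single‑flag prompt for each.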

These commands encode how feature flags work in this codebase:

Because they live in the repo, we can reuse them in multiple tools:

This aligns with our broader AI principle: reusability and a single source of truth for commands and context, stored alongside the code.

So, once all of this was done, we wanted to try it on a few flags at the same time. We used Rovo Dev in Jira (Work with Rovo Dev in Jira | Rovo | Atlassian Support) with our saved prompt.

Example of Rovo Dev session listing in Jira

Why was Rovo Dev in Jira valuable for us in this case? We could run multiple Rovo Dev sessions simultaneously and access them all directly through our browser.

This setup is especially useful for improving or creating saved prompts and AI flows, as it allows you to identify issues and fix them immediately with Rovo, right in the browser and at a scale worthy of properly testing the flow.

Rovo Dev in Jira session for demo project

Clear visibility of steps done by AI
Rovo Dev demo project PR ready to be merged

Step 2: Measure, then self‑improve

We didn’t expect to get this right on the first try – and we didn’t.

On our initial run of the custom saved clean-feature-gate prompt, we got:

That’s already far better than the generic prompt experience, but still not “trustworthy” for automation.

To close the gap, we aimed to improve it further, reflecting daily on how to enhance our flows and knowledge. With Rovo, we entered a “self‑improvement” phase.

We created the improve-command.md saved prompt.

Improve prompt

Pro tip: You might not need your custom “self-improve” saved prompt. You can also use /memory reflect [file] directly from Rovo Dev!

For each unsuccessful cleanup PR, we followed the same loop:

  1. Review the AI’s changes in the Rovo Dev session in Jira.
Example view of Rovo Dev in Jira (source)

  2. Provide clear feedback (what it missed, where it over‑simplified logic, which cases were unsafe) for Rovo Dev to fix.

  3. Run improve-command.md against the previous command, using the session context and our feedback.

  4. Allow Rovo Dev to update the command definition to prevent similar failures in the future.
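For illustration, a hypothetical sketch of what improve-command.md might contain (the $FEEDBACK placeholder is an assumption; the file names come from our actual setup):

```markdown
# improve-command (hypothetical sketch)

You previously ran the saved prompt `clean-feature-gate` in this session,
and the resulting PR needed manual fixes.

1. Read the feedback below and the session history.
2. Identify which instruction in `clean-feature-gate.md` was missing,
   ambiguous, or wrong.
3. Update `clean-feature-gate.md` so the same failure cannot happen again.
4. Keep the prompt short; prefer precise rules over long explanations.

Feedback: $FEEDBACK
```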

Self improvement loop

Rovo Dev improve prompt results

This self‑improvement loop raised our effectiveness to:

The final “failure” wasn’t really an AI failure at all – the missed logic came from a gap in our test coverage. That result was a useful reminder: good automation depends on good, clean, well‑tested code.


Step 3: Jira Automation + Rovo Dev = hands‑off generation

With a solid prompt in place, we wired it into Jira Automation.

We added a simple rule:

Feature flag cleanup automation
Rovo Dev configuration

All saved prompts from the repo are visible directly in the UI

We’re not reinventing anything here – we’re just:

From the developer’s perspective, “turn this into a PR” becomes as simple as labelling a ticket.

Why do we love this approach? Everything is in Jira; if anything goes wrong, we don’t have to leave our local code – we can just quickly sneak a peek into the Rovo Dev session.


Step 4: A queue to avoid merge conflicts

One practical issue we hit quickly: running cleanup for many flags at once increased the chance of merge conflicts, especially in high‑churn areas of the codebase. It also created unnecessary peaks in review load for engineers.

To solve that, we added another layer of automation: a Jira‑driven queue for feature flag cleanup.

Every hour, an automation runs:

  1. Check if there are any FF cleanup tasks currently IN_PROGRESS or IN_REVIEW.
    • If yes → do nothing; the queue is “busy”.
  2. If the queue is empty:
    • Find the oldest eligible feature flag cleanup task (marked as ready, or older than 30 days).
    • Add the rovo label to that Work Item.
    • Transition it to IN_PROGRESS (so the queue is effectively locked).
    • Add a comment.
    • Send a Slack ping.
Jira Automation comment
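The hourly rule above can be sketched as a small decision function. This is a sketch under assumed status names and a hypothetical CleanupTask shape, not the actual Jira Automation implementation:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Statuses that mean a cleanup is already in flight, so the queue is "busy".
BUSY_STATUSES = {"IN_PROGRESS", "IN_REVIEW"}

@dataclass
class CleanupTask:
    key: str            # Work Item key, e.g. "FF-123"
    status: str         # e.g. "TO_DO", "IN_PROGRESS", "IN_REVIEW"
    created: datetime
    ready: bool         # manually marked as ready for cleanup

def next_task_to_queue(tasks: list[CleanupTask], now: datetime) -> Optional[CleanupTask]:
    """Return the task to kick off this hour, or None if the queue is busy/empty.

    Mirrors the hourly rule: if any cleanup is IN_PROGRESS or IN_REVIEW,
    do nothing; otherwise pick the oldest task that is either marked ready
    or more than 30 days old.
    """
    if any(t.status in BUSY_STATUSES for t in tasks):
        return None  # queue is busy; wait for the open PR to land
    eligible = [
        t for t in tasks
        if t.status == "TO_DO" and (t.ready or now - t.created > timedelta(days=30))
    ]
    if not eligible:
        return None
    return min(eligible, key=lambda t: t.created)
```

The chosen task would then get the rovo label, the IN_PROGRESS transition (locking the queue), the comment, and the Slack ping, as described above.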

From there, the previously described automation takes over:

The important aspect is flow control: only one cleanup is in motion at a time, which dramatically reduces noise and merge conflicts while keeping the pipeline consistently moving.


What this gave us

By investing in a repo‑specific AI flow and treating commands as first‑class assets, we achieved:


Lessons for other teams

If you want to build something similar for your own product, a few takeaways:

  1. Don’t expect generic prompts to understand your world.
    Invest in repo‑specific commands that speak the language of your codebase, and in proper AGENTS.md content for any generic Rovo Dev instructions that aren’t specific to a single use case.
  2. Store AI “brains” in the repo.
    Commands, prompts, and agent context (AGENTS.md, review-agent.md, etc.) should live with the code so they can be versioned, reviewed, and improved like any other asset.
  3. Automate the improvement of your automation.
    A simple “improve this command based on what just happened” loop compounds quickly.
  4. Add orchestration and queuing before going “fully automatic”.
    A minimal queue (like our hourly Jira rule) can solve real‑world issues like merge conflicts without complex infrastructure.
  5. Measure success in real outcomes, not just “AI usage”.
    For us, that meant:
    • How many PRs needed no manual edits?
    • How much developer time was actually saved?
    • Did code quality and safety remain high?