Bugs and tech debt are an inevitable part of a developer’s workload. Most teams live with an ever-growing backlog of these issues, and come to a kind of uneasy acceptance: it’s not great, but you can’t really fix it. That’s how we felt in the first four years of Statuspage’s existence.
Over time, our team grew more and more frustrated. It felt like we weren’t doing right by our customers, and that we were expending tons of effort on maintenance while that deep, dark ocean of work that needed to be done kept growing.
Enough was enough. We came up with a plan to tackle this burdensome “overhead” and put it into action. In six months, we reduced our ungroomed backlog to zero, and have kept it close to that ever since (as I type this, we have seven tickets in our ungroomed backlog). More importantly, we’re taking better care of our customers. Our bug response time dropped from “good luck with that!” to about a week on average. Our customers are happier and our team is healthier.
Here’s how we did it.
A broken system
In the early years before we created our Overhead process, most Statuspage bug reports were passed from customer support to the escalation engineer (EE, or “sweeper”). One of our engineers took a turn as EE every week, helping the customer support team with tickets as they came in.
Most of the time, the EE would either figure out what was wrong with a customer’s account and manually fix the data, or push a simple one-liner bug fix. For more complex issues, the EE would usually ask a developer to fix it. That approach worked well enough when the team was small, but it posed some major problems that only got worse as we grew. Customer support had to babysit the devs in order to get a response to customers, and response times varied widely. Ad-hoc projects assigned peer-to-peer interfered with our mainline work. Large or complex issues were filed in Jira and never looked at again. “Put it in Jira” became code for “forget about it,” and since putting things in Jira didn’t feel valuable, people often didn’t bother, so a lot of good ideas and observations weren’t captured.
When we decided to change our process, we started with a few main observations:
- We can’t fix everything immediately, but that’s okay as long as we set expectations with customers and keep them updated throughout the process.
- We can tame the “unpredictable” flow of interrupt work – unexpected bugs and high-priority requests – by throttling it. If we know the long-term rate of tickets flowing in, and fix at the same rate or just a little faster, then they won’t pile up.
- That ocean of work to be done was actually full of work that was done, wasn’t relevant anymore, or didn’t fit with our product vision. If we could filter out all this noise, what was left would be manageable.
And we were driven by one more guiding principle: whenever possible, we should make the process fun.
Rethinking the process
The process that evolved from these principles is actually pretty simple. We started by giving each developer two poker chips. The developers earn more poker chips when they complete Overhead tickets.
Every week, we collect a tax – typically one poker chip from each developer. The tax goes up and down as needed to make sure we’re draining the Overhead backlog.
The developers owe the tax each week, but typically have extra chips “in the bank,” so they can use their reserves to pay the tax – and avoid overhead work – on a week when they’re really busy, and tackle more overhead work on lighter weeks to rebuild their stockpile.
By empowering developers to control their own schedules, we actually get more mainline and overhead work done, with less need for micromanagement.
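For teams that would rather track chips in a script than on a desk, here is a minimal sketch of the ledger described above. The `Developer` class, its method names, and the one-chip-per-ticket reward are illustrative assumptions, not our actual tooling; the post only fixes the starting balance of two chips and the typical tax of one.

```python
from dataclasses import dataclass


@dataclass
class Developer:
    """One developer's poker-chip balance (hypothetical helper)."""
    name: str
    chips: int = 2  # each developer starts with two chips

    def complete_ticket(self, reward: int = 1) -> None:
        # Assumption: finishing one Overhead ticket earns one chip.
        self.chips += reward

    def pay_tax(self, tax: int = 1) -> bool:
        # Returns False when the balance cannot cover the tax;
        # what happens next is left to the team.
        if self.chips < tax:
            return False
        self.chips -= tax
        return True


dev = Developer("sam")

# Busy week: no Overhead work, so the tax comes out of reserves.
paid_week1 = dev.pay_tax()

# Lighter week: two tickets completed, then the tax is paid.
dev.complete_ticket()
dev.complete_ticket()
paid_week2 = dev.pay_tax()

print(dev.chips, paid_week1, paid_week2)
```

The point of the sketch is the buffer: the balance absorbs a busy week, and a lighter week rebuilds it.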
Keeping work organized
With a mechanism in place to drain the backlog, we needed to figure out a way to keep it well-groomed. We came up with a triage process that would classify each Jira issue. Some are closed as “won’t do,” others have already been done, and for the rest, we make sure to include a detailed description of what needs to be done. A newly formed Overhead team, composed of one developer, one PM, and one support engineer, meets weekly to triage any new issues that have come in and rank all the issues we do want the developers to handle.
In the beginning, that same team met about once a month for extra sessions of backlog grooming. It took a dozen hours or so, but we eventually got the entire backlog triaged, and have kept it that way for more than two years.
With a well-maintained backlog for the developers to work with, all we have to do is set the weekly tax higher than the rate of new, triaged tickets being added to the backlog, and our backlog of bugs and tech debt steadily shrinks toward zero.
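The arithmetic behind that rule can be sketched in a few lines. This is a rough illustration rather than the formula we use: the function name, the `margin` parameter, and the assumption that one ticket earns one chip are all hypothetical.

```python
import math


def weekly_tax(new_tickets_per_week: float,
               developers: int,
               chips_per_ticket: int = 1,
               margin: float = 0.1) -> int:
    """Whole-chip tax per developer that outpaces ticket inflow.

    Assumes every developer pays the same tax and, over the long run,
    earns back exactly the chips they pay by closing tickets.
    """
    if developers <= 0:
        raise ValueError("developers must be positive")
    # Tickets the team should close each week to stay ahead of inflow.
    target_tickets = new_tickets_per_week * (1 + margin)
    # Chips each developer must earn (and therefore pay) per week.
    chips_needed = target_tickets * chips_per_ticket / developers
    return math.ceil(chips_needed)


# Example numbers are made up: 6 new triaged tickets a week, 8 developers.
print(weekly_tax(6, 8))
# If inflow doubles to 12 tickets a week, the tax has to rise.
print(weekly_tax(12, 8))
```

Rounding up to a whole chip means the team usually drains the backlog a little faster than tickets arrive, which is the "just a little faster" rate we aim for.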
Our new and improved normal
Since adopting the Overhead process, we’ve gotten our bug backlog in check without derailing our mainline feature development work.
The biggest wins have been for our customers. Because individual issues are triaged, assigned, and completed at a regular cadence, we’re fixing more bugs – and more important bugs – faster. Customers appreciate that we can respond quickly, set expectations for what can and cannot be fixed, and then deliver the fixes we promise.
Our customer support team has really benefited from this change as well. The improved process makes it easier for them to keep customers informed, and since every issue ends up with a resolution, the support team can always close the loop without having to babysit developers.
The Overhead process has been fun for developers, too. The dev team enjoys managing their own workflow, and watching the backlog get steadily smaller over time has been great for team morale.
Overall, this strategy has made a huge difference for our team. We’ve battle-tested it for several years, and now we’re happy to share it with other teams in hopes that others can benefit as much as we have.