A Trello feature froze, but engineering didn’t: How we fixed an unreproducible React bug

Recently, the Trello engineering team fixed a React bug where Trello's date picker wouldn't render correctly. When some customers tried to edit the date on a Trello card, they'd only see a blank popover. At first, no one on the team was able to reproduce the bug or anything similar to it.

The solution turned out to be a one-line change. However, finding it was challenging, and I hope some of the debugging strategies we used can be helpful for others who work with React or browser APIs like MutationObserver.

Customers sent in screenshots showing an empty date picker popover. When working correctly, it should display a calendar to choose a due date.

The main problem

Trello's support team was unable to resolve this issue that customers had reached out about. They escalated the bug to the team I'm on since we own due date functionality.

The bug affected only a small set of users (ten customers), and none of us were able to reproduce it.

The affected customers found that the only solution was switching to another device or another browser. They were rarely able to work around the bug in the same browser, even after disabling extensions, installing browser and system updates, clearing local data, and so on. Based on their troubleshooting, we guessed that we were dealing with a device- and browser-specific bug.

We spent a few hours trying to reproduce the bug on various devices and browsers. When that didn't work, we also tried another tool called Support Impersonation. With a customer's consent, an engineer can log into the customer's Trello account to reproduce a bug. This tool helps us investigate bugs that are caused by a specific customer's data. We were not able to reproduce the bug with Support Impersonation either.

Diving into clues

Feeling a bit directionless and hopeless, I brewed a cup of coffee and started skimming through the support tickets for clues.

Clue 1

One clue was that a customer mentioned that they could edit a card's date if they used the card's quick edit menu. For context, there are multiple ways to change the due date on a Trello card. The quick edit menu can be shown by clicking on a card's pencil icon, rather than opening the card.

Screenshot of a Trello list. When hovering the mouse over a card in this list, a pencil icon appears next to the card's name.

Why would only one of these paths work? Regardless of where we render the date picker, it's still the same component and the same code.

First, I verified my assumption that we actually render the same component in both places, using some high-tech console.logs. I confirmed it was LazyDateRangePicker, a lazily-loaded React component, in both cases.

With Chrome DevTools open, I noticed that we rendered the popover (the container for LazyDateRangePicker) differently in each case. Specifically, the quick edit menu was using Trello's legacy tech stack.

Some older parts of Trello are written in the legacy tech stack, using older technologies like Backbone.js. Newer parts of Trello use a modern tech stack, including React, TypeScript, and GraphQL. We're incrementally migrating parts of Trello to the new tech stack, which means we often have visually-identical components built in both stacks. In particular, our Popover component existed in both stacks.

The quick edit menu rendered a popover in our legacy tech stack. I could quickly tell because the DOM included global CSS classes like pop-over-content, which we don’t use in the modern tech stack:

Chrome DevTools: Inspecting the "quick edit" menu's date picker

▼ <div class="pop-over-content js-pop-over-content u-fancy-scrollbar js-tab-parent">
  ▼ <div class="js-react-root">
    ▼ <div class="_335woopUENvU1q" data-test-id="date-picker-form">
      …

The problematic due date menu was rendered in our modern tech stack. I could tell because the CSS classes are build-generated strings like _3T2HVoRLE4XDEy:

Chrome DevTools: Inspecting the opened card's date picker

▼ <section class="_3T2HVoRLE4XDEy js-react-root" data-elevation="2">
  ▶ <header class="_2UGtCq06p3VEd4">…</header>
  ▼ <div class="js-react-root">
    ▼ <div class="_335woopUENvU1q" data-test-id="date-picker-form">
      …

Given this difference, it seemed likely our bug existed somewhere in Trello's modern-stack Popover component. This component is a Trello-specific wrapper around @atlaskit/popper, which is a wrapper around react-popper.

Clue 2

On to the next clue. Several customers used language like "Trello is frozen" or "the page crashes" to describe the behavior.

Modal dialog in Chrome that says: Page Unresponsive. You can wait for it to become responsive or exit the page.
One customer screenshot showed Chrome's "Page Unresponsive" dialog

Usually, this "Page Unresponsive" dialog indicates that we're running an infinite loop somewhere, or that the amount of work we're doing is exceeding CPU and memory limitations.

At this point, I spent some time following clues that didn't lead anywhere. Feel free to skip this section, but I'm including it in case anyone else (like me!) feels imposter syndrome for spending time on ideas that don't work out.

Red herring 1

I frantically googled terms like "react [feature] infinite loop crashing," hoping for a result that would tell me exactly what was happening. I found this GitHub issue, which looked promising: The browser crashes when use React.lazy return Promise.resolve(undefined) · Issue #15019 · facebook/react. React.lazy? LazyDateRangePicker? Those both have the word lazy! This must be it, I thought!

However, when I tried modifying our bundle loading code to return undefined, it didn't cause a "Page Unresponsive" dialog.

Red herring 2

One common "infinite loop" I've seen in React is a useEffect that sets state, but also has that state in its dependency array. I skimmed through some code looking for problematic useEffects but didn't find anything.

Looking back, this type of "infinite loop" doesn't trigger Chrome's "Page Unresponsive" dialog, and you'll instead see this error in the console:

Warning: Maximum update depth exceeded. This can happen when a component calls setState inside useEffect, but useEffect either doesn't have a dependency array, or one of the dependencies changes on every render.

or, Uncaught Error: Minified React error #185; in a production build.

Our support cases included screenshots showing no errors in the console. Thus, it probably would’ve been safe to rule out this type of “infinite loop” from the beginning.

Clue 3

The last clue I found was that a customer mentioned that exiting fullscreen on their browser window fixed the bug.

Based on this observation, it seemed like viewport measurements were somehow related. For example, maybe it had something to do with the popover height in relation to the viewport height? Monitor resolution and browser zoom level?

Putting it all together

To summarize, I was looking for something that…

  • was within Trello's modern Popover component
  • could cause an infinite loop, or a significant amount of work
  • involved viewport or element measurements

I started skimming through the code to see if anything stood out to me. I thought the bug was most likely to occur in the Trello-specific wrapper around @atlaskit/popper; otherwise, other Atlassian products would be running into this bug as well. In Trello's Popover.tsx, I found this bit of code, resizeHandler, that involved viewport and element measurements.

// Prevent Popover from becoming taller than the viewport
const resizeHandler = useCallback(() => {
  // ...

  const viewportHeight = viewportElement.clientHeight;
  const containerHeight = containerElement.getBoundingClientRect().height;
  const contentHeight = contentElement.getBoundingClientRect().height;

  // ...

  const newContentMaxHeight = availableHeight - extraPixels;

  return newContentMaxHeight;
}, [/* ... */]);

Could resizeHandler cause an infinite loop? I looked at how this function was used within the component.

First, where was resizeHandler called from?

useEffect(() => {
  const observer = new MutationObserver(() => {
    resizeHandler();
  });
  observer.observe(contentElement, {
    childList: true,
    subtree: true,
    attributes: true,
  });
  
  // ...
}, [/* ... */]);

Next, where was the newContentMaxHeight return value from resizeHandler used?

  <PopoverContent
    ...
    maxHeight={contentMaxHeight}
    // gets rendered as <div style="max-height: ${maxHeight}px">
  >

To recap:

  • When an attribute on PopoverContent’s div changes, the MutationObserver calls resizeHandler.
  • resizeHandler uses viewport measurements and getBoundingClientRect to calculate the maxHeight for PopoverContent.
  • If the maxHeight has changed, the component will re-render, changing the style attribute of the PopoverContent div.
    • This triggers the MutationObserver again, restarting this loop!

The only reason we were avoiding an infinite loop here? resizeHandler consistently returned the same value for maxHeight on subsequent calls, so the style attribute didn't change.
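
To make the loop concrete, here is a minimal, framework-free sketch of it. This is not Trello's code: countRenders, measure, and maxRenders are hypothetical names, measure stands in for resizeHandler's calculation, and the maxRenders cap exists only so the example terminates.

```typescript
// Models the observer -> resizeHandler -> re-render -> observer cycle.
// Returns how many re-renders happen before the value stops changing
// (or before the artificial cap is reached).
function countRenders(measure: () => number, maxRenders = 1000): number {
  let maxHeight: number | undefined = undefined; // value currently in the style attribute
  let renders = 0;

  while (renders < maxRenders) {
    const next = measure(); // the MutationObserver callback runs resizeHandler
    if (next === maxHeight) {
      break; // same value: no state change, so the cycle ends
    }
    maxHeight = next; // state change -> re-render -> style attribute mutates
    renders++; // ...which would trigger the observer again
  }

  return renders;
}

// A measurement that always returns the same value.
const stableRenders = countRenders(() => 480);

// A measurement that differs by a tiny amount on every call.
let drift = 0;
const noisyRenders = countRenders(() => 480 + (drift += 1e-7));
```

With the stable measurement, the loop settles after one render (stableRenders is 1). With the drifting measurement, it only stops at the cap (noisyRenders is 1000); in the browser there is no such cap.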

Reproducing a similar bug

The code felt fragile, relying on consistent maxHeight calculations. The team guessed that resizeHandler was related to the bug we were solving.

Our theory was that some device- and browser-specific behavior was causing resizeHandler to return inconsistent maxHeight values. That theory didn’t seem too far-fetched to me, especially since we were relying on === equality to compare measurements that could be floating-point numbers.

To simulate this theoretical behavior, I modified getBoundingClientRect in my browser. I pasted this snippet into the browser console, which causes getBoundingClientRect to add a random value between 0px and 0.0000001px to all element height measurements.

const proxyGetRect = Element.prototype.getBoundingClientRect;
    
// Overwrite getBoundingClientRect with our own implementation
Element.prototype.getBoundingClientRect = function() {
  // Call the native getBoundingClientRect function
  const rect = proxyGetRect.apply(this, arguments);

  // Add some randomness to the measurement
  rect.height += Math.random() * 0.0000001;

  return rect;
}

After I pasted that into the browser, I was able to crash my browser by opening up Trello's date picker. I got the same "Page Unresponsive" Chrome dialog that our customers were seeing.

With that code snippet, we could successfully reproduce an infinite loop caused by resizeHandler.
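
If you try this kind of console patch yourself, it can help to keep a way to undo it without reloading the page. The sketch below is one possible variant, not what the team ran: addMeasurementNoise, Measurable, Rect, and FakeElement are hypothetical names, and the example exercises a stand-in class so it runs outside a browser. In a browser console you would presumably pass Element.prototype instead.

```typescript
interface Rect {
  height: number;
}

interface Measurable {
  getBoundingClientRect(): Rect;
}

// Wraps getBoundingClientRect on the given prototype so every height
// measurement gets a small random offset. Returns a function that
// puts the original method back.
function addMeasurementNoise(
  proto: Measurable,
  maxNoise: number,
  random: () => number = Math.random,
): () => void {
  const original = proto.getBoundingClientRect;

  proto.getBoundingClientRect = function (this: Measurable): Rect {
    // Call the original implementation with the same receiver
    const rect = original.call(this);
    // Add some randomness to the measurement
    rect.height += random() * maxNoise;
    return rect;
  };

  return () => {
    proto.getBoundingClientRect = original;
  };
}

// Stand-in for a DOM element (assumption: used only for this example).
class FakeElement implements Measurable {
  getBoundingClientRect(): Rect {
    return { height: 100 };
  }
}

const undoNoise = addMeasurementNoise(FakeElement.prototype, 0.0000001);
const noisyHeight = new FakeElement().getBoundingClientRect().height; // 100 plus a tiny offset
undoNoise();
const restoredHeight = new FakeElement().getBoundingClientRect().height; // back to the unpatched value
```

Passing the random source as a parameter is only there to make the behavior easy to check with a fixed value.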

Was this bug the right bug?

While we could now create a similar bug to the one customers were seeing, we didn't know if it was the same bug customers were seeing. If we were wrong about the theory that some browsers were causing resizeHandler to behave inconsistently, we would be solving a “bug” that never actually occurred in practice.

Fortunately, Trello has a way for customers to test if a branch fixes their bug. A release admin can create a link that allows customers to try a specific build of the Trello web client. The customer will click the link and see this screen:

Screenshot of a web page that says: Does this version of Trello fix your issue? It includes a button that says: Test using a modified version of Trello.

With that in mind, we put together a solution for the resizeHandler infinite loop. It was a one-line change: we’d check that maxHeight changed by a threshold of at least 1px before re-rendering.
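
As a rough illustration of that kind of check (the actual change in Popover.tsx isn't shown here), a threshold comparison might look like the sketch below. shouldUpdateMaxHeight and MIN_CHANGE_PX are hypothetical names; only the 1px threshold comes from the description above.

```typescript
// Assumption: 1px, matching the threshold described in the post.
const MIN_CHANGE_PX = 1;

// Decides whether a newly measured max height differs enough from the
// currently applied one to be worth a state update and re-render.
function shouldUpdateMaxHeight(
  current: number | undefined,
  next: number,
  threshold: number = MIN_CHANGE_PX,
): boolean {
  if (current === undefined) {
    return true; // nothing applied yet, so always accept the first measurement
  }
  return Math.abs(next - current) >= threshold;
}

const ignoresNoise = shouldUpdateMaxHeight(480, 480.0000001); // false: sub-pixel jitter
const acceptsResize = shouldUpdateMaxHeight(480, 455); // true: a real change in height
```

Comparing against the value that is currently applied, rather than the previous measurement, means a series of tiny changes still triggers an update once they add up to a full pixel.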

We opened up one of the original support cases and sent the build link to that customer, asking if it fixed their issue.

Screenshot of a support ticket. A Trello engineer sent the customer the build link.

They quickly responded, saying that this build of the Trello client solved their issue!

We merged the solution, and after that, we stopped receiving support escalations related to the frozen date picker bug. Because we didn't release any other notable changes to Trello around the same time, we can be reasonably confident that our solution was what fixed our customers’ issue.