Have you ever returned to work in the morning and found a couple of Trello boards with this lovely cryptic red box?
"You have been disconnected from the Trello Server for too long." – wait, what?
Over the Christmas and New Year's break we deployed some changes to reduce how frequently this frustrating message appears. The red box occurs when the web client successfully reconnects to the Trello server, but fails to get the delta queue items it needs to bring itself back up to date.
When you perform an action on Trello, we broadcast a delta item describing that action to any other clients subscribed to that entity, allowing for the client to update in real-time. That same delta item is persisted to a delta queue, so that disconnected clients are able to catch up when they reconnect to the Trello server.
Previously this delta queue data was stored in Redis. Each queue was a list with a maximum of 100 items, which meant that if more than 100 delta queue items were appended while you were disconnected, you'd get this red box. The Redis cluster was also too small for the amount of data being stored, so we were constantly running up against the max memory limit and evicting entire delta queues to free up space.
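To make the failure mode concrete, here is a minimal in-memory sketch of a queue capped at 100 items. It is not our actual code and does not talk to Redis: the `Delta` shape, the `seq` sequence number, and the `BoundedDeltaQueue` class are all hypothetical names chosen for illustration. The trimming step is similar in spirit to an `RPUSH` followed by an `LTRIM` on a Redis list.

```typescript
// Hypothetical shape of a delta item; the real schema is not shown in this post.
interface Delta {
  seq: number; // assumed monotonically increasing sequence number
  action: string;
}

const MAX_QUEUE_LENGTH = 100; // the cap described above

class BoundedDeltaQueue {
  private items: Delta[] = [];

  // Append a delta, then drop the oldest entries beyond the cap.
  append(delta: Delta): void {
    this.items.push(delta);
    if (this.items.length > MAX_QUEUE_LENGTH) {
      this.items = this.items.slice(-MAX_QUEUE_LENGTH);
    }
  }

  // Return the deltas after lastSeenSeq, or null when the queue no longer
  // reaches back far enough (the red box case).
  catchUp(lastSeenSeq: number): Delta[] | null {
    if (this.items.length === 0) return [];
    const oldest = this.items[0].seq;
    if (oldest > lastSeenSeq + 1) return null; // some deltas were trimmed away
    return this.items.filter((d) => d.seq > lastSeenSeq);
  }
}
```

A client that missed 50 deltas gets them all back, while a client that missed more than 100 gets `null`, because the oldest deltas it needs have already been trimmed.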
We've moved the delta queue storage from Redis to a Mongo TTL collection. This means we keep each individual queue item for a week, and allow users to query up to 1000 items from their delta queue history. We can also raise that limit later if clients need to query more items. The collection only uses about 150GB of disk space, so we've got plenty of headroom for future growth. It has also been performing well for all of January, despite being the most write-heavy collection for Trello by a large margin.
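The new rules (a week of retention, up to 1000 items per query) can be sketched as below. This is an in-memory model under stated assumptions, not our implementation: in MongoDB the expiry itself comes from a TTL index created with the `expireAfterSeconds` option, whereas here it is simulated with a filter. The `StoredDelta` shape, the `seq` field, and the `catchUp` function are hypothetical, and the store is assumed to hold every item ever written so the model can tell when a needed item has expired.

```typescript
const ONE_WEEK_MS = 7 * 24 * 60 * 60 * 1000; // retention described above
const MAX_QUERY_ITEMS = 1000; // query limit described above

interface StoredDelta {
  seq: number; // hypothetical sequence number
  createdAt: number; // epoch milliseconds
}

type CatchUpResult =
  | { ok: true; deltas: StoredDelta[] }
  | { ok: false; reason: "expired" | "too-many" };

function catchUp(
  store: StoredDelta[], // model only: every item ever written, sorted by seq
  lastSeenSeq: number,
  now: number,
): CatchUpResult {
  const needed = store.filter((d) => d.seq > lastSeenSeq);
  // Simulate TTL expiry: items a week old or older are treated as deleted.
  const live = needed.filter((d) => now - d.createdAt < ONE_WEEK_MS);

  if (live.length < needed.length) return { ok: false, reason: "expired" };
  if (live.length > MAX_QUERY_ITEMS) return { ok: false, reason: "too-many" };
  return { ok: true, deltas: live };
}
```

Either failure branch corresponds to the red box; with the old 100-item cap the second branch would have been hit ten times sooner.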
We've also made some tweaks to the web client's reconnect logic that reduce how often you'll see the "Could not connect to Trello." red box when you return to your computer.
Note that while we track how often this type of red box appears, it's often transient. Either of two conditions triggers it: three failed reconnect attempts in the past five minutes, or ten failed reconnect attempts in a row. This means that seeing "Could not connect to Trello." doesn't necessarily mean we've given up on reconnecting.
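Those two conditions can be sketched as a small tracker. This is an illustration rather than the web client's actual code: the `ReconnectTracker` class and its method names are hypothetical, timestamps are passed in explicitly so the logic is easy to test, and it assumes a successful reconnect resets only the consecutive-failure count (the post doesn't say whether it also clears the five-minute window).

```typescript
const WINDOW_MS = 5 * 60 * 1000; // "the past five minutes"
const MAX_FAILURES_IN_WINDOW = 3;
const MAX_CONSECUTIVE_FAILURES = 10;

class ReconnectTracker {
  private failureTimes: number[] = [];
  private consecutiveFailures = 0;

  recordFailure(now: number): void {
    this.failureTimes.push(now);
    this.consecutiveFailures += 1;
  }

  // Assumption: success resets the streak but not the time window.
  recordSuccess(): void {
    this.consecutiveFailures = 0;
  }

  shouldShowRedBox(now: number): boolean {
    // Forget failures that have aged out of the five-minute window.
    this.failureTimes = this.failureTimes.filter((t) => now - t < WINDOW_MS);
    return (
      this.failureTimes.length >= MAX_FAILURES_IN_WINDOW ||
      this.consecutiveFailures >= MAX_CONSECUTIVE_FAILURES
    );
  }
}
```

Because old failures age out of the window, a red box raised by the first condition can clear on its own, which is why this type of red box is often transient.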
In December we also rolled out a change to try to avoid running into the three-in-five-minutes limit. If our attempt to reconnect fails, we check whether the machine is offline (using navigator.onLine as well as inferring based on how quickly the request fails) and don't bump the rate limits if we believe it is. This change reduced this (extremely common) type of red box by about 15%. It also reduced the less common "too many consecutive failures" case by about 30%. In conjunction with the delta queue changes, we've reduced all red boxes by about 30%.
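A rough sketch of that offline check follows. The names `FailedAttempt`, `looksOffline`, and `shouldCountFailure` are hypothetical, as is the 50 ms threshold, and the sketch assumes that a request failing almost immediately is a sign of a missing network connection. The browser's online flag is passed in as a plain boolean so the function can run outside a browser.

```typescript
const FAST_FAILURE_MS = 50; // hypothetical threshold, not a value from this post

interface FailedAttempt {
  browserOnline: boolean; // value of navigator.onLine when the attempt failed
  elapsedMs: number; // time from starting the request to its failure
}

// Assumption: the machine is probably offline if the browser says so,
// or if the request failed too quickly to have reached the network.
function looksOffline(attempt: FailedAttempt): boolean {
  return !attempt.browserOnline || attempt.elapsedMs < FAST_FAILURE_MS;
}

// Only failures that happen while we believe we are online count
// toward the reconnect limits.
function shouldCountFailure(attempt: FailedAttempt): boolean {
  return !looksOffline(attempt);
}
```

In a browser, `browserOnline` would be read from `navigator.onLine` at the moment the attempt fails, and `elapsedMs` could be measured with `performance.now()` around the request.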
We've got a few more changes left to make, but we're at the tail end of the red box reduction work. A great start to 2018!
- Note: The y-axis scale is different on each graph. Compare with caution.