This is a guest post from Jordan Munson, Support Engineer at Wistia.
What do you do when your software is experiencing a critical outage? Post an update to your status page, send out some updates via social, answer emails and calls that come in about it, etc. It all seems pretty obvious what to do in 2017, but for Wistia in 2013 things weren’t so clear. A handful of months into my tenure at Wistia, we faced what is still likely the biggest service outage in our company’s history. We were not ready, plain and simple.
The Wistia application is effectively three different, connected services: the application portal, the infrastructure that collects/creates stats, and the infrastructure that encodes and hosts our videos. In normal growing-pains fashion, we realized that we were going to outgrow a portion of the stats infrastructure in a few months. Not a big deal; that window was far out. We made the necessary changes to our infrastructure and moved along with the other projects on our plate. Fast forward a few months to one fateful Friday afternoon: one of the engineers gets a page stating that our stats database is no longer being written to. Uh oh.
As it turns out, we forgot something. A very specific, critical something. The result was our stats infrastructure grinding to a halt. Fortunately it’s a modularized system, so we were still collecting data, but we weren’t turning that data into things you could see in your account. Over the weekend the engineering team repaired the issue, but we still had a massive backlog of collected stats that needed to be processed. For the next two weeks, our stats were behind real time for customers. During this time we were almost all hands on deck covering the work in our support inbox as folks wrote in wondering what the deal was. We were completely swamped.
“I think we need a status page.”
We realized very quickly that having a status page would have made our lives much, much easier. We thought “We could just build one of those really fast, that should do just fine!” And so we did, and it was fine. We called it “Bugle” because it allowed us to sound the alarm (we’re big fans of clever naming schemes). The problem, however, was that this basically meant we were supporting a new product in addition to our normal work. This might not have been a huge deal for a larger company, but at the time Wistia was roughly 20 people. Supporting another product was simply not a cost we could afford.
After a handful of months of mild frustration with our nearly featureless, but helpful, homegrown solution, we decided we needed something more, something that didn’t require so much tending to. Enter StatusPage. Since the move to StatusPage, we’ve been able to do what we were looking to do all along: quickly and easily keeping our customers up to date on the status of our application. It only took one massive outage and building a new product to get there.
Fast forward a couple of years to modern times and our process looks way smoother. Folks get updates from us directly when there’s an outage, they know where to look for updates, and updates made to our Status Page push directly to a number of places (like Slack, for instance). We’re not impervious to outages here and there, though, especially considering how many services we rely on for critical parts of our business. One of these services is Amazon Web Services (AWS). Recently (the end of February 2017), AWS experienced some serious troubles in their east coast region, which just so happens to be the busiest region for us. We were experiencing some serious troubles as a result, creating a partial outage for a number of our customers. We faced a similar fate when our DNS provider was slammed with a Distributed Denial of Service attack. Even then, things went more or less okay (as well as you could expect considering our web portal was entirely inaccessible for most customers).
Over the last handful of years we’ve learned many lessons the hard way. Things can go wrong and they will go wrong, so you should probably be ready for that. Thank goodness we are now!
This post was originally published on the Statuspage blog in 2017.