Downtime happens. That’s a fact, and it’s nearly impossible to predict. But there are some days when the chances of downtime are higher. Maybe the cause is higher-than-normal website traffic, or a surge in app sign-ups.
When planned high-traffic days are on the horizon, it’s a good idea to spend some extra time preparing for the worst. These planned events look a little different for every industry: universities may need to battle-proof their systems before class enrollment; non-profits might need to prep before the masses come to donate on Giving Tuesday; and e-commerce companies definitely need to be primed for the biggest online shopping day of the year, Cyber Monday.
We believe in the importance of having a solid incident response plan and keeping customers informed throughout the duration of the downtime. While we champion being prepared for downtime every day (downtime doesn’t take nights, weekends, or holidays off), we see these planned high-traffic days as a great forcing function to get your team to look at the processes and systems you have in place before something breaks.
At Atlassian, we’ve created a thorough incident management process that we’ve honed and battle-tested over the past 10+ years. Check out our tips and processes in this free handbook.
Here comes Cyber Monday
With Cyber Monday top of mind, we asked some of our customers to share tips, tricks, and rituals for preparing for these unique days. We hope these nuggets inspire your team to think about an incident strategy that works for you.
Chris Lamontagne, VP Commercial, Teespring
“Over the past 3 years Black Friday & Cyber Monday have increasingly become our busiest days of the year at Teespring,” says Lamontagne, “and with that in mind there comes a number of unexpected challenges that require total stability across the platform.”
We have heavily invested in our infrastructure to cope with 2x usual demand and developed a ‘peak playbook’ that allows for real-time audibles as we go through this period.
Max Rice, CEO, Jilt
For our customers, the Black Friday/Cyber Monday run is generally their biggest traffic period. That puts a lot of extra strain on our systems, too, since more orders at the shops that use Jilt means more automated emails get sent.
Rice noted a handful of steps they follow to make sure the strain on their systems doesn’t negatively affect their customers:
- Load test our infrastructure: We stress tested with 10x the normal maximum load we see in a “regular week” to uncover performance bottlenecks and fix them or find temporary workarounds that we could implement in the meantime.
- Implement a deployment freeze: We generally hold off on any major deploys during the entire holiday period, but during the Black Friday/Cyber Monday (BFCM) period we do zero deploys of any kind unless it’s specifically to fix a critical bug or issue.
- Plan for more on-call coverage from support & engineering teams: This was done weeks in advance, and each engineer was assigned to specific time slots so we have coverage all weekend. That way we’re closely monitoring performance and our support team can quickly identify and respond to any reported issues from our customers.
- Develop an incident checklist & communicate to customers: We have a plan in place for what to do in case of a reliability issue, starting with how we communicate the issue to our customers and who is responsible for what. Using Statuspage makes it really easy to do this.
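A deployment freeze like the one Rice describes is easy to enforce with a small gate in your release tooling. Here’s a minimal sketch of one way that could look. The function name `deploy_allowed` and the freeze dates are hypothetical and chosen for illustration; they are not Jilt’s actual tooling or schedule.

```python
from datetime import date

# Hypothetical freeze window (illustrative dates only, not Jilt's real schedule).
FREEZE_START = date(2024, 11, 28)
FREEZE_END = date(2024, 12, 3)


def deploy_allowed(today: date, is_critical_fix: bool) -> bool:
    """Return True if a deploy may go out on the given day."""
    in_freeze = FREEZE_START <= today <= FREEZE_END
    # Inside the freeze window, only critical bug fixes are allowed through.
    return is_critical_fix or not in_freeze


if __name__ == "__main__":
    # A routine deploy in the middle of the freeze is blocked.
    print(deploy_allowed(date(2024, 11, 29), is_critical_fix=False))
    # A critical fix on the same day is allowed.
    print(deploy_allowed(date(2024, 11, 29), is_critical_fix=True))
```

A check like this could run as the first step of a deploy script, so the freeze is applied by the pipeline rather than remembered by each engineer.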
Kevin Conroy, Chief Product Officer, GlobalGiving
It’s high traffic season at GlobalGiving. We’ve got all of our marketing campaign tickets lined up in Jira and linked over to our DevOps tickets to make sure that our servers are scaled up in time for each campaign launch. We’re making sure the whole team knows what’s happening when and how they can contribute.
Scott Baker, Vice President of Technical Operations, BigCommerce
Cyber 5 [the time encompassing Black Friday and Cyber Monday] is always an exciting week at BigCommerce; however, our preparations for this holiday season started back in April. Cyber 5 prep is something we do throughout the year, not just right before the holidays, to ensure our merchants can focus on maximizing their busiest selling season.
Baker noted that a main priority for their Site Reliability Engineering (SRE) team was to spread load across their data centers for better redundancy and scalability – all without affecting the merchants that depend on their platform. He also mentioned that they prepare for a variety of incident scenarios by conducting drills throughout the year. Finally, they follow a robust post-mortem process to learn and improve from every incident to ensure the same problem doesn’t happen twice.
As the quotes above show, it’s never too early to prepare for high-traffic days (incident management is a marathon, not a sprint), and it’s much better to be over-prepared than under-prepared for your biggest days of the year.
Want to dig in deeper?
We’ve compiled a list of additional resources to help make sure you’re ready to roll when downtime strikes:
- Get your sheet together: how to create an incident communication plan
- Four nines and beyond: A guide to high availability infrastructure
- How to not lose your s#!t during an incident
- Atlassian Team Playbook: Incident Response Game Plan