At Opsgenie, our highest priorities are uptime and performance; our product’s very purpose is to enable our customers to keep their always-on services on – always. The Opsgenie team has achieved 99.999% uptime over the last 12 months, during which we enhanced our platform with new features and integrations and joined the Atlassian family. We define uptime as the success rate of a request sent to Opsgenie, and below we’re taking a closer look at the five key elements that make this possible.
We use Opsgenie internally throughout Atlassian, and we face similar, if not the same, challenges as our customers. While there are many best practices and approaches that may work for your company, these elements have helped Opsgenie to be successful and should not be seen as a one-size-fits-all solution.
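To make the request-based definition of uptime concrete, here is a minimal sketch of the arithmetic. It is an illustration only, not Opsgenie's actual tooling; the function names and the request counts are hypothetical. Exact fractions are used so that a target such as 99.999% is not distorted by floating-point rounding.

```python
from fractions import Fraction

# Target success rate in percent (five nines).
FIVE_NINES = Fraction("99.999")


def success_rate(successful: int, total: int) -> Fraction:
    """Uptime expressed as the percentage of requests that succeeded."""
    if total == 0:
        raise ValueError("no requests recorded")
    return Fraction(successful, total) * 100


def allowed_failures(total: int, target: Fraction = FIVE_NINES) -> int:
    """Largest number of failed requests that still meets the target."""
    return int(total * (100 - target) / 100)


# Hypothetical numbers, chosen only for illustration.
total_requests = 10_000_000
failed_requests = 80

rate = success_rate(total_requests - failed_requests, total_requests)
print(f"success rate: {float(rate)}%, meets target: {rate >= FIVE_NINES}")
print(f"failure budget for this volume: {allowed_failures(total_requests)}")
```

At five nines, only one request in every 100,000 may fail, which is why every element below is aimed at removing even brief, partial failures.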
1. Architecting for availability
In software development, architecture describes the design of components and their relationships. The design directly affects the maintainability, extensibility, and scalability of the product. Selecting an architectural model based on your team’s reliability goals makes five nines easier to achieve. For example, if you’re aiming for 99.999% reliability, some guidelines might include making applications highly available and preventing data loss.
One key approach is to split applications into buckets based on their purpose: one group for public-facing applications, one for business logic, and one for providing a service to clients. When Opsgenie was first founded, we split apps into two categories: public/client-facing apps and apps that completed heavy calculations in the background. We were also operating in two zones of Amazon’s Oregon region, which meant that all types of applications had one member alive in every zone during deployments. This ensured elasticity, redundancy, scalability, and availability.
Now, we break applications into even smaller services, known as microservices, which is a common approach in modern SaaS. The microservice architecture is a pattern that structures applications as a collection of loosely coupled services. Since everything is broken into smaller pieces with different owners, it is much easier to maintain, develop, and test applications and software products. Because each microservice has a dedicated owner and cross-functional team, teams can ship code faster.
2. Infrastructure and cross availability
Building an immutable infrastructure and providing cross-availability reduces the burden on your team if something breaks – they can fix it while avoiding the pressure of an all-out failure. Other benefits include predictable deployments, easier rollback, faster recovery, and room for experimentation.
We’ve built our infrastructure to run on Amazon Web Services and to be immutable. Each component can fail without impacting the whole system, no server is irreplaceable, and we don’t modify servers when we need updates; instead, we add new ones.
We also aren’t trying to reinvent the wheel. Our network, computing, messaging, and storage layers are built on mature AWS services, fully managed and serverless whenever possible. We leverage services like SQS, Kinesis, AWS Lambda, S3, and many others.
For each region where Opsgenie is deployed, the active traffic across all services is distributed over multiple availability zones. This design allows us to remove an availability zone at any time, so zone-specific failures lasting under a minute can be tolerated.
3. Proactive monitoring
Proactive monitoring enables teams to anticipate problems before they become incidents. The goal is to minimize impact and avoid a business crisis.
At Opsgenie, we track every single service call, including its duration, response, and automatic retries, and expose these as metrics. This metric collection is built into our applications through service adapters. We also monitor all network, computing, messaging, and storage services. Each unexpected exception generates an alert that is routed to on-call responders, day or night.
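One way to picture a service adapter that records the duration, outcome, and retry count of each call is sketched below. This is a minimal sketch under stated assumptions, not Opsgenie's implementation: the `ServiceAdapter` and `CallMetric` names, the in-memory metrics list, and the retry limit are all hypothetical, and a real system would ship the metrics to a monitoring backend.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List, TypeVar

T = TypeVar("T")


@dataclass
class CallMetric:
    service: str
    duration_seconds: float
    outcome: str   # "success", or the exception class name on failure
    retries: int   # number of automatic retries that were needed


@dataclass
class ServiceAdapter:
    """Wraps calls to one downstream service and records a metric per call."""
    service: str
    max_retries: int = 2  # assumption: a small fixed retry limit
    metrics: List[CallMetric] = field(default_factory=list)

    def call(self, operation: Callable[[], T]) -> T:
        start = time.monotonic()
        retries = 0
        while True:
            try:
                result = operation()
            except Exception as exc:
                if retries >= self.max_retries:
                    # Out of retries: record the failure, then re-raise.
                    self._record(start, type(exc).__name__, retries)
                    raise
                retries += 1
            else:
                self._record(start, "success", retries)
                return result

    def _record(self, start: float, outcome: str, retries: int) -> None:
        elapsed = time.monotonic() - start
        self.metrics.append(CallMetric(self.service, elapsed, outcome, retries))


# Usage: a hypothetical operation that fails once, then succeeds.
attempts = {"count": 0}


def flaky_send() -> str:
    attempts["count"] += 1
    if attempts["count"] == 1:
        raise ConnectionError("transient failure")
    return "ok"


adapter = ServiceAdapter("sms-gateway")
adapter.call(flaky_send)
print(adapter.metrics[0])
```

Because the measurement lives in the adapter rather than in each caller, every call through it is tracked without extra work from the calling code.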
Synthetic monitoring also helps us anticipate issues before they occur and minimize downtime. We test critical features against production around the clock. Finally, even when no errors are reported anywhere, we monitor production metrics to detect performance problems and anomalies.
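Detecting an anomaly in a metric that reports no errors can be as simple as comparing the latest value with its recent history. The sketch below uses a basic z-score check; this technique, the threshold of 3 standard deviations, and the sample latencies are assumptions chosen for illustration, not a description of how Opsgenie detects anomalies.

```python
from statistics import mean, stdev
from typing import Sequence


def is_anomalous(history: Sequence[float], latest: float,
                 threshold: float = 3.0) -> bool:
    """Flag `latest` if it lies more than `threshold` standard deviations
    away from the mean of the recent history."""
    if len(history) < 2:
        return False  # not enough data to estimate the spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # A perfectly flat history: any change at all is unusual.
        return latest != mu
    return abs(latest - mu) / sigma > threshold


# Hypothetical response times in milliseconds.
recent_latencies = [100.0, 102.0, 98.0, 101.0, 99.0]
print(is_anomalous(recent_latencies, 101.0))
print(is_anomalous(recent_latencies, 140.0))
```

A check like this fires on a slow drift in latency even though every individual request still succeeds.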
4. Quick recovery during failure
Every moment of failure is a moment when your customer is frustrated or missing out on the service they’re paying for. Providing your team with the tools to plan ahead for failure ensures that these scenarios are just a blip on the radar instead of a days-long, newsworthy outage.
Failure is inevitable, but when we fail, we consider it our duty to get back up and running in the least possible amount of time – we always have a plan in place. All non-AWS services Opsgenie relies on have backup channels, which can be activated at any time without any implementation changes. For example, if our SMS provider has a problem in a region, we can automatically or manually switch to backup providers, minimizing the impact on the customer.
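The backup-channel idea can be sketched as an ordered list of providers that is tried until one succeeds. This is a simplified illustration, not Opsgenie's code: the provider names, the `send_with_failover` function, and the stand-in provider functions are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

# A provider is a name plus a function that sends a message to a number.
Provider = Tuple[str, Callable[[str, str], None]]


class AllProvidersFailed(Exception):
    """Raised when the primary and every backup provider failed."""


def send_with_failover(providers: Sequence[Provider],
                       number: str, message: str) -> str:
    """Try each provider in order; return the name of the one that worked."""
    errors: List[str] = []
    for name, send in providers:
        try:
            send(number, message)
            return name
        except Exception as exc:
            # Remember the failure and fall through to the next provider.
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


# Usage with hypothetical stand-in providers.
delivered: List[str] = []


def primary_sms(number: str, message: str) -> None:
    raise ConnectionError("regional outage")


def backup_sms(number: str, message: str) -> None:
    delivered.append(f"{number}: {message}")


used = send_with_failover(
    [("primary", primary_sms), ("backup", backup_sms)],
    "+15550100", "Alert: database latency is high",
)
print(used)
```

Since the provider list is data rather than code, switching to a backup manually only means reordering it, which matches the goal of activating backups without implementation changes.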
Other recovery tactics include auto-remediation actions in the event of a hardware/software failure and automated operational Slack commands that allow us to respond faster. Additionally, our SRE teams write and study disaster recovery cases, then simulate those scenarios to practice their responses (also referred to as “chaos engineering”).
5. Operational prioritization
Creating a team culture with a clear set of priorities during (and in preventing) incidents means that, when problems occur, there’s a clear path forward: what to fix and how quickly it needs to be fixed.
While it’s paramount to employ best practices that help developers write clean and maintainable code, incident response is also key. In addition to proactive monitoring, we have engineers on call at all times to respond when critical problems occur. Our mean time to acknowledge (MTTA) is 45 seconds or less, 24 hours a day. This provides just enough time for engineers to act, yet also encourages them to fix issues that recur. Resolution of operational alerts is assigned the highest priority (even for false alarms) in our sprints.
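For readers unfamiliar with the metric, MTTA is the average delay between an alert being created and a responder acknowledging it. The sketch below shows the calculation on made-up timestamps; the function name and the sample data are hypothetical.

```python
from datetime import datetime, timedelta
from statistics import mean
from typing import Sequence, Tuple

# Each alert is a pair: (time created, time acknowledged).
Alert = Tuple[datetime, datetime]


def mean_time_to_acknowledge(alerts: Sequence[Alert]) -> timedelta:
    """Average delay between alert creation and acknowledgement."""
    if not alerts:
        raise ValueError("no acknowledged alerts to average")
    delays = [(acked - created).total_seconds() for created, acked in alerts]
    return timedelta(seconds=mean(delays))


# Hypothetical alerts acknowledged after 30 and 60 seconds.
t0 = datetime(2024, 1, 1, 3, 0, 0)
sample_alerts = [
    (t0, t0 + timedelta(seconds=30)),
    (t0 + timedelta(hours=1), t0 + timedelta(hours=1, seconds=60)),
]
mtta = mean_time_to_acknowledge(sample_alerts)
print(mtta.total_seconds())
```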
Keeping the lights on
While there are many more things that we and other companies do to deploy continuously and keep the lights on, these five elements – architecture, infrastructure, monitoring, recovery, and incident management processes – have been essential to achieving five nines at Opsgenie. No matter what you choose to do, every tactic or method has its benefits and drawbacks. Here, we’re focusing on what works for us, but we’d love to hear about what works for you!