On a record day, 162,414 cars crossed the Golden Gate Bridge, and every day the bridge endures wind, rain, and countless other environmental factors (don’t say earthquakes!) that compound the stress on it.
An Atlassian instance is a lot like a bridge. Users access each instance hundreds or even thousands of times a day, amid conditions that include the size of the instance itself (with its load of comments, pages, repos, etc.), the apps you’ve added, API calls, and much more. When a bridge is built, it is tested to make sure it can support the expected load and usage. We perform similar tests as part of our development process to ensure our products can support the level of usage and load each of you needs.
But testing is only part of the picture. You also need to monitor your instance regularly over time so you can identify the adjustments needed to support changes in your environment, usage, and other factors. To do that, you need to know which metrics to track, create alarms for critical thresholds, and make plans for when you hit them. Do it right and your Data Center application will continue performing at an optimal level.
This advice is meant to help you choose the optimal infrastructure setup.
Gathering the right data: key metrics you should be monitoring
There are two types of metrics you need to monitor: usage and infrastructure. By understanding the difference between them, you can plan and implement monitoring that highlights any proactive actions required and makes sure the right people know about them ahead of time.
Tip: Identifying ownership for tracking these metrics is not cut and dried. In large organizations, the task of managing the Data Center application, its dependent components, and underlying infrastructure might be distributed across different roles.
Usage metrics are all about how your organization uses (or doesn’t use) your Data Center product(s), and they are a key component of instance management. Consider questions like:
- How many users do you have?
- How many users are active at a given time?
- How many pages do you have (in Confluence)?
- How many issues do you have (in Jira)?
It’s also important to quantify any customizations you have in place – for example, custom apps you’ve connected to your instance.
Usage metrics can be tracked through a product’s administrative user interface or database. Monitor how they change over time, and use these growth trends to predict how your load will look down the line. That way you can stay ahead of changing infrastructure requirements and keep your instance healthy.
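As a rough illustration of turning a growth trend into a prediction, the sketch below fits a straight line to a series of evenly spaced totals and extrapolates it forward. The function names and the sample numbers are hypothetical, and a straight-line fit is only one simple assumption; real growth may not be linear.

```python
def fit_line(totals):
    """Least-squares line through evenly spaced samples (x = 0, 1, 2, ...)."""
    n = len(totals)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(totals) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(totals))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def project(totals, periods_ahead):
    """Extrapolate the fitted line periods_ahead samples past the last one."""
    slope, intercept = fit_line(totals)
    return slope * (len(totals) - 1 + periods_ahead) + intercept


if __name__ == "__main__":
    # Hypothetical monthly issue totals, for illustration only.
    monthly_issue_totals = [100_000, 110_000, 120_000, 130_000]
    print(project(monthly_issue_totals, periods_ahead=2))
```

Comparing the projected total against what your current infrastructure was sized for gives you an early signal that it is time to revisit that sizing.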
Here are some fundamental usage metrics you should be tracking:
Most importantly, identify sudden spikes in any of these metrics. For example, if you see the total number of issues grow by 10% in the past 24 hours, you need to identify the root cause. Often the cause is a misbehaving app, a query, or some other unexpected behavior.
But what about all that unused data taking up room within your instance? This might be the most overlooked metric. Inactive users, unused configurations like custom fields, and abandoned projects all add up. Whether it means archiving old projects or better managing your custom fields, there are several strategies for keeping that unused data in check.
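A spike check like the 10% example can be sketched in a few lines. This is a minimal illustration that assumes you already collect one total per day; the `find_spikes` name, the sample labels, and the numbers are hypothetical.

```python
def find_spikes(daily_samples, threshold=0.10):
    """Return the labels of samples whose total grew by at least
    `threshold` (as a fraction) compared with the previous sample.

    daily_samples: list of (label, total) tuples in time order,
    assumed to be taken roughly 24 hours apart.
    """
    spikes = []
    for (_, previous), (label, current) in zip(daily_samples, daily_samples[1:]):
        if previous > 0 and (current - previous) / previous >= threshold:
            spikes.append(label)
    return spikes


if __name__ == "__main__":
    # Hypothetical daily issue totals, for illustration only.
    issue_totals = [("Mon", 1000), ("Tue", 1010), ("Wed", 1150)]
    print(find_spikes(issue_totals))
```

A flagged day is only a prompt to investigate. The check cannot tell you whether the cause is an app, a query, or legitimate growth.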
Once you have a sense of the basic usage metrics, it’s time to focus on monitoring your infrastructure. Every action your users perform on the product adds load, and the effect of that load can be compounded by how you’ve customized and configured your Data Center product. Infrastructure metrics reflect how your load affects the environment on which your Data Center product is hosted.
Infrastructure metrics generally require third-party tools to track, particularly if you’re tracking the performance of multiple nodes in a cluster (as opposed to just a single host). Here are some of the infrastructure metrics you should be tracking:
Trending and alerting: identify patterns and be prepared
Once you collect enough data about your organization’s usage and infrastructure load, seek out patterns that can help you identify:
- Peak/non-peak hours of usage
- Growth trends
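Peak and non-peak hours can be estimated by counting activity per hour of the day. The sketch below assumes you can export request or login timestamps from your access logs or monitoring tool; the function names and sample timestamps are hypothetical.

```python
from collections import Counter
from datetime import datetime


def activity_by_hour(timestamps):
    """Count events per hour of day (0-23) across all days."""
    return Counter(ts.hour for ts in timestamps)


def peak_hours(timestamps, top_n=3):
    """Return the top_n busiest hours of the day, busiest first."""
    return [hour for hour, _ in activity_by_hour(timestamps).most_common(top_n)]


if __name__ == "__main__":
    # Hypothetical request timestamps, for illustration only.
    sample = [
        datetime(2024, 1, 1, 9, 5),
        datetime(2024, 1, 1, 9, 30),
        datetime(2024, 1, 2, 9, 0),
        datetime(2024, 1, 1, 14, 0),
        datetime(2024, 1, 2, 14, 10),
        datetime(2024, 1, 1, 3, 0),
    ]
    print(peak_hours(sample, top_n=2))
```

Knowing the quiet hours also helps you schedule maintenance and heavy background jobs when they will affect the fewest users.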
Whenever possible, set alerts for metrics that need further investigation. For example, you might set an alert when the number of open database connections exceeds 50 over a 5-minute period, or when JVM heap space (as in, the amount of RAM used by Java) exceeds 80% of the available capacity. Identify thresholds and monitor them closely so that you’re always one step ahead of the problem.
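The two example alerts can be expressed as simple rules over collected samples. The sketch below is an illustration, not a feature of any monitoring product: the `Sample` type and function names are hypothetical, and it interprets "exceeds 50 over a 5-minute period" as "every sample in the last five minutes is above 50", which is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    seconds: int           # time of the sample, in seconds since collection started
    db_connections: int    # open database connections at that moment
    heap_used_mb: float    # JVM heap currently in use
    heap_max_mb: float     # JVM heap capacity


def db_connections_alert(samples, limit=50, window_seconds=300):
    """True when every sample in the most recent window exceeds `limit`.

    Assumes samples are in time order. Returns False until at least
    one full window of history has been collected.
    """
    if not samples:
        return False
    latest = samples[-1].seconds
    if latest - samples[0].seconds < window_seconds:
        return False  # not enough history yet
    window = [s for s in samples if latest - s.seconds <= window_seconds]
    return all(s.db_connections > limit for s in window)


def heap_alert(sample, fraction=0.80):
    """True when heap usage exceeds `fraction` of the available capacity."""
    return sample.heap_used_mb / sample.heap_max_mb > fraction


if __name__ == "__main__":
    # Hypothetical one-minute samples, for illustration only.
    history = [Sample(t, 55, 1700.0, 2000.0) for t in range(0, 301, 60)]
    print(db_connections_alert(history), heap_alert(history[-1]))
```

Requiring the condition to hold across the whole window, rather than on a single sample, keeps a momentary blip from paging anyone.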
Mapping an action plan
Once you have a monitoring strategy in place, it’s just as important to define an action plan that limits future incidents, no matter the severity. An incident postmortem is an excellent framework for learning from incidents and turning problems into progress, especially for the more severe or recurring ones. It can help you understand the timeline, the progression of events, and the opportunities to improve in the future. A few of these opportunities may include auditing your apps, implementing new governance initiatives, or establishing best practices or rules for your users around actions like JQL queries or custom field creation.
This isn’t cookie cutter
Every organization’s monitoring strategy is different. It can vary depending on your environment, goals and KPIs, and much more.
Tip: Need help determining what metrics really matter? Use our Goals, Signals and Measures play to help out with this. Starting with your North Star (e.g. achieving an Apdex of 0.7 by the end of the year), this play will help you define the right goals, signals and measures you need in place to successfully reach that North Star.
To help form your own monitoring strategies, we’ve published reference architectures that describe how we monitor some of our Data Center deployments. These reference architectures are based not just on customer data but also on our firsthand experience with our own instances. They were shaped by analysis of various incidents and trends we found in our own production servers. Use these references as a guide, not rules. You’ll need to develop a strategy that best suits your own environment and goals.
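For context on a target like an Apdex of 0.7: the standard Apdex score counts responses at or under a target time T as satisfied, responses between T and 4T as tolerating (worth half), and anything slower as frustrated. The sketch below computes it from raw response times. The 0.5-second threshold and the sample timings are assumptions for illustration; choose a T that reflects what your users consider responsive.

```python
def apdex(response_times, threshold):
    """Apdex = (satisfied + tolerating / 2) / total samples.

    satisfied:  response time <= threshold (T)
    tolerating: T < response time <= 4T
    frustrated: response time > 4T (contributes nothing)
    """
    if not response_times:
        raise ValueError("need at least one response time")
    satisfied = sum(1 for t in response_times if t <= threshold)
    tolerating = sum(1 for t in response_times if threshold < t <= 4 * threshold)
    return (satisfied + tolerating / 2) / len(response_times)


if __name__ == "__main__":
    # Hypothetical page response times in seconds, for illustration only.
    timings = [0.2, 0.4, 0.5, 1.0, 1.9, 3.0]
    print(round(apdex(timings, threshold=0.5), 3))
```

Tracking this score over time, rather than as a one-off reading, shows whether you are moving toward or away from the goal.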