What do you do when your Jira Software instance grows from 10 to 10,000+ users? This was the question that Brian Wallace and Mike Damman, Vice President and Knowledge Architect at Cerner, the leading U.S. supplier of healthcare information technology, needed to answer in order to meet the needs of their growing teams.
How do you guarantee reliability across such a large instance? What can be done to mitigate the effects of downtime? These were some of the questions I tackled with the Cerner team while we identified challenges and came up with some great solutions.
Challenge 1: scale Jira Software globally
Cerner had three federated instances of Jira Software Server with thousands of developers using each instance every hour of every day around the globe. Jira Software quickly became mission-critical and every minute of downtime or performance degradation made it more difficult for Cerner team members to support their customers. They needed a solution that provided high availability.
In the fall of 2015, Cerner chose to upgrade from Server to Data Center so that they could cluster multiple active servers and provide users with uninterrupted access to Jira Software. This wasn’t just critical at the time: knowing that they were going to add several thousand more users in the coming year, they needed a solution that would scale with them.
Challenge 2: risks to high availability
Using Zabbix and Splunk to monitor their Jira instances, Cerner was able to identify one area that needed to be addressed immediately if they wanted to provide true high availability: REST API abuse. Their log analysis showed that team members were using the REST API to get real-time status updates – so whether teams knew it or not, they were pinging Jira Software instances every single second. Cerner didn’t want to restrict users from creating custom dashboards or self-serving the data they needed, but it was obvious that they had to do something different.
“We wanted to be able to isolate REST calls to a single server so that it didn’t have an impact on other users,” Damman noted. With a multi-node cluster they could intelligently distribute traffic by dedicating one node solely to external REST API requests. Cerner also wanted to guarantee that all external requests went to this dedicated node because having users manually change the domain to an IP address, or another domain, wasn’t reliable. That’s when they reached out to me, their Technical Account Manager, to help them come up with a better solution.
External integrations (robots) differ from human interactions in both their impact and their rate of growth. They can add stress to individual nodes in the cluster, bringing a node down for everyone else using it.
Solution: intelligently route traffic
Cerner needed the Data Center configuration to ensure all external REST API requests were routed away from other traffic. They planned to have four nodes in their cluster behind a load balancer, with the nodes serving the following roles:
- Node 1 – External REST API node
- Nodes 2 & 3 – Normal usage nodes
- Node 4 – Admin and power user node; not in the load balancer and only accessible by IP address
I originally thought our best option would be to use the load balancer to route all requests with ‘/rest’ to the REST API node. However, we found the REST API was also being used throughout Jira Software, including the login page, so matching on ‘/rest’ alone would mean we would still be mixing REST API traffic with normal usage. We needed a better solution.
Working with some of my colleagues, we found we could isolate REST API requests by looking for ‘/rest’ in each request AND by looking at where the request originated using the HTTP Referer header. If a person was trying to log in to Jira Software or was already using Jira Software, they would get directed to or remain on a Normal Usage node. Otherwise, if the person or bot was requesting the REST API, they would get directed to the REST API node.
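To make the routing rule concrete, here is a minimal Python sketch of the decision described above. It is an illustration only, not Cerner’s actual load balancer configuration: the host name `jira.example.com`, the pool names, and the helper functions are all hypothetical, and it assumes Jira Software is served from the root path so that REST requests begin with ‘/rest’.

```python
from typing import Optional
from urllib.parse import urlsplit

# Hypothetical values for illustration; not taken from Cerner's setup.
JIRA_HOST = "jira.example.com"
REST_API_POOL = "rest-api-node"        # Node 1 in the list above
NORMAL_POOL = "normal-usage-nodes"     # Nodes 2 & 3 in the list above


def came_from_jira(referer: Optional[str], jira_host: str = JIRA_HOST) -> bool:
    """Return True if the Referer header points back at Jira itself."""
    if not referer:
        return False
    try:
        host = urlsplit(referer).hostname
    except ValueError:
        # Malformed Referer value: treat it as not coming from Jira.
        return False
    return host is not None and host.lower() == jira_host.lower()


def choose_pool(path: str, referer: Optional[str] = None,
                jira_host: str = JIRA_HOST) -> str:
    """Pick a backend pool from the request path and Referer header.

    Assumed rule: a '/rest' request that did NOT originate from a Jira
    page is treated as external and goes to the REST API node;
    everything else stays on the normal usage nodes.
    """
    is_rest = path == "/rest" or path.startswith("/rest/")
    if is_rest and not came_from_jira(referer, jira_host):
        return REST_API_POOL
    return NORMAL_POOL


if __name__ == "__main__":
    # A script polling the REST API sends no Jira Referer.
    print(choose_pool("/rest/api/2/search"))
    # A browser on a Jira page calling the same endpoint.
    print(choose_pool("/rest/api/2/search",
                      "https://jira.example.com/secure/Dashboard.jspa"))
```

In practice a rule like this would live in the load balancer itself rather than in application code. Because the Referer header is supplied by the client, it works as a traffic-shaping signal rather than as a security control.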
We proposed the solution to Cerner and after a few rounds of testing, they went live with Jira Software Data Center in October 2015.
The results: performance at scale
Within their first week of implementing the proposed Data Center configuration, Cerner was seeing four times as much traffic on the REST API node as on the other two nodes. Response times are faster, CPU utilization has decreased across their non-admin nodes compared to a single server instance, and they haven’t seen a single unplanned outage in 2016, all while scaling Jira Software to thousands of new users.
Cerner needed to make sure that application response times held steady or improved as they continued to add users. With this re-architecture, Cerner reduced their response time by nearly half, from 150ms to 80ms (about 47%). Even at peak traffic times – looking at page loads specifically – response times remained steady.
To get the full story about how Cerner scaled Jira Software while improving response times and reducing CPU, hear from Brian and Mike themselves in our on-demand webinar.