In the process of digging around investigating a performance issue with our internal build agents, I discovered rogue fluxbox processes taking 100% CPU on a large number of our build agents, here’s a write up of how I found them.
On a particular host you can get a vague idea of how busy a host is by the load average, for example:
This doesn’t tell you much. How has the value changed over time? We use collectd to collect system statistics from all our Build Engineering systems (physical and virtual servers) and send this data to our Graphite instance. We can graph the host’s load over time using graphite, for example:
So we can see trending of this particular host over time. Not very exciting. But what happens if you want to compare the load across multiple instances. You could graph them all I guess:
Not very informative. You could graph all the load averages as separate graphs, The problem there is that the y-axis takes up way too much space to make comparison possible:
You could scale the y-axis, to fit more graphs in, but you lose clarity in each graph.
As you can see, when you have a large number of hosts / data series, you can quickly use the shades of the colour to visually determine the differences, patterns and hotspots. So a number of build agents really stuck out when they where different from the rest. This lead me to investigate into each host which was showing the sustained load – turns out that they all had a rogue process of fluxbox running. We’ve updated our purge script that runs before every build to kill this process to ensure that we don’t get hit by this again, if this problem happens more frequently, we’ll look at moving to a different windowing manager.
The cubism.js library also auto updates the graph, so you could use it for a wallboard display as well. Also using a different metric from graphite is trivial. We could enhance this visualisation to overlay information about the builds that were running at the time (both on the virtual machine that the build runs on and the physical host that the VM runs on). This will allow us to identify patterns of behaviour of the build agents & hardware based on what builds are running. In preparing for this blog post, I looked at todays data, and noticed a couple of hotspots:
Looking up what builds were running at the time (manually via the bamboo web UI), the hotspots were from one particular build that runs Jira, Confluence, Bamboo and Fisheye/Crucible to test a plugin. I would have not been able to find this level of information out without this type of visualisation (and in the process I found a good test case to use when performing stress testing of our build infrastructure!)