Java debugging at scale: when rare events become commonplace

As an individual contributor at Atlassian, one aspect of my job is resolving bugs in production. This can be a challenging and enjoyable form of problem solving. I recently helped understand and fix an intermittent problem with Jira Cloud. It made for an interesting tale, not just because of the root cause we finally uncovered, but also because it highlighted a few aspects of this kind of problem solving:

  • The issues we encounter change with scale, and as we grow we need to find and fix new problems.
  • Problem solving can be a very iterative process: asking questions whose answers lead to new questions is part and parcel of it.
  • Sometimes our questions lead to no good answer, and we hit a dead end. This doesn't make the problem go away, but it does push us to question our assumptions and investigate different parts of the system.

A little probability backgrounder

Suppose I toss a fair coin – there is a 1 in 2 chance that you'll see a head. If I toss it twice, what are the odds of two heads? It's 1 in 4 (= 2 x 2). Three heads from three tosses? 1 in 8 (= 2 x 2 x 2). Thirty heads from 30 tosses in a row? 1 in 1,073,741,824 – about one in a billion. This seems pretty unlikely, right?

However, what if I do this about ten times per millisecond? That is about 10,000 times per second, which works out to about 10,000 x 60 x 60 x 24 = 864,000,000 times per day. At that rate, you'd expect to see a run of 30 heads about 0.8 times per day on average – about five or six times a week.
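
A quick back-of-the-envelope check of that arithmetic (a throwaway calculation, nothing to do with the Jira code base):

public class CoinStreaks {
    public static void main(String[] args) {
        double oddsOfThirtyHeads = 1.0 / Math.pow(2, 30);    // about 1 in 1.07 billion
        long trialsPerDay = 10_000L * 60 * 60 * 24;          // ten per millisecond, all day long
        double expectedPerDay = trialsPerDay * oddsOfThirtyHeads;
        System.out.printf("Expected 30-head runs per day:  %.2f%n", expectedPerDay);      // ~0.80
        System.out.printf("Expected 30-head runs per week: %.1f%n", expectedPerDay * 7);  // ~5.6
    }
}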

In other words, events that seem rare as a one-off aren't rare at a large enough scale.

What does this have to do with Jira?

As mentioned in the introduction, this story is about an intermittent problem with Jira Cloud. It happened about once a week, give or take – infrequently enough to be weird, but often enough to warrant investigation.

The symptoms were very odd. It started with a stack overflow. Once one of the nodes in the Jira cluster experienced this stack overflow, that node kept getting them in huge volumes – nearly 14,000 in a single minute, for example. The only way out was to kill the node. The stack, when we got it, had a repeating section that looked like:

at c.a.j.r.v.f.UserBeanFactory$$EnhancerBySpringCGLIB$$967b2ba5.buildUserBeans(<generated>)
at s.r.GeneratedMethodAccessor3885.invoke(Unknown Source)
at s.r.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at j.l.r.Method.invoke(Method.java:498)
at o.s.a.s.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:333)
at o.e.g.b.s.i.s.i.a.ServiceInvoker.doInvoke(ServiceInvoker.java:56)
at o.e.g.b.s.i.s.i.a.ServiceInvoker.invoke(ServiceInvoker.java:60)
at o.s.a.f.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:179)
at o.s.a.s.DelegatingIntroductionInterceptor.doProceed(DelegatingIntroductionInterceptor.java:133)
at o.s.a.s.DelegatingIntroductionInterceptor.invoke(DelegatingIntroductionInterceptor.java:121)
at o.s.a.f.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:179)
at o.e.g.b.s.u.i.a.ServiceTCCLInterceptor$1.run(ServiceTCCLInterceptor.java:61)
at o.e.g.b.u.i.PrivilegedUtils.executeWithCustomTCCL(PrivilegedUtils.java:130)
at o.e.g.b.s.u.i.a.ServiceTCCLInterceptor.invokePrivileged(ServiceTCCLInterceptor.java:58)
at o.e.g.b.s.u.i.a.ServiceTCCLInterceptor.invoke(ServiceTCCLInterceptor.java:50)
at o.s.a.f.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:179)
at o.e.g.b.s.i.s.LocalBundleContextAdvice.invoke(LocalBundleContextAdvice.java:57)
at o.s.a.f.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:179)
at o.s.a.s.DelegatingIntroductionInterceptor.doProceed(DelegatingIntroductionInterceptor.java:133)
at o.s.a.s.DelegatingIntroductionInterceptor.invoke(DelegatingIntroductionInterceptor.java:121)
at o.s.a.f.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:179)
at o.s.a.f.CglibAopProxy$FixedChainStaticTargetInterceptor.intercept(CglibAopProxy.java:620)
at c.a.j.r.v.f.UserBeanFactory$$EnhancerBySpringCGLIB$$967b2ba5.buildUserBeans(<generated>)

I've compressed the package names here. The first and last frames are from classes in a com.atlassian package; all the other frames are from open source or JDK code: sun.reflect, java.lang, org.springframework, and org.eclipse.gemini. So the only Atlassian code is in the first and last frames, and even then they're actually CGLIB-generated proxies – an implementation detail of crossing OSGi boundaries between components (called bundles) of the Jira monolith. If you inspect the code for UserBeanFactory.buildUserBeans, it doesn't call itself, so we had no explanation for the apparent recursion above.

These OSGi proxies are used to bridge class loaders between bundles. When one bundle wants to call a method (UserBeanFactory.buildUserBeans in this case) in another bundle, it calls a method on a proxy. This proxy locates the correct OSGi service in the other bundle and forwards the call with a bit of class loader special sauce.
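
In spirit, such a proxy does something like the sketch below. This is only an illustration of the idea – it uses a JDK dynamic proxy rather than CGLIB, skips the service tracking and caching the real Gemini Blueprint machinery does, and the class name is mine:

import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;

import org.osgi.framework.BundleContext;
import org.osgi.framework.ServiceReference;

public class CrossBundleProxy implements InvocationHandler {
    private final BundleContext bundleContext;
    private final Class<?> serviceType;

    private CrossBundleProxy(BundleContext bundleContext, Class<?> serviceType) {
        this.bundleContext = bundleContext;
        this.serviceType = serviceType;
    }

    @SuppressWarnings("unchecked")
    public static <T> T create(BundleContext bundleContext, Class<T> serviceType) {
        return (T) Proxy.newProxyInstance(serviceType.getClassLoader(),
                new Class<?>[] { serviceType },
                new CrossBundleProxy(bundleContext, serviceType));
    }

    @Override
    public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
        // Locate the backing service in the other bundle...
        ServiceReference<?> ref = bundleContext.getServiceReference(serviceType.getName());
        if (ref == null) {
            throw new IllegalStateException("No OSGi service registered for " + serviceType.getName());
        }
        Object target = bundleContext.getService(ref);
        // ...and invoke it with that bundle's class loader as the thread context class loader.
        ClassLoader original = Thread.currentThread().getContextClassLoader();
        try {
            Thread.currentThread().setContextClassLoader(target.getClass().getClassLoader());
            return method.invoke(target, args);
        } finally {
            Thread.currentThread().setContextClassLoader(original);
            bundleContext.ungetService(ref);
        }
    }
}

The important point is the lookup: the proxy is supposed to find the real service behind the interface, never itself.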

Or, at least, that's what it is supposed to do. In this case, what it actually does is locate itself, and call itself, which results in infinite recursion and a stack overflow. Of course, it didn't reproduce when I ran Jira locally, even when I ran it integrated with running staging services and with request URLs that mimicked those seen in the production logging. Whenever I tried, it correctly looked up the right underlying OSGi service.

After a bit of debugging of the happy path through the code, I had a theory that the symptoms could arise if the correct service had been removed, based on the way the code handled fallbacks when that service was missing. OSGi is a dynamic system: it allows services to be registered and unregistered at runtime, which in turn allows code to be upgraded at runtime. This is how we upgrade bundles in our server products, providing runtime upgrades for server plugins. In cloud, we don't upgrade bundles at runtime; rather, we redeploy the monolith to upgrade code. So this kind of unregistration should never happen in Jira Cloud – but perhaps our server legacy was resurfacing in some error path?

So we added logging

To investigate this, I added logging for any changes to OSGi services after startup. A fair bit of service wiring goes on as Jira starts up, but it should all stabilize before we start serving requests. In fact, we saw a little tail of additional registrations, either because the logging is turned on a bit early or because we are a bit lazy during startup. After some waiting, the problem happened again, and the logs contained:

OSGi ServiceEvent detected: type 4 service [com.atlassian.jira.rest.v2.factory.UserBeanFactory]

type 4 is an UNREGISTERING event, according to the constants in org.osgi.framework.ServiceEvent. However, we didn't know what was triggering the unregistration – I had neglected to include a stack trace. A coworker jumped in to add logging of the stack trace for this case. After a bit more waiting, we were rewarded with:

OSGi ServiceEvent detected: type UNREGISTERING service [com.atlassian.jira.rest.api.permission.PermissionSchemeBeansFactory]

Interestingly, in this case the stack overflow didn't occur, and the service wasn't the one we saw before. However, it was an unregistration, and it came with a stack, which led me to a code path of ours that implements request scoped beans. It iterates over a collection of destructionCallbacks to destroy request scoped beans at the end of the request. Now, neither UserBeanFactory nor PermissionSchemeBeansFactory is request scoped, so why were they being destroyed?
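
For reference, both rounds of logging were in the spirit of a plain OSGi ServiceListener like the one sketched below. The class and logger names here are illustrative, not the actual Jira diagnostics:

import java.util.Arrays;

import org.osgi.framework.BundleContext;
import org.osgi.framework.Constants;
import org.osgi.framework.ServiceEvent;
import org.osgi.framework.ServiceListener;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class LoggingServiceListener implements ServiceListener {
    private static final Logger log = LoggerFactory.getLogger(LoggingServiceListener.class);

    // Install this once startup wiring has settled, so only unexpected changes show up.
    public static void install(BundleContext bundleContext) {
        bundleContext.addServiceListener(new LoggingServiceListener());
    }

    @Override
    public void serviceChanged(ServiceEvent event) {
        String service = Arrays.toString(
                (String[]) event.getServiceReference().getProperty(Constants.OBJECTCLASS));
        if (event.getType() == ServiceEvent.UNREGISTERING) {
            // Attach an exception so the log captures who triggered the unregistration.
            log.warn("OSGi ServiceEvent detected: type UNREGISTERING service " + service,
                    new Exception("stack trace for service unregistration"));
        } else {
            log.info("OSGi ServiceEvent detected: type " + event.getType() + " service " + service);
        }
    }
}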

So we added logging again, this time to log calls to the code path responsible for registering the callbacks implicated in the stack above – that is, the code which added entries to the collection of destructionCallbacks. My reasoning was that somehow the wrong beans were getting into that collection, and this logging would flag them. We immediately saw in the logs the registration of a known request scoped bean dealing with URI context, so I was confident the logging was correctly flagging registrations.
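
To make the destruction callback machinery concrete, here is roughly what a request scope looks like in Spring terms. This is a hypothetical sketch, not Jira's actual implementation – the real code differs, but the shape of the destructionCallbacks collection and the end-of-request loop is the same idea:

import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

import org.springframework.beans.factory.ObjectFactory;
import org.springframework.beans.factory.config.Scope;

public class RequestScope implements Scope {
    private final ThreadLocal<Map<String, Object>> beans =
            ThreadLocal.withInitial(HashMap::new);
    private final ThreadLocal<Map<String, Runnable>> destructionCallbacks =
            ThreadLocal.withInitial(LinkedHashMap::new);

    @Override
    public Object get(String name, ObjectFactory<?> objectFactory) {
        return beans.get().computeIfAbsent(name, n -> objectFactory.getObject());
    }

    @Override
    public void registerDestructionCallback(String name, Runnable callback) {
        // This is the registration point we logged: only request scoped beans
        // should ever end up in this collection.
        destructionCallbacks.get().put(name, callback);
    }

    @Override
    public Object remove(String name) {
        destructionCallbacks.get().remove(name);
        return beans.get().remove(name);
    }

    @Override
    public Object resolveContextualObject(String key) {
        return null;
    }

    @Override
    public String getConversationId() {
        return null;
    }

    // Called at the end of each request: every registered callback runs. In Spring,
    // each callback is typically an adapter that notifies the destruction-aware
    // bean post-processors that the bean is going away.
    public void endRequest() {
        destructionCallbacks.get().values().forEach(Runnable::run);
        destructionCallbacks.remove();
        beans.remove();
    }
}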

Eventually, we saw an unexpected UNREGISTERING event again. Excited, I went back to find out how it had managed to get into the destructionCallbacks – and drew a blank. No logging. That’s right: we never registered the bean for destruction, and yet we had a stack trace calling the destruction logic.

This was quite surprising.

So we looked again

At this point, I knew something very odd was going on. The logging for the adding of the callback was only seeing expected calls, and yet the destruction logic was unregistering unexpected beans. How could this happen? It led me to examine the original destruction stack again – very closely. Somehow, I reasoned, the destruction logic must be getting called for the wrong bean. Something, somewhere, must have been mistaking destruction of the request scoped bean for destruction of the OSGi services.

Frame by frame I walked the code, keeping one thought at the forefront of my mind: "something must be getting the wires crossed". Finally, I found what I was looking for in Spring Scanner – a Spring extension we use to make it easier to write Atlassian plugins using annotations. The code in question was:

final int hashCode = System.identityHashCode(bean);
final OsgiServiceFactoryBean exporter = exporters.get(hashCode);
if (null != exporter) {
    exporter.destroy();
    exporters.remove(hashCode);
}

Here, exporters is a Hashtable<Integer, OsgiServiceFactoryBean> – it maps beans (via their identityHashCode – more on that in a moment) to the service factory that created them.

This code in Spring Scanner is called from a Spring DestructionAwareBeanPostProcessor implementation, specifically its postProcessBeforeDestruction method. The way this works is that when a bean is about to be destroyed, Spring notifies each such post-processor, allowing it to take action if needed. Spring Scanner uses this to remove OSGi services when plugins are uninstalled in Server.
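
The shape of that hook is roughly as follows. This is a simplified sketch rather than the actual Spring Scanner source, with the class name mine and the import path for OsgiServiceFactoryBean assumed from Gemini Blueprint:

import java.util.Hashtable;
import java.util.Map;

import org.eclipse.gemini.blueprint.service.exporter.support.OsgiServiceFactoryBean;
import org.springframework.beans.BeansException;
import org.springframework.beans.factory.config.DestructionAwareBeanPostProcessor;

public class OsgiExporterDestructionProcessor implements DestructionAwareBeanPostProcessor {
    // Maps beans, via their identityHashCode, to the factory that exported them as OSGi services.
    private final Map<Integer, OsgiServiceFactoryBean> exporters = new Hashtable<>();

    @Override
    public void postProcessBeforeDestruction(Object bean, String beanName) throws BeansException {
        // Spring calls this for every bean it destroys - including request scoped beans
        // at the end of a request. The identityHashCode lookup is the collision-prone step.
        final int hashCode = System.identityHashCode(bean);
        final OsgiServiceFactoryBean exporter = exporters.get(hashCode);
        if (null != exporter) {
            try {
                exporter.destroy();
            } catch (Exception e) {
                throw new IllegalStateException("Failed to destroy OSGi exporter for " + beanName, e);
            }
            exporters.remove(hashCode);
        }
    }

    @Override
    public boolean requiresDestruction(Object bean) {
        return true;
    }
}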

So what is identityHashCode? It returns the same result as Object's implementation of hashCode for the provided object. Why on earth is this code using that?
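
A quick demonstration of the difference (just a throwaway example):

public class IdentityHashCodeDemo {
    public static void main(String[] args) {
        // Two distinct but equal strings: hashCode (content-based) matches,
        // while identityHashCode (one value per object) almost certainly does not.
        String a = new String("hello");
        String b = new String("hello");
        System.out.println(a.hashCode() == b.hashCode());                             // true
        System.out.println(System.identityHashCode(a) == System.identityHashCode(b)); // false, in practice
    }
}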

A little bit of history

To understand identityHashCode, we need a little history. My understanding of the background is that early in its evolution, Java ran only on 32 bit machines. It didn’t use a copying garbage collector, and so objects occupied the same address in memory for their lifetime. In this world, there’s a nice cheap implementation of hashCode – you return the pointer to the object, as a 32 bit integer.

This is a great hashCode. It's stable – it doesn't change for the lifetime of the object. It's also unique for the lifetime of the object: different objects have different hash codes. Note that this isn't true for most hash codes – generally they lose information. However, in this special world, that doesn't happen. This meant that you could use hashCode to remember object identity without retaining the object, which gives a very convenient way to track objects you have seen without risking complications with garbage collection. In a world before weak references, this was very useful. So useful, in fact, that System.identityHashCode lets you get this value even if the class of the object provides its own hashCode implementation.

However, lots has changed since then. We now have copying garbage collectors, and so, in fact, you need to store the hash code on the object so that it is stable. We have better solutions (weak references) to garbage collection complications. However, most importantly for this story, we moved to 64 bit JVMs. When this was done, the API for identityHashCode was not changed. A 64 bit JVM can make more than 2^32 objects. Lots more. This means that identityHashCode can no longer be unique, no matter how you implement it.

When I learned this, it was called the pigeonhole principle. If you have more than N pigeons, and you need to put them in N pigeon holes, then at least two pigeons end up in the same pigeon hole. If we have more than 2³² objects, and we try to fit them into 2³² hash codes, then at least two objects end up with the same hash code.

As an aside, if you dig into the JVM source, my understanding is that only 31 bits get used, but the basic problem is the same.
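
And collisions are not just a theoretical concern. Because of the birthday paradox, with roughly 2³¹ possible values a collision among live objects typically shows up within the first hundred thousand or so allocations. Here's a small throwaway program that demonstrates this:

import java.util.HashMap;
import java.util.Map;

public class IdentityHashCollision {
    public static void main(String[] args) {
        // Keep every object alive via the map, so we are comparing live objects only.
        Map<Integer, Object> seen = new HashMap<>();
        for (int i = 0; ; i++) {
            Object o = new Object();
            Object previous = seen.put(System.identityHashCode(o), o);
            if (previous != null) {
                System.out.println("Collision after " + i + " objects: "
                        + previous + " and " + o + " share identityHashCode "
                        + System.identityHashCode(o));
                return;
            }
        }
    }
}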

So what is actually happening?

For many of the requests to Jira, we create a request scoped bean for handling URI context. This is the bean that I found when I added the last batch of logs. Every request that finishes destroys this bean. When this happens, we tell Spring Scanner that the bean is being destroyed. Spring Scanner checks if it is one of its OSGi beans by comparing the identityHashCode. If it matches, it destroys the OSGi service.

But wait – what are the odds of that? You need what is effectively a random Integer (the hash of the request scoped bean) to match that of an OSGi service. Two 32 bit integers lining up at random? That’s like flipping 32 coins in a row and seeing all heads. That’s a four billion to one (against) chance.

Jira processes a lot of requests. I did a quick search of the logs of our production Jira clusters on our Splunk log index, and found around ten million events in the default 15 minute window. Or 40 million requests an hour. Or nearly a billion requests a day. Our one in four billion odds should happen about once every four days.

Did I mention the bug was hitting about once a week?

Ok – so the numbers are a bit loose here. There are in fact a lot of OSGi services – thousands. There are lots of requests, but not every single request creates a request scoped bean. However, lots do, and so the orders of magnitude check out.

The final proof, of course, was to fix the root cause. A colleague fixed Spring Scanner, we rolled that out to cloud Jira, and the symptoms stopped. We also connected the root cause (incorrect deregistration of OSGi services) to other intermittent, hard-to-track-down bugs, and those were also rectified.

By the way, it's not just Atlassian that has this kind of code. It's also in current Spring code:

// From org.springframework.context.annotation.ConfigurationClassPostProcessor
@Override
public void postProcessBeanDefinitionRegistry(BeanDefinitionRegistry registry) {
    int registryId = System.identityHashCode(registry);
    if (this.registriesPostProcessed.contains(registryId)) {
        throw new IllegalStateException(
                "postProcessBeanDefinitionRegistry already called on this post-processor against " + registry);
    }
    if (this.factoriesPostProcessed.contains(registryId)) {
        throw new IllegalStateException(
                "postProcessBeanFactory already called on this post-processor against " + registry);
    }
    this.registriesPostProcessed.add(registryId);
    processConfigBeanDefinitions(registry);
}

However, for typical usages of Spring this path is probably traversed rarely – that is, only at startup and shutdown – and hence much less likely to cause problems there.

What are the takeaways?

As discussed above, this story took me down an interesting path of discovery and problem solving. I think there are a few things to learn from it, including some general observations about the nature of problem solving in a large and complex software system:

Takeaway 1

Rare events happen at scale. Our scale is increasing over time, and we should expect to see rare events more frequently as we scale up. We've got a long way further to scale, but we are already seeing one-in-a-billion events daily.

Takeaway 2

Debugging some things will take iterations of logging in production. Jira is large and complex, and we won't always be able to reason about its behavior. Sometimes we'll need to add code to determine what is actually happening in production.

Takeaway 3

Dead ends are also information. I was really surprised when the last round of logging came back empty-handed. I was pretty sure we were on the right track, and I really expected to see some weird corner case accidentally request scoping the wrong thing. However, this just meant backing up and re-examining what we knew.