Random Mortal Combat quote:

Sonya Blade: A handful of people on a leaky boat are gonna save the world?
Lord Rayden: Exactly.

Another month and more memory problems for one of our products.

“Let mortal combat begin”

The Jira 1.5 JVM (and more recently 1.6 as well) functional tests started running out of memory last week. The first step was to get all our builds to dump out the heap if they run out of memory.

This was achieved by adding the following flags:

-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/home/j2ee/heapdumps

One thing to note is that this is functionality was added to Java 1.6, but it has been backported to JDK 1.5_07 and JDK 1.4_12 (I think). It wont work with earlier versions of 1.5 and 1.4, or non-Sun JVMs.
An initial investigation of the heap dumps with YourKit showed that about 138MB of the 256MB allocated where used up by the JspRuntimeContext. Tomcat bug? Possible, so I fired off an e-mail to the tomcat-users list, which eventually received the following reply:

You appear to be running Tomcat in development mode.
This means that source of every JSP is held in memory to provide
detailed messages in the event of an error. If you have large JSPs
this will hurt. It looks like this accounts for 50MB+
The compiled JSPs account for 80MB+. I am guess there is a lot of text
in some of these JSPs.
This doesn’t explain why you don’t see the error on a 1.6 jvm. Are you
sure you are still using 256MB of RAM with the 1.6 jvm?
27% of your memory is used by char[]. Have you read
http://issues.apache.org/bugzilla/show_bug.cgi?id=37793 ? It might
offer some insight.

All sounds quite reasonable as we start tomcat up with the default options (i.e. development mode) and we had already been aware of the bug that was pointed out (and had dismissed it as the main cause). It looks like Jira has simply gotten big enough that we’ve finally hit a certain threshold where 256MB of memory isn’t enough to run the func tests. That’s not to say however that there are no memory leaks in Jira.

“You must first face your fears if you are to conquer them.”

While waiting for replies from the list, we’d already started to focus in on another problem. After running the func tests, there were about 18 instances of the VelocityEngine hanging around in memory holding on to a total of about 90MBs. As it turns out we were leaking instances via a bug in Webwork.
When an action or tag gets called, Webwork basically copies a threadlocal ‘ActionContext’ into a local Map (called ‘oldContextTable’) of the GenericDispatcher. Usually something like the ActionTag will create a new GenericDispatcher and call its prepareContext() method to do this. Once the ActionTag is finished, it will call finalizeContext(), which copies the oldContext back into the threadlocal context maintained by Webwork.
Originally we thought that once the tag was finished, it would release the GenericDispatcher and everything in the oldContextTable would get garbage collected. Unfortunately this isn’t so, and 14 of the 18 VelocityEngine instances were being held on to by GenericDispatchers, referenced by various Tags still referenced via the JspRuntimeContext.
Our initial solution was to cleanup the ActionContext at the end of one of the ServletFilters in Jira like so:

//JRA-12524: Properly cleaning up the ActionContext, to avoid leaking anything into the threadlocal context forever.
Map newContext = null;
if (ActionContext.getContext() != null)
newContext = ActionContext.getContext().getTable();
if (newContext != null)
final String requestString = getRequestString(servletRequest);
log.info("ActionContext was not properly reset at the end of the request [" + requestString + "]. Resetting the context to avoid leaking memory. See JRA-12524");
//reset the context.
ActionContext.setContext(new ActionContext());

This didn’t work for 2 reasons:
1. It should have been obvious from the memory dump that what was actually holding on to the VelocityEngine (via a couple of other references) was the instance of the oldContextTable in the GenericDispatcher, and not the threadlocal ActionContext I’m clearing in the code above.
2. I broke our Orion func tests, since I applied the ‘fix’ above in the wrong place in the filter. Orion and Weblogic apply filters even for included jsps and I was therefore clearing the context too early resulting in NullPointerExceptions in the parent JSP. Perhaps I should start reading comments in the code such as:

// ensure the action context cleanup only happens the last time.
if (servletRequest.getAttribute(FILTER_ALREADY_RUN_KEY) != null)
filterChain.doFilter(servletRequest, servletResponse);
//here's where the cleanup code should go
//not here which is where I put it.

“Flawless victory”

The solution in the end was to patch up the GenericDispatcher, such that it nulls the reference to the oldContextTable, once the finalizeContext() method gets called.
There’s still a few more leaks we need to fix. The number of VelocityEngine instances was reduced to about 8-10 however, and the total memory retained is now about 45 MB.

4 more days in a leaky boat