Troubleshooting a Performance Problem Using NewRelic

Or, how to poorly implement adding css and javascript resources to a page

One of the tools we use at work is the MagnoliaCMS. It powers some portions of some of our properties. For those who have not heard of it (I had not until a few years ago), think of it as WordPress, but more enterprise-y.

Recently we noticed that some of the properties powered by Magnolia were responding quite slowly - to the tune of 1000-2000ms (that is, 1-2 seconds). Multiply that by 100-500 concurrent users, and you have the need for quite a bit of computing resources to serve up those pages.

Magnolia is built on top of Java, and being one of the core Java guys, I was asked to come see if I could help troubleshoot the issue. One of the tools at our disposal is NewRelic, and its Transaction Traces, Transaction View, and Deployment List were all instrumental at discovering the problem.

One of the affected pages
One of the affected pages

When looking at one of the affected pages, we saw that well over 90% of the transaction time was spent in freemarker.template.Template.process(). FreeMarker is a Java-based templating engine, which is the default rendering engine inside of Magnolia. We knew that the culprit was somewhere inside of Freemarker, but this view did not tell us much beyond that. We've added some of our own functions to FreeMarker, and Magnolia adds some of its own.

The first thing we did was to instrument all of the functions we have injected into FreeMarker. NewRelic lets you add instrumentation at runtime to Java applications, which will start to show up in the transaction view after about 5 minutes or so. It turns out none of our code was the source of the problem.

Next, one of the guys on the team looked at the Deployments tab. We record every deployment of our apps to NewRelic, and it will record average response times, requests per minute, and a few other metrics for the duration of that version being deployed.

We found a range of deployments where we went from about 249ms across all of our Magnolia properties, going up to about 440ms across a couple of deployments over the course of a couple of days. This gave us a narrow range of commits to our source code. We found in the commits that we were making use of a built-in function inside of Magnolia:

${resfn.cachedCss("/foo.css")}
${resfn.cachedJs("/foo.js")}

resfn, or Resource Function, will load a given CSS or JS file, determine its timestamp, and add that timestamp to the filename so that we can effectively leverage the browser cache, while supporting updating the CSS or JS files server side. The problem is not so apparent in the example above, however, the example from their documentation shows the first hint of a problem:

${resfn.css("/foobar.*css")}
${resfn.js("/foobar.*js")}
// or
${resfn.cachedCss("/foobar.*css")}
${resfn.cachedJs("/foobar.*js")}

We opened the JAR that contained resfn to see what it was doing, though it turns out that we did not have to because resfn was absorbed into Magnolia from an open-source library. The open-source version is here.

It turns out that these functions accept a regular-expression, and will scan the filesystem looking for every file that matches that pattern. It does not matter if a path is passed in, it has to recursively scan the entire portion of the filesystem that is accessible to the Magnolia template.

So, take a portion of a site that loads several CSS and JS files, multiply that by 100+ concurrent users, plus all of the source code for all portions of the site, and you end up with a process that is heavily disk I/O bound. Add to that the fact that the regular expression is not compiled, so it has to be re-evaluated for every single file encountered, and you end up burning a lot of CPU cycles.

Thankfully, the fix was rather easy. Many of the places were were using these calls could be replaced with something simpler.

You can see the results of the live instrumentation changes, as we moved from FreeMarker to ResFN to the callback function being passed into Magnolia's filesystem classes.
Then, you can see a few fixes being deployed.
Another Magnolia site - 1.5 second average down to ~110ms.

d00e41befce29299

Leave a Reply