For months, Charlie Arehart, Mike Brunt, the awesome folks at EdgeWeb Hosting, and I have been trying to track down an infuriating ColdFusion server issue. The issue was elusive and unpredictable. Sometimes it would happen in the middle of the night; sometimes it would happen in the middle of the day; sometimes it would happen every hour; sometimes it would go days without happening. But, when it did happen, the world would stop for a few seconds and our FusionReactor "web metrics" would look like this:
... and pretty much every stack trace in the subsequent FusionReactor Alert would be stuck in the JDBC pool, either trying to checkIn or checkOut a connection:
We tried everything! Upgrading the JVM. Downgrading the JVM. Patching ColdFusion. Checking all the network activity. Disabling services on the box(es). Tweaking the MySQL configuration. Tweaking the ColdFusion code. Swapping the ColdFusion licenses (Standard vs. Enterprise).
Nothing seemed to work!
Then, as the traffic on our servers increased, the issue appeared to become more predictable. I noticed that it seemed to be happening at the same minute of every hour. Around then, Mike Brunt extracted some statistics from the server that showed that the JVM was running an automatic, full Garbage Collection every hour. Could the two be related?
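I don't have the exact tooling Mike used to pull those statistics, but as a rough sketch, the JVM exposes per-collector counts and accumulated collection times through the standard `java.lang.management` API. The class name `GcSnapshot` is just something I made up for illustration; sampling it every minute or so and watching for the "old generation" collector's count to tick up on the hour would be one way to see the same pattern:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not from the original investigation): reads the
// collection counters that the running JVM exposes for each collector.
public class GcSnapshot {

    // Returns collector name -> { collection count, accumulated time in ms }.
    // The JVM reports -1 for a value that a collector does not track.
    public static Map<String, long[]> snapshot() {
        Map<String, long[]> stats = new LinkedHashMap<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            stats.put(gc.getName(), new long[] { gc.getCollectionCount(), gc.getCollectionTime() });
        }
        return stats;
    }

    public static void main(String[] args) {
        // Print one line per collector; run this periodically and compare.
        for (Map.Entry<String, long[]> entry : snapshot().entrySet()) {
            long[] values = entry.getValue();
            System.out.printf("%s: count=%d, timeMs=%d%n", entry.getKey(), values[0], values[1]);
        }
    }
}
```

The collector names vary by JVM version and by which garbage collector is configured, so the output has to be read with that in mind.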
To test, I manually requested a Garbage Collection in the FusionReactor "web metrics" dashboard. And, lo and behold, I was able to trigger the same graphs and the same kind of FusionReactor alert (complete with JDBC-oriented stack traces)!
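I don't know exactly what FusionReactor does under the hood when you click that button, but assuming it amounts to something like an explicit `System.gc()` request, a minimal sketch of the same experiment in plain Java might look like this (the class name `ManualGc` is hypothetical):

```java
// Hypothetical helper (not part of FusionReactor): requests a collection
// and reports how long the call took to return.
public class ManualGc {

    // Requests a garbage collection and returns the elapsed wall-clock time in ms.
    // System.gc() is only a request: the JVM may ignore it, and it does nothing
    // at all when the JVM is started with -XX:+DisableExplicitGC.
    public static long timeExplicitGcMillis() {
        long start = System.nanoTime();
        System.gc();
        return (System.nanoTime() - start) / 1_000_000L;
    }

    public static void main(String[] args) {
        System.out.println("System.gc() returned after " + timeExplicitGcMillis() + " ms");
    }
}
```

On a big, busy heap, that elapsed time is roughly the "world stops" window to compare against the graphs.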
Mike then disabled the automatic, hourly Garbage Collection on all of our web nodes. And so far, we have not gotten a single FusionReactor alert! It looks like the full Garbage Collection was, indeed, the culprit... at least, as best as we can tell so far.
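I'm not recording here exactly which setting Mike changed, so treat the following as an assumption on my part: one well-known source of an hourly, full collection is the JVM's RMI distributed garbage collector, whose `sun.rmi.dgc.client.gcInterval` and `sun.rmi.dgc.server.gcInterval` properties default to one hour (3,600,000 ms) on modern JVMs. Explicit collections can also be turned off wholesale with `-XX:+DisableExplicitGC`. This sketch (with a made-up class name, `GcFlagCheck`) just reports what the running JVM was actually started with:

```java
import java.lang.management.ManagementFactory;
import java.util.List;

// Hypothetical helper: reports JVM settings that commonly influence
// periodic or explicit full garbage collections.
public class GcFlagCheck {

    public static String report() {
        // Arguments passed to the JVM at startup (e.g. from a jvm.config file).
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        boolean explicitGcDisabled = jvmArgs.contains("-XX:+DisableExplicitGC");

        // These system properties are only present if they were set explicitly;
        // when absent, the JVM falls back to its built-in default interval.
        String clientInterval = System.getProperty("sun.rmi.dgc.client.gcInterval", "(not set)");
        String serverInterval = System.getProperty("sun.rmi.dgc.server.gcInterval", "(not set)");

        return "DisableExplicitGC=" + explicitGcDisabled
            + ", sun.rmi.dgc.client.gcInterval=" + clientInterval
            + ", sun.rmi.dgc.server.gcInterval=" + serverInterval;
    }

    public static void main(String[] args) {
        System.out.println(report());
    }
}
```

If your hourly collection comes from somewhere else entirely, these values won't explain it; they are just the first place I would look.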
This has been driving me crazy for months! Hopefully, this will help someone else.