A complete, production-safe guide to installing yCrash and configuring the yc-360 collection pipeline.
It was a Tuesday night at 2 AM when my phone began buzzing repeatedly. Our production Java app was in deep trouble. Memory was running at 98%, response time had jumped from 150ms to over 5 seconds, and customers were inundating our support channels. I figured this would be easy: restart the app, check a few logs, done. Instead, it devolved into three, yeah, three days of debugging hell. The frustrating part? The whole time we had yCrash installed. We simply never set it up correctly.
The slow-rolling production meltdown began when memory usage climbed from 60% to 95%. The box was deep into swap even after we killed and restarted the process. Users started filing tickets: “I think the production web servers are taking a very long time to respond to my request, I will refresh and try again.”
By Monday afternoon, our monitoring showed memory creeping up from its usual 60% to around 75%. “Probably just cached data,” said someone in Slack. Nobody was particularly concerned. Tuesday morning, memory hit 95% and the app had come to a standstill. GC pauses were increasing, users were squawking, and management wanted answers.
We knew it was a memory leak. It was just a matter of finding it.
Like any team in crisis mode, we began with what we did know. I generated a heap dump with jmap; at nearly 6 GB, it took forever to download over the VPN. I opened it in Eclipse MAT, and my laptop fan began flipping out. Twenty minutes later MAT finally loaded the dump, but analyzing it was like looking for a needle in a haystack: millions of objects, a gazillion references, and by hour three everything looked suspicious.
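For reference, the manual capture looked roughly like this. The PID and file paths below are placeholders, and jps/jmap ship with the JDK:

```shell
# List running JVMs to find the target process id (placeholder: 12345)
jps -l

# Capture a heap dump of live objects in HPROF binary format.
# On a large heap this file can run to several gigabytes.
jmap -dump:live,format=b,file=/tmp/heap.hprof 12345

# Compress before pulling it over a slow VPN link
gzip /tmp/heap.hprof
```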
Meanwhile, the application kept deteriorating. We rebooted around 10 AM to try and buy some time, but we were back in the same predicament by 3 PM. My manager would ask for progress reports every 30 minutes, and my response was a meager “still researching.”
Manual Thread Dump Analysis vs Automated Root Cause Detection
By Wednesday, two other senior developers had joined me. We did everything we could: took heap dumps at various times, grabbed thread dumps hourly, turned on verbose GC logging (which we really should have had already), and went through recent code changes line by line.
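For what it’s worth, the verbose GC logging we belatedly enabled amounts to one JVM flag. A sketch, with the log path and app jar as placeholders; the flag syntax differs between JDK 8 and JDK 9+:

```shell
# JDK 9+ unified logging: timestamps plus rotation across 5 files of 20 MB
GC_OPTS='-Xlog:gc*:file=/var/log/myapp/gc.log:time,uptime,level,tags:filecount=5,filesize=20m'

# JDK 8 equivalent (uncomment if you are still on 8):
# GC_OPTS='-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/myapp/gc.log'

java $GC_OPTS -jar myapp.jar
```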
Here’s the thing: we had yCrash. It had been installed six months ago after a demo, because someone thought it looked useful. We had even bought the Enterprise Edition and received a license file for it. But once it was installed, no one actually used it. It just sat and accumulated digital dust.
I remembered we had some script named yc-360, and went looking for it on our servers. Found it hidden in some directory. We tried running it manually. Nothing happened. No errors reported either, for that matter, just silence.
Well, guess what (and yes, I only figured this out later): we had never set up the API key. The yc-360 script requires an API key to transmit the artifacts it collects to the yCrash server, and it needs that connection even if you would rather not send telemetry. Without the key, it has nowhere to send anything, so it fails silently. We had installed the script and then never thought about it again.
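To make the omission concrete, a one-off yc-360 run wired to a server looks roughly like this. The flag names are modeled on the yCrash documentation, and every value below (key, server, app name, PID) is a placeholder, so check the docs for your version:

```shell
# Illustrative only - all values are placeholders, substitute your own.
API_KEY='buggycompany@e094a34e-c3eb'     # the piece we had never configured
SERVER='https://ycrash.example.internal' # your yCrash server endpoint
PID=12345                                # process id of the target JVM

./yc -j "$JAVA_HOME" -p "$PID" -k "$API_KEY" -s "$SERVER" -a checkout-service -m ondemand
```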
So there we were with a tool that could have flagged the problem within minutes and had it fixed within hours, rendered completely worthless because we had skipped the basics and were now down in the mud.
We kept analyzing heap dumps manually. Three people, three different theories. One dev blamed our event listeners. Another blamed the database connection pool. I was sure it was the caching layer. We could each have been partly right, but we had no way of knowing who was.
Why 16 Artifacts Trump 3: What’s Missing From Our Diagnostic Puzzle
Thursday morning, I was done. We’d tried adding heap space (it only postponed the problem), changing GC algorithms (no real difference), and examining and re-examining every code change from the last month (we found a few leaks to plug, but nothing that explained this one).
Our CTO called a meeting. “We have to have this worked out today,” he said. Thanks for the pressure, boss.
And then Sarah from DevOps asked the obvious question: “Why don’t we use yCrash? We paid for it.”
Awkward silence.
“We tried,” I said weakly. “The script didn’t work.”
Sarah brought up the yCrash documentation, and within five minutes had pointed out everything we’d done wrong. Which was basically EVERYTHING.
yCrash Server Setup: Getting Port 8080 Right (Or, in Our Case, 8081)
Sarah took over, starting from the ground up. First question: was the yCrash server even running? Nope. It had been installed but never started, because someone hit an error on the very first run and gave up.
She started fresh. Created a proper directory:
mkdir -p /opt/workspace/yc
cd /opt/workspace/yc
Downloaded the most recent release (we were using some ancient build from six months ago):
wget https://tier1app.com/dist/ycrash/yc-latest.zip -O yc-latest.zip
unzip yc-latest.zip
And here’s where it got interesting. The license file we received with our registration? Nobody knew where it was. (It took us 20 minutes to dig it out of somebody’s old email.) Sarah copied it into the right spot:
cp ~/Downloads/license.lic .
Starting the server should have been easy, right? Just run the script. But when Sarah tried it, nothing happened. The server wouldn’t start.
Another 30 minutes of debugging. It turned out port 8080 was already taken on our server, so we had to change the start script to listen on port 8081. Dumb, but it cost us time.
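Standard Linux tooling would have shown the conflict in seconds; nothing here is yCrash-specific:

```shell
# Show the process listening on TCP 8080, if any (ss comes with iproute2)
ss -ltnp | grep ':8080'

# Alternative using lsof
lsof -iTCP:8080 -sTCP:LISTEN
```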
Finally:
./start-yc-server.sh
It worked. The server was up at localhost:8081. The first thing all week that had gone right.
Configuring yc-360 Script: On-Demand to Continuous Monitoring
The yc-360 script is designed to collect just about every artifact your application generates – GC logs, thread dumps, heap dumps, system stats and so on – and send it all to the yCrash server for analysis. We had the script, but we had never set it up properly. No API key, no settings, nada.
Sarah did it right this time: set the API key, pointed the script at our yCrash server, and enabled collection of all 16 artifact types yCrash can analyze. We hadn’t even realized there were 16 different artifacts it could collect.
The documentation describes an “on-demand mode” and a “continuous mode”. We had been attempting on-demand mode, i.e., a one-off collection. That’s fine for testing, but in production you want continuous mode, which automatically collects artifacts on a recurring cycle.
We should have had this up and running from day one.
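A continuous-mode launch, as a sketch: same placeholders as before, with flag names modeled on the yCrash docs for the version we ran, so verify against yours:

```shell
# Keep yc-360 collecting on a recurring cycle, surviving logout.
# API_KEY, SERVER and the app name are placeholders.
nohup ./yc -j "$JAVA_HOME" -k "$API_KEY" -s "$SERVER" \
      -a checkout-service -m continuous > yc.log 2>&1 &
```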
Machine Learning Enabled Analysis: How yCrash Correlated What We Couldn’t
Once everything was properly set up, we began gathering data. Within an hour, yCrash had more data on our app than we had managed to gather manually over the previous three days.
The analysis report came in, and frankly it was humbling. Remember our three leak theories? Each of us was partly right, but we had all missed the full picture.
The root of the problem was our event listener system. We had a custom event bus where listeners were registered but not always removed properly. They built up over time, especially during peak traffic. Every one of them held a reference to larger object graphs, and together they eventually devoured our memory.
The reason we couldn’t find it manually: the leak was slow enough that no individual heap dump showed a clear “problem”. The leaked objects were buried among a lot of small allocations. The pattern only emerges when you correlate heap dumps with GC logs over time, and that is exactly what yCrash detected using its machine learning algorithms.
The report gave us the full identification chain: the leak pattern, when it started, which classes and methods were responsible, and the rate of growth. This would have taken us days to figure out by hand, if we ever figured it out at all.
GC Log Correlation + Thread Dump Analysis + Heap Dump Insights = Root Cause
What was really powerful about yCrash’s consolidated analysis is that it didn’t stop at the individual artifacts. Examining the heap dump alone told us there was memory growth. Analyzing GC logs independently pointed to high GC pressure. Viewing thread dumps alone showed that certain threads were unusually busy.
But yCrash combined all three and showed us that those threads were creating event listeners, that the listeners were accumulating in specific data structures (evident from the heap dump), and that the GC logs showed those objects surviving multiple collections because references to them were being kept around.
That’s a three-dimensional view none of us had had before. We had been examining the artifacts one by one, trying to reconstruct the story by hand. yCrash’s machine learning algorithms did that correlation automatically and instantly.
Now that the root cause was known, the fix was straightforward. We identified every listener registration point, added proper cleanup in both finally blocks and lifecycle methods, added monitoring to track listener counts going forward, and shipped the fix.
Within a few hours, memory utilization stabilized, GC pauses dropped, and response times returned to normal. Three days of hell, solved in a couple of hours once everything was configured correctly.
Lessons Learned: Collect ALL Artifacts Out of the Gate
Don’t make our mistakes. Here’s what we learned:
First, collect all 16 artifacts. Don’t pick and choose: yCrash’s strength lies in correlating multiple data sources. The 16 artifacts include GC logs, thread dumps, heap dumps, top output, netstat, vmstat, iostat, application logs, kernel logs, and more. We had been manually collecting just a couple of artifacts and wondering why we couldn’t find the problem.
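To give a feel for what yc-360 automates, here is a rough manual sketch that grabs a few of the system-level artifacts from that list into one directory. Tool availability varies by machine, so missing tools are skipped rather than failing, and the jstack line is left commented because it needs a real JVM PID:

```shell
# Collect a handful of system-level diagnostic artifacts into one directory.
collect_diag() {
    out="$1"
    mkdir -p "$out"
    top -b -n 1  > "$out/top.txt"     2>/dev/null || true
    vmstat 1 2   > "$out/vmstat.txt"  2>/dev/null || true
    iostat 1 2   > "$out/iostat.txt"  2>/dev/null || true
    netstat -an  > "$out/netstat.txt" 2>/dev/null || true
    # Thread dump for a target JVM (needs a real PID):
    # jstack "$PID" > "$out/threaddump.txt"
    echo "artifacts written to $out"
}

collect_diag "/tmp/diag-$(date +%Y%m%d-%H%M%S)"
```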
Second, set everything up properly from the outset. Get the API key into your yc-360 script, run continuous mode in production (artifact collection running at all times), and verify the configuration by taking it for a test drive. We did none of that, and we paid for it.
Third, trust the unified analysis. Don’t try to manually line up artifacts like we did; let yCrash do what it’s designed for. Its machine learning algorithms can discern patterns we would never see. It combines GC log insights from GCeasy, thread dump analysis from fastThread, and heap dump analysis from HeapHero into a single report.
Fourth, be proactive rather than merely reactive. Monitor before incidents occur. Set baselines while systems are healthy. We used yCrash only in desperation, but it should have been running all along.
What’s Better Than Manual Troubleshooting?
In hindsight, those three days we lost were not really about the memory leak; they were about our tools not being properly configured. We already had yCrash, we had the license, we had everything. We simply didn’t take the trouble to set it up properly.
Now that yCrash is running, similar problems can be detected and resolved in hours instead of days. The comprehensive artifact collection, automatic analysis, and consolidated reporting have transformed the way we approach production incidents.
If you decide to roll out yCrash in your enterprise, learn from our mistakes. Do yourself a favor and set it up properly from the start. Configure comprehensive artifact collection. Enable continuous monitoring. Don’t wait until you’re in crisis mode like we did. It’s the kind of up-front investment that pays for itself the first time you have a production incident. And believe me, you will have incidents. When that day comes, you’ll be glad you didn’t cut corners setting things up like we did. Try yCrash now.

Share your Thoughts!