Production Troubleshooting: A Holistic Approach to JVM Diagnostics

Annya Arun

2 days ago


When a production system goes down, every minute costs money. Alarms fire, users are affected, and you need a fix quickly, usually without the time or the calm to think it through. This guide is for developers, DevOps engineers, and SREs who have to fix JVM problems in live production systems.

JVM problems rarely show up in just one place. A memory leak appears first in the GC logs and then in heap dumps. A deadlock shows up in thread dumps. A CPU spike is visible in system metrics. To fix problems quickly, you need to read GC logs, thread dumps, heap dumps, and system metrics together to get a clear picture of what is happening.

This guide shows you how to troubleshoot JVM problems in a production system: what information to collect, in what order to collect it, and how to use GC logs, thread dumps, heap dumps, and system metrics to find and fix the problem.

The JVM Production Troubleshooting Toolkit

Before you can diagnose anything, you need to know which artifact to reach for, and when. The table below maps common production symptoms to the diagnostic tools most likely to surface the root cause.

Artifact	When to Use
GC logs	Latency spikes, OOM, throughput degradation
Thread dumps	Application hang, high CPU, deadlock suspicion
Heap dumps	OOM, memory leak, gradual memory growth
top / vmstat	CPU, memory, swap usage
iostat	Disk I/O contention
Network logs	Slow external calls, connection issues

Each of these tells a different part of the story. GC logs show memory behavior over time. Thread dumps capture what every thread in the JVM is doing at a specific moment. Heap dumps show what’s sitting in memory and why it’s being retained. System metrics reveal whether the problem is inside the JVM at all or whether the OS is the constraint.

There are several options to analyze these artifacts, but no single tool replaces the discipline of reading them together.

Heap dump caution: On large JVMs (8 GB heap or more), generating a heap dump triggers a Stop-The-World pause that can freeze your application for 30–120 seconds. Capturing a heap dump during peak traffic can worsen an already degraded service. Plan accordingly. Here’s a quick read if you want to know Why Java GC Freezes Your Application.

Symptom-Based JVM Troubleshooting Workflow

Recognizing symptoms correctly is half the battle. The same surface behavior – slow response times, high CPU, or an unresponsive UI – can have completely different root causes, depending on what the artifacts actually show. Below are the four most common JVM failure modes in production, with specific guidance on what to look for in each.

Symptom 1: Application Appears Frozen or Hung

What you observe: The application stops responding. Requests time out, health checks fail, and the process appears alive but makes no progress. Users report the system as “hung” or “stuck.”

What it typically indicates: A deadlock (threads waiting on each other in a cycle), severe lock contention, or an application blocked in an extended GC pause. In rare cases, all worker threads are blocked on a slow or unresponsive external dependency.

How to proceed:

There are 9 options to capture thread dumps from a running JVM. Use whichever fits your environment. Capture three dumps, 10 seconds apart. If the stack traces are identical across all three, the application is genuinely stuck rather than just slow.

Look for a Found one Java-level deadlock section in the output. If it’s there, the stack traces below it show the exact code paths involved. If there’s no deadlock reported, identify threads in BLOCKED or WAITING state and trace which thread holds the locks they’re waiting on.

Cross-check GC logs as well. A long Full GC or a series of back-to-back collections can mimic a hang: the threads aren’t deadlocked, they’re just waiting for GC to complete.
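The "same stack across several dumps" check can also be sketched in-process. The class below is a hypothetical helper (StuckThreadProbe and unchangedThreads are names invented here, not part of any tool mentioned in this guide). It uses the standard Thread.getAllStackTraces() call to take a few snapshots and reports threads whose stack never changed. It is a rough illustration of the idea, not a replacement for real thread dumps, which also carry lock and native thread information.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper: finds threads whose stack is identical across samples. */
final class StuckThreadProbe {

    /**
     * Takes `samples` snapshots, `intervalMillis` apart, and returns the names
     * of threads whose stack trace was the same in every snapshot.
     */
    static List<String> unchangedThreads(int samples, long intervalMillis) {
        Map<Thread, StackTraceElement[]> first = Thread.getAllStackTraces();
        Set<Thread> candidates = new HashSet<>(first.keySet());
        // The sampling thread itself is not interesting.
        candidates.remove(Thread.currentThread());
        // Threads without Java frames carry no evidence either way.
        candidates.removeIf(t -> first.get(t).length == 0);

        for (int i = 1; i < samples; i++) {
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break; // fewer samples than requested; treat the result with care
            }
            Map<Thread, StackTraceElement[]> next = Thread.getAllStackTraces();
            // Drop any thread that exited or moved to a different stack.
            candidates.removeIf(t -> !Arrays.equals(first.get(t), next.get(t)));
        }

        List<String> names = new ArrayList<>();
        for (Thread t : candidates) {
            names.add(t.getName());
        }
        return names;
    }
}
```

Calling StuckThreadProbe.unchangedThreads(3, 10_000) mirrors the three-dumps-ten-seconds-apart routine. Idle pool threads parked on an empty queue will also show up as unchanged, so the output is a list of candidates to inspect, not a verdict.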

Example, deadlock in a thread dump:

Found one Java-level deadlock:
=============================
"Thread-1":
  waiting to lock monitor 0x000000001a2b3c4d (object 0x00000000f1234567, a java.lang.Object),
  which is held by "Thread-2"
"Thread-2":
  waiting to lock monitor 0x000000001a2b3c5e (object 0x00000000f2345678, a java.lang.Object),
  which is held by "Thread-1"

		

This tells you exactly which threads are deadlocked and on which objects. The stack trace below this section shows the precise lines of code involved.
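The same deadlock information is available from inside the JVM through the standard java.lang.management API. The sketch below is one possible way to expose it, for example from a health endpoint. DeadlockProbe and describeDeadlocks are hypothetical names, and the output format is an assumption chosen to resemble the dump above.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: reports Java-level deadlocks via ThreadMXBean. */
final class DeadlockProbe {

    /** Returns one line per deadlocked thread, or an empty list if there is none. */
    static List<String> describeDeadlocks() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // Returns null (not an empty array) when no deadlock is found.
        long[] ids = threads.findDeadlockedThreads();
        List<String> report = new ArrayList<>();
        if (ids == null) {
            return report;
        }
        for (ThreadInfo info : threads.getThreadInfo(ids)) {
            if (info == null) {
                continue; // the thread ended between the two calls
            }
            report.add(String.format("\"%s\" waiting to lock %s, held by \"%s\"",
                    info.getThreadName(), info.getLockName(), info.getLockOwnerName()));
        }
        return report;
    }
}
```

This only helps while some thread in the JVM can still run the check. When the whole process is unresponsive, an external thread dump remains the reliable route.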

Symptom 2: High CPU Usage in the Java Process

What you observe: CPU utilization for the Java process spikes to 90–100% and stays there. Response times degrade, and the system may become unresponsive under load.

What it typically indicates: One or more threads burning CPU in a tight loop, expensive computation, or regex with catastrophic backtracking. GC thrashing, where the garbage collector runs almost continuously, can also drive CPU to saturation, so rule that out before assuming the cause is application logic.

How to proceed:

Capture three or more thread dumps, 10 seconds apart. Threads that remain RUNNABLE at the same stack frame across multiple dumps are the likely CPU consumers. The JVM also reports threads waiting inside native I/O calls as RUNNABLE, so confirm with per-thread CPU usage: run top -H -p <pid> alongside your thread dumps. The nid field in the thread dump is the native thread ID in hex; convert it to decimal and match it to the PID column in top -H.

There’s a detailed walkthrough of marrying thread dumps with top -H output for exactly this kind of CPU spike diagnosis.

Check GC logs as well. If GC frequency is extremely high and throughput is low, the root cause may be GC thrashing rather than application code, and the fix is heap sizing or allocation rate reduction, not a code change.
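For a quick first read on whether GC is eating the CPU, the standard GarbageCollectorMXBean counters give a rough figure. The sketch below is a minimal illustration; GcOverhead and its method names are hypothetical. It divides accumulated collection time by JVM uptime. Two caveats apply: the counters are cumulative since startup, so a recent burst is diluted, and for concurrent collectors the reported time is not purely pause time. GC logs remain the authoritative source.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

/** Hypothetical helper: approximate share of uptime spent in garbage collection. */
final class GcOverhead {

    /** Pure calculation, kept separate so it can be reused with values read from logs. */
    static double percent(long gcMillis, long uptimeMillis) {
        if (uptimeMillis <= 0) {
            return 0.0;
        }
        return 100.0 * gcMillis / uptimeMillis;
    }

    /** Reads the cumulative counters of the running JVM. */
    static double currentPercent() {
        long gcMillis = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long time = gc.getCollectionTime(); // -1 when the collector does not report it
            if (time > 0) {
                gcMillis += time;
            }
        }
        return percent(gcMillis, ManagementFactory.getRuntimeMXBean().getUptime());
    }
}
```

Sampling currentPercent() twice and comparing the difference over the interval gives a better view of what is happening right now than the since-startup average.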

Example, identifying a hot thread:

"worker-thread-5" #42 prio=5 os_prio=0 tid=0x00007f1a2c003800 nid=0x1a3f runnable
java.lang.Thread.State: RUNNABLE
  at com.example.ReportProcessor.parseLogLine(ReportProcessor.java:187)
  at com.example.ReportProcessor.processAll(ReportProcessor.java:104)

The nid=0x1a3f converts to decimal 6719, which maps directly to the PID shown in top -H. If this thread appears at the same stack frame across multiple dumps, parseLogLine at line 187 is your hot spot.
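The hex-to-decimal step is easy to get wrong by hand during an incident. A small sketch of the conversion follows, assuming the nid=0x… form shown in the dump above (some JDK versions may print the field differently, in which case the pattern needs adjusting). NidConverter is a hypothetical name.

```java
import java.util.OptionalLong;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: extracts the native thread ID from a thread dump header line. */
final class NidConverter {

    // Matches the hex nid field, e.g. "nid=0x1a3f".
    private static final Pattern NID = Pattern.compile("nid=0x([0-9a-fA-F]+)");

    /** Returns the nid as a decimal number, or empty if the line has no hex nid field. */
    static OptionalLong nativeThreadId(String headerLine) {
        Matcher m = NID.matcher(headerLine);
        if (!m.find()) {
            return OptionalLong.empty();
        }
        return OptionalLong.of(Long.parseLong(m.group(1), 16));
    }
}
```

Applied to the header line above, nativeThreadId returns 6719, the value to look for in the PID column of top -H.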

Symptom 3: OutOfMemoryError (Java Heap Space or Metaspace)

What you observe: The JVM terminates with an OutOfMemoryError, either Java heap space or Metaspace. The application may have been gradually slowing down before the crash, or the failure may occur suddenly under a load spike.

What it typically indicates: The heap or Metaspace has been exhausted. Common causes are memory leaks (objects retained longer than intended), an undersized heap for the actual workload, or a sudden allocation spike that exceeds available memory.

How to proceed:

First, make sure -XX:+HeapDumpOnOutOfMemoryError is enabled. Without it, the JVM won’t write a heap dump at the moment of failure, and post-mortem analysis becomes guesswork. You need the dump captured at the right moment; leaking objects can get garbage collected after OOM is thrown, making the leak much harder to isolate.
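One way to confirm the flag on a running HotSpot JVM is to read it through HotSpotDiagnosticMXBean, as sketched below. HeapDumpFlagCheck is a hypothetical name, and the bean is HotSpot-specific, so this is an assumption that will not hold on every JVM implementation.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

/** Hypothetical helper: checks whether -XX:+HeapDumpOnOutOfMemoryError is active. */
final class HeapDumpFlagCheck {

    static boolean heapDumpOnOomEnabled() {
        // HotSpot-specific platform MXBean; not guaranteed on other JVMs.
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        // VM option values are reported as strings: "true" or "false".
        String value = bean.getVMOption("HeapDumpOnOutOfMemoryError").getValue();
        return Boolean.parseBoolean(value);
    }
}
```

Logging this value at startup makes a missing flag visible long before the first OutOfMemoryError.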

Once you have the heap dump, analyze it for leak suspects: large retained sizes, unexpectedly high instance counts for domain objects, and GC root paths that reveal why objects are being retained (static caches, listeners, and session-scoped maps are common culprits). HeapHero surfaces these automatically, including the object reference chain keeping each suspect alive.

Review GC logs for the period before the OOM. Gradual Old Gen growth over hours or days strongly suggests a leak. A sudden spike with no prior growth trend points more toward a one-off allocation burst. Garbage collection patterns like repeated Full GCs with no memory reclaimed are a reliable early warning; the OOM itself is often the last event in a pattern that’s been building for hours.

Also, check the deployment history. A new unbounded cache, a changed eviction policy, or a newly registered listener frequently explains sudden memory growth in an otherwise stable application.

Example, GC log showing Old Gen exhaustion before OOM:

[2024-11-01T14:22:01.123+0000] GC(142) Pause Full (Ergonomics)
[2024-11-01T14:22:04.891+0000] GC(142) Old: 7812M->7814M(7820M)
[2024-11-01T14:22:04.891+0000] GC(142) Pause Full 3768ms

Old Gen is at 7814M of 7820M after a 3.7-second Full GC; it is not reclaiming anything. The next Full GC will very likely trigger an OOM. At this point, capturing and analyzing a heap dump is the right next step.
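The arithmetic behind that reading can be sketched as follows. The parser assumes the exact Old: before->after(capacity) layout of the example line above; real GC logs vary by collector and JDK version, so the pattern is an assumption to adapt. OldGenLine is a hypothetical name.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: reads Old Gen occupancy from one GC log line. */
final class OldGenLine {

    // Assumes the layout "Old: 7812M->7814M(7820M)" from the example above.
    private static final Pattern OLD =
            Pattern.compile("Old: (\\d+)M->(\\d+)M\\((\\d+)M\\)");

    final long beforeMb;
    final long afterMb;
    final long capacityMb;

    private OldGenLine(long beforeMb, long afterMb, long capacityMb) {
        this.beforeMb = beforeMb;
        this.afterMb = afterMb;
        this.capacityMb = capacityMb;
    }

    static Optional<OldGenLine> parse(String line) {
        Matcher m = OLD.matcher(line);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(new OldGenLine(
                Long.parseLong(m.group(1)),
                Long.parseLong(m.group(2)),
                Long.parseLong(m.group(3))));
    }

    /** Memory freed by the collection; zero or negative means nothing was reclaimed. */
    long reclaimedMb() {
        return beforeMb - afterMb;
    }

    /** Occupancy after the collection, as a percentage of capacity. */
    double occupancyPercent() {
        return 100.0 * afterMb / capacityMb;
    }
}
```

For the example line, reclaimedMb() is negative and occupancyPercent() is above 99.9, which is the numeric form of "a Full GC that reclaims nothing on a nearly full Old Gen".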

Symptom 4: Slow or Degraded Response Times

What you observe: Latency increases (p50, p95, or p99 response times rise) without an obvious crash or hang. Throughput may drop. The degradation can be gradual or intermittent.

What it typically indicates: GC pause time spikes, I/O bottlenecks (database, network, disk), lock contention, or resource exhaustion, such as connection pool depletion. The root cause often spans both the JVM and the infrastructure underneath it.

How to proceed:

Capture thread dumps during the slow periods. Look for threads blocked on I/O: socketRead0, JDBC driver calls, NIO selectors. If a large proportion of your executor threads show this pattern, the bottleneck is external, not the JVM. There’s a detailed breakdown of diagnosing CPU spikes and latency degradation from thread dump evidence worth reviewing alongside this.

Analyze GC logs for pause time distribution. If p99 GC pauses exceed your latency SLA, GC tuning or a collector change may be required. Reading GC logs systematically (allocation rate, promotion rate, and pause distribution) is what turns a GC log from noise into a diagnostic signal.
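The p99 comparison can be sketched with a simple nearest-rank percentile over pause durations already extracted from the GC log. GcPausePercentile is a hypothetical name, and nearest-rank is one of several percentile definitions, chosen here because it always returns a pause that actually occurred.

```java
import java.util.Arrays;

/** Hypothetical helper: nearest-rank percentile over GC pause durations. */
final class GcPausePercentile {

    /** Returns the p-th percentile (0 < p <= 100) of the given pauses, in milliseconds. */
    static long percentile(long[] pausesMs, double p) {
        if (pausesMs.length == 0) {
            throw new IllegalArgumentException("no pauses recorded");
        }
        if (p <= 0 || p > 100) {
            throw new IllegalArgumentException("p must be in (0, 100]");
        }
        long[] sorted = pausesMs.clone();
        Arrays.sort(sorted);
        // Nearest-rank: the smallest value with at least p% of samples at or below it.
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    /** True when the p99 pause alone is longer than the latency budget. */
    static boolean p99ExceedsSla(long[] pausesMs, long slaMs) {
        return percentile(pausesMs, 99.0) > slaMs;
    }
}
```

With only a handful of samples, p99 is simply the longest pause, so collect enough GC events for the percentile to mean something before drawing conclusions.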

Review system metrics as well. High disk utilization (iostat), swap activity (vmstat), or network latency can all exacerbate JVM behavior or be the actual root cause. JVM issues and infrastructure issues often appear simultaneously and need to be ruled out in parallel.

Example: thread dump showing I/O-blocked threads:

"http-nio-8080-exec-23" #89 prio=5 os_prio=0 tid=0x00007f1a2c1a3000 nid=0x1b2c runnable
java.lang.Thread.State: RUNNABLE
  at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
  at com.mysql.cj.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:1839)
  at com.example.UserRepository.findBySessionId(UserRepository.java:62)

The JVM reports a thread waiting inside a native I/O call as RUNNABLE, so judge these threads by their stack frames rather than by their state. If, for example, 20 out of 25 executor threads show this pattern, your database is the bottleneck, not the JVM.
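Counting how many threads share an I/O pattern is tedious by eye. The sketch below shows one way to do it, assuming the stacks have already been parsed into a map of thread name to frame lines (that parsing step is not shown). IoBlockedCounter is a hypothetical name, and the marker strings are taken from the examples in this section; extend them for the drivers and clients you actually use.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: counts threads whose stack contains an I/O marker frame. */
final class IoBlockedCounter {

    // Frame substrings from this section's examples; an assumption, not a complete list.
    private static final List<String> IO_MARKERS =
            Arrays.asList("socketRead0", "epollWait", "com.mysql.cj.jdbc");

    /** stacksByThread maps a thread name to its stack frames, top frame first. */
    static long countIoBlocked(Map<String, List<String>> stacksByThread) {
        return stacksByThread.values().stream()
                .filter(frames -> frames.stream()
                        .anyMatch(frame -> IO_MARKERS.stream().anyMatch(frame::contains)))
                .count();
    }
}
```

Dividing the result by the number of executor threads gives the proportion discussed above. An idle NIO acceptor or selector thread also sits in epollWait, so restrict the input to worker threads before counting.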

How to Capture JVM Diagnostic Artifacts in Production

Manual capture of thread dumps, heap dumps, and system logs is error-prone under pressure. During an active incident, the last thing you want is to be hunting for the right jstack invocation or forgetting to grab vmstat alongside the thread dump. The yc-360 script is a lightweight utility designed to run directly on production servers and capture a comprehensive artifact set automatically: GC logs, thread dumps, vmstat, iostat, top, top -H, and network statistics in a single pass.

Note on heap dumps: yc-360 does not capture heap dumps by default, given the Stop-The-World pause risk on large JVMs. To include one, explicitly pass the -hd argument. Only do this when you’ve confirmed a memory-related issue and can tolerate the pause.

The script supports three modes.

1. On-Demand Capture

Best for incident response when you need artifacts immediately. You invoke the script manually with the target application’s PID. It captures all configured artifacts and transmits them to the yCrash server, where the incident appears in your dashboard with a unified report, cross-artifact correlation, and root cause recommendations.

Here’s how you configure and run yc-360 in On-Demand Mode.

2. Only-Capture Mode (Air-Gapped or Data-Residency Environments)

Best for environments with strict data residency requirements or air-gapped infrastructure. The script captures the same artifact set but saves everything locally as a ZIP file. Nothing is transmitted automatically. You can upload the bundle to yCrash manually when ready, giving you full control over whether diagnostic data leaves your environment.

3. Micro-Metrics Monitoring (M3) Mode

Best for proactive monitoring without manual intervention. In M3 Mode, yc-360 runs as a background daemon, collecting lightweight metrics every 3 minutes (configurable via m3Frequency). When it detects an anomaly (a memory trend pointing toward OOM, for example), it triggers a full artifact capture automatically, before anyone has to be at the keyboard.

Here’s how you configure and run yc-360 in M3 Mode.

Correlating JVM Artifacts: Reading the Full Picture

Individual artifacts do not tell you everything about what your JVM is doing. The key to troubleshooting in production is to look at all the information you have and see how it fits together. Four patterns come up again and again in production incidents.

Thread dump shows BLOCKED threads – find who holds the lock. A BLOCKED thread is waiting for a lock that another thread owns. Follow the chain in the thread dump to the owning thread and check what it is doing: running application code, waiting on I/O or a remote call, or itself blocked on another lock. GC threads never own Java locks, so a GC pause will not appear as the lock holder. If the owner shows no progress and the GC logs record a long pause in the same window, the application is waiting for garbage collection to finish, and the fix is to adjust the garbage collection settings, not to change the code.
GC logs show long pauses – align with latency graphs. Note the timestamps of the long garbage collection pauses and compare them with your APM or monitoring dashboards. If the latency spikes line up with the pauses, garbage collection is causing the slowdown. If they do not, the latency is coming from something else, such as the database, network, or disk. This is easy to miss if you are only looking at one artifact at a time.
Heap dump shows a large cache or collection – check deployment history. A new unbounded cache, a changed eviction policy, or a newly registered listener often explains sudden memory growth. Use the dominator tree in the heap dump to find the objects retaining the most memory, then look for releases that touched those objects. Comparing the date performance dropped with your deployment dates usually narrows the search quickly.
vmstat shows high si and so (swap in/out) – check memory limits. Once the operating system starts moving JVM memory to disk, performance degrades severely. If you are running in containers such as Docker or Kubernetes, the container memory limit may be too low, so the operating system swaps even though the host appears to have free physical memory. The usual fix is to increase the container memory limit or reduce -Xmx, not to adjust the garbage collection settings.
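The GC-pause-versus-latency alignment described above comes down to a timestamp comparison. A minimal sketch follows, assuming the pause start times and the latency spike times have already been extracted from the GC log and the monitoring system as Instant values. PauseCorrelation, spikesNearGcPause, and the tolerance window are all hypothetical choices made for illustration.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

/** Hypothetical helper: counts latency spikes that fall close to a GC pause. */
final class PauseCorrelation {

    /**
     * Returns how many spikes occurred within `tolerance` of any GC pause start.
     * A high share suggests GC is the cause; a low share points elsewhere.
     */
    static long spikesNearGcPause(List<Instant> spikes,
                                  List<Instant> gcPauseStarts,
                                  Duration tolerance) {
        return spikes.stream()
                .filter(spike -> gcPauseStarts.stream()
                        .anyMatch(pause ->
                                Duration.between(pause, spike).abs().compareTo(tolerance) <= 0))
                .count();
    }
}
```

Both sources must use the same clock and time zone for this to be meaningful. The GC log example earlier in this guide uses UTC offsets, and dashboards often default to local time.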

yCrash integrates GC logs, thread dumps, heap dumps, and system metrics into a single analysis view with cross-artifact correlation and root cause recommendations. Here are some case studies on real Fortune 500 JVM outages.

Tools for JVM Production Troubleshooting

Thread dump analysis: fastThread analyzes thread dumps and automatically surfaces deadlocks, BLOCKED threads, CPU-consuming threads, and repeating stack trace patterns without requiring manual inspection.
GC log analysis: GCeasy parses GC logs across all formats and collectors, reporting key performance indicators (throughput, pause time distribution, allocation rate, Old Gen growth trend) in a single report.
Heap dump analysis: HeapHero analyzes heap dumps to surface leak suspects, large retained objects, dominator trees, and the GC root paths keeping objects alive. There are several heap dump analyzers available; find the one that fits your workflow.
Unified diagnostics: yCrash ties all three together, and matches them up with platform diagnostics such as top. It captures artifacts automatically via the yc-360 script, and correlates them into a single root cause analysis report.

Conclusion

Treat production JVM troubleshooting as an everyday discipline rather than something you do only when there is a problem. A few practices apply every time:

Capture several types of artifact. A single thread dump or GC log will not tell you the whole story. You need to look at GC logs, thread dumps, heap dumps, and system metrics together.

Line up the timestamps across all of them. The cause usually becomes clear when you find two things that happened at the same time, like a GC pause and a latency spike, or a heap trend and a deployment. The diagnosis comes from that correlation, not from any one artifact.

Make sure you know how to use your tools before you need them. Practice using them when everything is working fine so that when there is a problem, you can use them easily.

Be careful with heap dumps in production. The pause while the heap dump is taken can make the problem worse, so capture one only when you know you have a memory problem. If you do need a heap dump, use the -hd option in yc-360 so a full range of diagnostics is captured at the same time.

Production JVM troubleshooting is much easier with a systematic approach and the right tools, and you find the problem faster. Establish the root cause of every problem so you can prevent it from happening again.