Troubleshooting JVM Outages – 3 Fortune 500 case studies Webinar

Ram Lakshmanan

2 years ago

In August 2024, we hosted an insightful webinar featuring three real-world case studies from Fortune 500 companies, designed to help you troubleshoot JVM outages effectively. During this session, we delved into significant outages at major enterprises, analyzing thread dumps, heap dumps, and GC logs. You’ll gain practical insights and techniques to address CPU spikes, OutOfMemoryErrors, and application unresponsiveness, all while enhancing your problem-solving skills under expert guidance.

Videos

Part 1:

Part 2:

Part 3:

Here’s the presentation deck, so you can revisit the topics covered and review the key takeaways and strategies discussed:

Q&A:

Below are the questions asked during the webinar, along with the expert responses provided.

Note: Many of the topics discussed here are also covered in greater detail in our course materials.

Question 1: Is there a way to capture a thread dump automatically upon detecting a slow application?

By Kamal Hashem from Vodafone

Answer: Yes, there is a proactive way. The yCrash tool has a feature called micro-metrics monitoring, which predicts and mitigates performance issues before they surface in your production environment. Please refer to this blog for a detailed understanding: What is Micro-Metrics Monitoring?
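For readers who want to see the general idea in plain Java, here is a minimal sketch of a watchdog that captures a thread dump when a piece of work runs longer than a threshold. It only uses the standard `java.lang.management` API. The class name `ThreadDumpWatchdog`, the method names, and the thresholds are hypothetical, and this is not how yCrash's micro-metrics monitoring is implemented.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

public class ThreadDumpWatchdog {

    /** Builds a plain-text dump of every live thread in this JVM. */
    static String captureThreadDump() {
        StringBuilder dump = new StringBuilder();
        // false, false = skip locked-monitor and synchronizer details (cheaper)
        for (ThreadInfo info : ManagementFactory.getThreadMXBean().dumpAllThreads(false, false)) {
            dump.append('"').append(info.getThreadName()).append("\" ")
                .append(info.getThreadState()).append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                dump.append("    at ").append(frame).append('\n');
            }
            dump.append('\n');
        }
        return dump.toString();
    }

    /**
     * Runs the task on the calling thread. If it is still running after
     * thresholdMillis, a thread dump is taken while the task is in flight.
     * Returns the dump, or null if the task finished in time.
     */
    static String runWithWatchdog(Runnable task, long thresholdMillis) {
        ScheduledExecutorService watchdog = Executors.newSingleThreadScheduledExecutor();
        try {
            Callable<String> dumper = ThreadDumpWatchdog::captureThreadDump;
            ScheduledFuture<String> pending =
                watchdog.schedule(dumper, thresholdMillis, TimeUnit.MILLISECONDS);
            task.run();
            if (pending.cancel(false)) {
                return null; // task finished before the watchdog completed a dump
            }
            return pending.get(); // the watchdog fired while the task was running
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        } catch (ExecutionException e) {
            throw new IllegalStateException("Thread dump capture failed", e);
        } finally {
            watchdog.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Hypothetical slow task: sleeps 300 ms against a 50 ms threshold.
        String dump = runWithWatchdog(() -> {
            try {
                Thread.sleep(300);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, 50);
        System.out.println(dump == null ? "Task finished in time" : dump);
    }
}
```

Taking the dump while the slow work is still running matters: a dump taken after the request completes no longer shows what the thread was stuck on.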

Question 2: Most articles on the net mention that minor GC doesn’t stop the application. I strongly believe there is no such thing, and both minor and major stop the world/application! Is this a correct understanding? Can you please explain?

By Kamal Hashem from Vodafone

Answer: 100% yes. Young GC events (minor GCs) do cause Stop-The-World (STW) pauses, but these pauses are generally shorter than those of full GC events. While young GCs do pause the application, the impact is typically smaller because of the shorter pause times.

Question 3: If correct health checks are implemented on the load balancers, then 502 would not be an issue. Here, LB > Nginx > Tomcat will not be healthy, hence the LB won’t forward any requests to that EC2 instance. Hope my understanding is correct!

By Kamal Hashem from Vodafone

Answer: As mentioned in the call, in this scenario Tomcat was not responding, so Nginx (acting as a reverse proxy) was unable to communicate with Tomcat, leading to the 502 error.

Question 4: You must have seen a lot of patterns. Have you thought to integrate AI in your tool stack? Or perhaps avail an LLM for this? This can be used by non-technical users.

By Kamal Hashem from Vodafone

Answer: We are already using a lot of ML algorithms in our tools to do this pattern recognition.

Question 5: For the first case study: Couldn’t the high number of threads be due to load on external traffic, which ended up having a similar stack trace?

By Chaitanya Arora

Answer: It is one possibility. But that is not the case here. In this application, even when traffic volume is high, typical thread count would be 400 – 500, whereas we were seeing 1800+ threads.

Question 6: How was the OOM sacrifice child solved?

By Chaitanya Arora

Answer: The enterprise upgraded its EC2 instance to one with a larger RAM capacity.

Question 7: I mean, was it done by understanding why the container is running in an OOM state?

By Chaitanya Arora

Answer: The application was set up to run in a container which had very limited RAM capacity. The JVM has multiple internal memory regions. On top of that, the kernel needs memory. All of them couldn’t fit in the commissioned container. Thus the kernel was terminating the Tomcat process.
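To make the point about multiple internal memory regions concrete, the following sketch lists the memory pools that the running JVM exposes through the standard `MemoryPoolMXBean` API. The class name `JvmMemoryRegions` is hypothetical. Note that this view is incomplete: memory such as thread stacks and direct buffers is not reported as a pool, so the real process footprint is larger than the sum shown here.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

public class JvmMemoryRegions {

    /** Counts the pools of the given type (HEAP or NON_HEAP). */
    static int countPools(MemoryType type) {
        int count = 0;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() == type) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage();
            if (usage == null) {
                continue; // the pool is no longer valid
            }
            // getMax() returns -1 when no maximum is defined for the pool
            String max = usage.getMax() < 0 ? "undefined" : (usage.getMax() / 1024) + " KB";
            System.out.printf("%-10s %-35s used=%d KB, max=%s%n",
                pool.getType().name(), pool.getName(), usage.getUsed() / 1024, max);
        }
    }
}
```

The pool names printed depend on the JVM vendor and the GC algorithm in use.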

Question 8: How to analyse off heap OOM issue?

By Sarthak Shukla from Clevertap

Answer: To analyse an off-heap OOM issue, you can use Native Memory Tracking (NMT) to monitor and track memory usage outside of the Java heap. You can refer to this blog post for more details on understanding Native Memory Tracking in Java. Also, here is a video where Ram talks about 9 types of OOM, including the OOM that happens off-heap.
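NMT itself is switched on with the JVM flag `-XX:NativeMemoryTracking=summary` and queried with `jcmd <pid> VM.native_memory summary`. As a complementary in-process check, the sketch below reads the JVM's NIO buffer pools through the standard `BufferPoolMXBean` API, which covers one common source of off-heap growth (direct byte buffers). The class name `OffHeapBuffers` is hypothetical, and this is only a partial view of native memory, not a replacement for NMT.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;

public class OffHeapBuffers {

    /** Returns the bytes currently held by the "direct" buffer pool, or -1 if it is not found. */
    static long directBytesUsed() {
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName())) {
                return pool.getMemoryUsed();
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        long before = directBytesUsed();
        // Allocate 16 MB outside the Java heap.
        ByteBuffer buffer = ByteBuffer.allocateDirect(16 * 1024 * 1024);
        long after = directBytesUsed();
        System.out.printf("Direct buffer pool grew by %d bytes (buffer capacity %d)%n",
            after - before, buffer.capacity());
    }
}
```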

Question 9: Isn’t a young GC without STW pauses?

By Valeri Tsolov from Clevertap

Answer: Young GC events do involve Stop-The-World (STW) pauses, but the duration of these pauses is generally shorter compared to full GC events. While young GCs do pause the application, the impact is typically less due to the shorter pause times.

Question 10: What GC algorithm were they using?

By Kulbir Singh from Startree

Answer: They were using the Parallel GC algorithm. Here is a post where we compare various GC algorithms and discuss which one to choose.

Question 11: What state will the application threads be in when the GC is in progress? Just checking, if I start with thread dump analysis, would I be able to discover that the threads are blocked by slow GC cycles?

By Raja Raja Kaliappan from Cognizant

Answer: When the GC is in progress, the threads remain in the same state they were in prior to the GC cycle. It will not change. Our fastThread tool’s ML algorithms can identify and report when threads are paused due to GC.

Question 12: Is there a way we can decide the right heap size for my application and particularly how to distribute between Eden/Survivor/OldGen after monitoring the GC patterns?

By Raja Raja Kaliappan from Cognizant

Answer: Yes, you can determine the appropriate heap size and distribution by analyzing GC patterns. GC reports can reveal whether the heap is overallocated, allowing you to reduce the heap size accordingly for optimal performance.

Question 13: Is there a way we can proactively monitor the throughput and total GC time in real-time?

By Raja Raja Kaliappan from Cognizant

Answer: Yes, there are two approaches:

a. You can use this REST API

b. You can leverage micro-metrics monitoring (m3) to monitor not only GC, but the overall application. You can refer to this blog post to understand the details of micro-metrics monitoring (m3).
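Independently of the two approaches above, the JVM exposes cumulative GC counters through the standard `GarbageCollectorMXBean` API. The sketch below derives an approximate GC throughput from them. The class name `GcThroughputProbe` is hypothetical, and `getCollectionTime()` is documented as an approximate accumulated elapsed time, which for concurrent collectors is not identical to pause time, so treat the result as a rough indicator only.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcThroughputProbe {

    /** Sums the approximate collection time (ms) reported by all collectors. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime(); // -1 when undefined
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    /** Percentage of uptime NOT spent in GC. */
    static double throughputPercent(long gcMillis, long uptimeMillis) {
        if (uptimeMillis <= 0) {
            return 100.0;
        }
        return 100.0 * (uptimeMillis - gcMillis) / uptimeMillis;
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: count=%d, time=%d ms%n",
                gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
        long uptime = ManagementFactory.getRuntimeMXBean().getUptime();
        System.out.printf("Approximate GC throughput: %.2f%%%n",
            throughputPercent(totalGcMillis(), uptime));
    }
}
```

Calling this periodically (for example from a scheduled task) and comparing successive readings gives a simple in-process view of how GC time is trending.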

Question 14: How do we know when the young generation configuration needs to be changed or optimized?

By Sanjay Gopalani from Bechtel

Answer: Performing GC log analysis will help you identify the young gen configuration changes that need to be done to your application.

Question 15: Are too many minor GCs also problematic?

By Ganesh padmanaban

Answer: Yes, frequent minor GCs can be problematic. While they don’t have the same impact as full GCs, they can still cause performance issues, especially if they occur too frequently.

Question 16: Can we run the yCrash script in Kubernetes containers? Can I run it directly in my pod? I checked the yCrash blog—YC script on Kubernetes pods. Could you please share some insights?

By Ganesh padmanaban

Answer: Yes, you can. You can add the yc-agent as a sidecar to your application pod. The agent will fetch the PID of your running Java application and continuously monitor it in Micro-Metrics Monitoring mode (m3) for any application-level issues. If you prefer not to have automated metrics capture, you can also run the yc-agent in onDemand mode for manual metrics capture.

For more detailed information, please refer to the links below. If you have further questions, feel free to email us, and we’ll get back to you or schedule a working session to assist you.

Link 1 : Run yc-agent in K8 – https://docs.ycrash.io/ycrash-agent/kubernetes/deployment-options.html#run-yc-agent-in-kubernetes

Link 2 : Micro-Metrics Monitoring Mode (m3) – https://docs.ycrash.io/ycrash-agent/micrometrics.html#how-to-activate-micro-metrics-monitoring-m3

Link 3 : OnDemand Mode – https://docs.ycrash.io/ycrash-agent/launch-modes/on-demand-mode.html#launching-ycrash-agent-with-on-demand-mode

Question 17: I have a Kubernetes container app where the liveness and readiness probes fail after 35 RPS. There are no resource issues, but many minor GCs are happening. That’s it. We are using G1GC. Should we go for tuning?

By Ganesh padmanaban

Answer: 100% you should look at GC behaviour. If minor GCs run and pause your application, then the health check will fail.

Question 18: Why is changing the space for the younger generation solving the problem?

By Mohit Solanki from Flipkart

Answer: Adjusting the size of the young generation helps in optimizing garbage collection and memory management in the JVM. It prevents the short-lived objects from being prematurely promoted to the old generation, reduces the frequency of Full GCs, and leads to better overall performance, particularly during periods of high load. For more details on this optimization case study, refer to this post.

Question 19: For how much time do we need to take GC dumps for analysis?

By Mohit Solanki from Flipkart

Answer: It’s always a good practice to analyze a GC log covering 24 hours, since over that period your application would have seen both high-tide and low-tide traffic volumes. Please refer to this GC Logging Best Practices post.

Question 20: 4% is cumulative, correct? I.e., the 56-minute loss is not continuous but across the GC log time.

By Rodney Taylor

Answer: Yes, the 4% mentioned in yesterday’s webinar is cumulative and the 56 minutes loss is not continuous.

Question 21: What is the difference between user, sys, and real times?

By Rodney Taylor

Answer: Three types of time are reported for every single GC event in the garbage collection log file: user, sys, and real time.

User time is the amount of CPU time spent in user-mode code (outside the kernel) within the process.

Sys time is the amount of CPU time spent in the kernel within the process.

Real time is wall clock time – time from start to finish of the call.

To get more insight into these time differences you can refer to this blog.

There are a few patterns around these user, sys, and real times that one can observe:
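As a rough illustration, the sketch below parses the `[Times: ...]` section of a JDK 8 style GC log line and flags three commonly cited relationships between the values: real time at or below user + sys (typical when several GC threads work in parallel), real time above user + sys (GC threads were waiting, for example on I/O or for CPU), and sys time above user time (kernel-side work dominates). These heuristics, the class name `GcTimesCheck`, and the sample log lines are assumptions for illustration and not necessarily the patterns presented in the webinar.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GcTimesCheck {

    enum Finding { TYPICAL, REAL_EXCEEDS_CPU_TIME, SYS_EXCEEDS_USER, NO_TIMES_FOUND }

    // Matches e.g. "[Times: user=0.09 sys=0.00, real=0.02 secs]"
    private static final Pattern TIMES = Pattern.compile(
        "user=(\\d+\\.\\d+)\\s+sys=(\\d+\\.\\d+),\\s+real=(\\d+\\.\\d+)\\s+secs");

    static Finding classify(String logLine) {
        Matcher m = TIMES.matcher(logLine);
        if (!m.find()) {
            return Finding.NO_TIMES_FOUND;
        }
        double user = Double.parseDouble(m.group(1));
        double sys = Double.parseDouble(m.group(2));
        double real = Double.parseDouble(m.group(3));

        if (sys > user) {
            return Finding.SYS_EXCEEDS_USER;      // kernel time dominates
        }
        if (real > user + sys) {
            return Finding.REAL_EXCEEDS_CPU_TIME; // wall clock exceeds CPU time: waiting
        }
        return Finding.TYPICAL;                   // CPU time spread across GC threads
    }

    public static void main(String[] args) {
        // Hypothetical sample lines
        String[] samples = {
            "[Times: user=0.09 sys=0.00, real=0.02 secs]",
            "[Times: user=0.20 sys=0.01, real=18.45 secs]",
            "[Times: user=0.10 sys=1.50, real=0.80 secs]"
        };
        for (String line : samples) {
            System.out.println(classify(line) + "  <-  " + line);
        }
    }
}
```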

Question 22: What’s the recommended setting for YoungGen and OldGen?

By Raguraman Balasubramanian from RBC

Answer: There is no single standardized formula. However, for most applications, setting the Young Gen size to 1/3 of the Old Gen size will work.
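In HotSpot, a Young Gen that is 1/3 of the Old Gen corresponds to `-XX:NewRatio=3` (old : young = 3 : 1), which means the Young Gen takes a quarter of the heap. The small sketch below just does that arithmetic; the class name `GenerationSplit` and the 4 GB heap are hypothetical examples, and whether an explicit ratio is appropriate depends on the GC algorithm in use.

```java
public class GenerationSplit {

    /** Young Gen size when old : young = newRatio : 1. */
    static long youngGenBytes(long heapBytes, int newRatio) {
        return heapBytes / (newRatio + 1);
    }

    /** Old Gen size is whatever the Young Gen does not take. */
    static long oldGenBytes(long heapBytes, int newRatio) {
        return heapBytes - youngGenBytes(heapBytes, newRatio);
    }

    public static void main(String[] args) {
        long heap = 4L * 1024 * 1024 * 1024; // hypothetical 4 GB heap
        int newRatio = 3;                    // young = 1/3 of old
        System.out.printf("young=%d MB, old=%d MB%n",
            youngGenBytes(heap, newRatio) >> 20, oldGenBytes(heap, newRatio) >> 20);
    }
}
```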

Question 23: Thank you, Ram. But setting the initial heap equal to the max heap will result in full GC happening after a long time?

By Raguraman Balasubramanian from RBC

Answer: You need to study the GC behaviour to come up with the optimal heap size. However, there are certain advantages in setting the initial and max heap size to the same value.

Question 24: Does it require the yCrash agent to be installed on VMs?

By Girish Bhopale

Answer: The agent is only required if you plan to perform micro-metrics monitoring (m3). When installing, the agent should be placed on the same host or device where the target application is running. It does not need to be installed within the JVM itself. For more details on micro-metrics monitoring, please refer to this post.

Question 25: Is there any error log analyzer tool available?

By Girish Bhopale

Answer: Yes, we do have an error log analyzer, “App log”, in our yCrash product suite.

Question 26: How do we find the young gen config needs to change?

By Vigneshwaran V from Solartis

Answer: Performing GC log analysis will help you identify the young gen configuration changes that need to be done to your application.

Question 27: On the garbage collection use case: are there other configurations done to the JVM other than Xms and Xmx?

By Carlos Lopez from Hitachi Vantara

Answer: Yes, it depends on the GC algorithm you are using. The following blogs will provide you with valuable insights into tuning:

Question 28: Consider a scenario where an application is experiencing intermittent HTTP 500 errors when making requests to a third-party API. How do we debug 500 errors? Do we have to check the kernel log?

By Ravi Shanker Kumar

Answer: HTTP 500 errors can happen because of any of the following reasons:

Garbage collection pauses
Threads getting BLOCKED
Network connectivity
Load balancer routing issue
Heavy CPU consumption of threads
Operating System running with old patches
Memory Leak
DB not responding properly
Kernel issues
Backend slow downs
Hypervisor/container orchestrator not allocating enough resources

Thus, you need to capture and analyse 360-degree artifacts.

Question 29: If the response times are high, are there any other layers/components besides GC that cause high response times?

By Srikanth Mannepalli

Answer: Yes, garbage collection is not the only cause of high response times. A wide range of other factors can contribute. A thorough analysis of the entire system, including application code, database performance, network latency, server infrastructure, and configuration settings, is essential to identify and resolve performance bottlenecks. You might want to do a 360-degree analysis.

Question 30: Does your script require jmap to be pre-installed, or does it install it itself? How safe is it to run your 360 script?

By Sumanth sure from Freshworks

Answer: No, it is not required. Our tool can work even without jmap. Please refer to this post to see how we work around it when tools such as jmap are not installed.

Question 31: What is generally the right number of threads for a JVM, taking CPU into consideration?

By Sumanth sure from Freshworks

Answer: The optimal number of threads for a JVM application depends on several factors, including your server traffic, SLA requirements, and device capacity. For CPU-bound tasks, start with a thread count equal to the number of cores; for I/O-bound tasks, consider increasing the number of threads.
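As a starting point only, the guidance above can be sketched in Java as follows. The I/O-bound formula used here, cores × (1 + wait time / compute time), is a widely used rule of thumb and was not part of the webinar; the class name `PoolSizing` and the wait-to-compute ratio of 4 are hypothetical. Real values should come from measuring your own workload.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PoolSizing {

    /** CPU-bound work: one thread per core as a first guess. */
    static int cpuBoundThreads(int cores) {
        return Math.max(1, cores);
    }

    /**
     * I/O-bound work: scale by how long a task waits relative to how long it computes.
     * waitToComputeRatio = 4.0 means a task waits four times longer than it computes.
     */
    static int ioBoundThreads(int cores, double waitToComputeRatio) {
        return Math.max(1, (int) Math.round(cores * (1 + waitToComputeRatio)));
    }

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.printf("cores=%d, cpuBound=%d, ioBound(ratio 4)=%d%n",
            cores, cpuBoundThreads(cores), ioBoundThreads(cores, 4.0));

        ExecutorService cpuPool = Executors.newFixedThreadPool(cpuBoundThreads(cores));
        cpuPool.shutdown(); // nothing submitted in this sketch
    }
}
```

In a container, `availableProcessors()` reflects the CPU limit that the JVM detects, which may be lower than the host's core count.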

Question 32: What is the overhead of micro-metrics monitoring?

By Gayan Phillips from IronOne Technologies Innovation Center

Answer: The overhead is almost negligible, to the point that it can barely be measured, since we only monitor the GC log, the application log, and thread dumps. The GC log and application log are already written to disk, so the only interaction with the JVM is for the thread dump, which introduces very minimal overhead.

Question 33: For a Java application running on containers/k8s, what JVM options/tuning are required due to memory/CPU constraints?

By Tejas Pillai

Answer: Yes, JVM tuning is 100% required, irrespective of whether you are running on bare metal, VMs, or containers/k8s.

Question 34: About the 3rd case study, how was the cause of the OOM identified? Because the container was running out of native memory, it was not the VM that was running out of heap.

By Coen Wouters from Planon

Answer: Because the container was running out of native memory, the Tomcat server was terminated. The JVM’s own memory utilization was still under control.

Question 35: If I use -XX:MaxRAM, is native memory then inside or outside this configured amount of memory?

By Coen Wouters from Planon

Answer: If you use -XX:MaxRAMPercentage, it only sets the maximum size of the JVM’s heap region. It doesn’t set the size of the entire JVM process.

You can refer to this video which will help you understand the different JVM regions.
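One way to check this on your own setup is to print the maximum heap that the JVM has actually picked. The sketch below uses the standard `Runtime.maxMemory()` call; the class name `HeapLimitCheck` and the 2 GB container in the example are hypothetical. Run it with, for example, `java -XX:MaxRAMPercentage=50.0 HeapLimitCheck` inside a memory-limited container: the reported heap should be close to half of the container limit, and everything else (metaspace, thread stacks, direct buffers, and so on) has to fit in the remainder.

```java
public class HeapLimitCheck {

    /** Maximum heap the JVM will attempt to use, in bytes. */
    static long maxHeapBytes() {
        return Runtime.getRuntime().maxMemory();
    }

    /** Heap size expected for a given amount of available memory and percentage. */
    static long expectedMaxHeap(long availableBytes, double percentage) {
        return (long) (availableBytes * percentage / 100.0);
    }

    public static void main(String[] args) {
        System.out.printf("Max heap chosen by the JVM: %d MB%n", maxHeapBytes() >> 20);

        long containerLimit = 2L * 1024 * 1024 * 1024; // hypothetical 2 GB container
        System.out.printf("50%% of a 2 GB container would be: %d MB%n",
            expectedMaxHeap(containerLimit, 50.0) >> 20);
    }
}
```

The value from `maxMemory()` can be slightly below the configured maximum, depending on the GC algorithm.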

Question 36: Instead of a thread dump, shall we upload java.core and check?

By Premanand from KLA Corporation

Answer: No. yCrash fastThread doesn’t analyze Java core dumps (the exception being IBM core dumps). However, we can analyze heap dump (*.hprof) files in our other tool: heaphero.io

Question 37: Is there any preferred min and max heap ratio?

By Jahnavi Chamarthi from NS Corp

Answer: Setting the initial and max heap size to the same value gives certain advantages to your application. Those benefits are discussed in this post: Benefits of setting initial and maximum memory size to the same value

Question 38: As you know, even if the boxes have more resources, the application has a max memory parameter. In this case, what percentage of the memory needs to be assigned to the JVMs?

By Tinku Kumar

Answer: To answer this question, you need to understand the different JVM Memory regions. This video should help you to gain this knowledge: JVM Explained in 10 minutes!