Java applications running in production can face performance problems such as bottlenecks, high CPU consumption, memory leaks, and unresponsive threads. Diagnosing these issues manually is complex, time-consuming, and requires deep JVM expertise. Wouldn’t it be great if you could have an AI-powered assistant that automates JVM troubleshooting?
By integrating yCrash APIs with AI-based techniques like Retrieval-Augmented Generation (RAG) and training models on real-world JVM performance case studies, you can build an intelligent assistant that helps your engineering and SRE teams solve hard performance problems faster.
Retrieval-Augmented Generation (RAG) Using yCrash APIs
yCrash provides a suite of REST APIs to analyze JVM artifacts such as thread dumps, garbage collection logs, and heap dumps. These APIs can be triggered in production environments during outages. The APIs analyze the JVM artifacts and return an analysis report in JSON format. This response can then be stored in a vector database and used as input for Retrieval-Augmented Generation (RAG) on top of a popular pre-trained LLM (such as GPT-4 or Llama).
Let’s learn more about how to use yCrash APIs for RAG. There are multiple yCrash APIs; however, in this post, let’s discuss the thread dump analysis API. Once you gain a proper understanding of how the thread dump analysis API works, you can apply that knowledge to the other APIs as well, since they all work similarly.
The yCrash thread dump analysis API uses advanced ML algorithms and pattern recognition to detect anomalies in threads. If any anomalies are detected, they are reported in the JSON response under the ‘$.problem’ element with an appropriate severity level (i.e., FATAL, SEVERE, WARNING). Below is a sample ‘problem’ element from the API response:
"problem": [
  {
    "level": "SEVERE",
    "description": "ajp-bio-192.168.100.41-7078-exec-12 thread obtained com.compuware.apm.agent.introspection.jdbc.SimpleLruReflectionCache's lock & did not release it. Due to that the finalizer thread is BLOCKED. If Finalizer thread is blocked for a prolonged period, your application will experience memory problems.",
    "tag": "Thread High CPU Consumption"
  },
  {
    "level": "SEVERE",
    "description": "228 threads are BLOCKED on line #455 of java.security.SecureRandom file in nextBytes() method. If threads are BLOCKED for a prolonged period, your application may become unresponsive.",
    "tag": "Unresponsive Threads"
  },
  {
    "level": "SEVERE",
    "description": "161 threads are BLOCKED on line #455 of java.security.SecureRandom file in nextBytes() method. If threads are BLOCKED for a prolonged period, your application may become unresponsive.",
    "tag": "Unresponsive Threads"
  },
  {
    "level": "WARNING",
    "description": "94% of threads in the ajp-bio-192.168.100.41-7078-exec pool are BLOCKED. It can slow down the transactions. Examine their stack trace.",
    "tag": "Idle Threads"
  }
]
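As a sketch of how such a response might be consumed programmatically, the snippet below parses the ‘$.problem’ element and groups the reported anomalies by severity. The field names match the excerpt above; the shortened descriptions and everything else are illustrative:

```python
import json

# Sample '$.problem' element modeled on the yCrash API response shown above
# (descriptions shortened for readability)
response = json.loads("""
{
  "problem": [
    {"level": "SEVERE", "description": "228 threads are BLOCKED in nextBytes().", "tag": "Unresponsive Threads"},
    {"level": "WARNING", "description": "94% of threads in the pool are BLOCKED.", "tag": "Idle Threads"}
  ]
}
""")

# Group reported anomalies by severity so FATAL/SEVERE issues can be escalated first
by_level = {}
for problem in response.get("problem", []):
    by_level.setdefault(problem["level"], []).append(problem["tag"])

print(by_level)
# {'SEVERE': ['Unresponsive Threads'], 'WARNING': ['Idle Threads']}
```

A real integration would feed the full JSON response into the vector store instead of printing it; this sketch only shows that the severity levels are directly machine-readable.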
You can see that the API response reports performance anomalies in the threads, such as where threads are getting BLOCKED. Beyond this high-level detail, the API also points to the exact root cause of the performance problem in the application. Below is an excerpt from the API response, which identifies the exact thread, and the lines of code it is executing, that are causing the CPU to spike:
"threadResources": [
  {
    "threadName": "WebContainer : 18",
    "cpu": "60.9",
    "memory": "0.2",
    "stackTrace": "at com/xxxxilx/toolkit/sys/TDFConnection.processMessageFromServer(TDFConnection.java:955(Compiled Code))<br/>at com/xxxxilx/toolkit/sys/TDFConnection.handleResponse(TDFConnection.java:572)<br/>at com/xxxxilx/toolkit/sys/TDFConnection.access$200(TDFConnection.java:50)<br/>at com/xxxxilx/toolkit/sys/TDFConnection$InputStreamHandler.process(TDFConnection.java:246)<br/>at com/xxxxilx/toolkit/sys/TDFConnection$InputStreamHandler.run(TDFConnection.java:238)<br/>..."
  },
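The ‘threadResources’ element is just as easy to consume. A minimal sketch, using field names from the excerpt above with illustrative thread data, flags CPU-hungry threads for stack trace review (the 50% threshold is an arbitrary assumption, not part of the API):

```python
import json

# Sample 'threadResources' element modeled on the API excerpt above;
# 'cpu' and 'memory' are reported as percentage strings
response = json.loads("""
{
  "threadResources": [
    {"threadName": "WebContainer : 18", "cpu": "60.9", "memory": "0.2"},
    {"threadName": "WebContainer : 4",  "cpu": "1.2",  "memory": "0.1"}
  ]
}
""")

# Flag threads consuming more than 50% CPU as candidates for stack trace review
hot = [t["threadName"] for t in response["threadResources"] if float(t["cpu"]) > 50.0]
print(hot)  # ['WebContainer : 18']
```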
Training the AI Model for JVM Troubleshooting
To build an AI-powered JVM troubleshooting assistant, you need to train the model on high-quality, domain-specific data. Fortunately, yCrash offers rich content that can be leveraged for this purpose. The yCrash blog contains a wealth of content on JVM troubleshooting and performance engineering, more specifically on topics such as:
- Thread Dump Analysis Patterns
- …
yCrash also carries success stories describing how Fortune 500 enterprises confronted hard production performance problems and how they went about solving them. All this content can be used to train the LLM.
Creating a JVM Troubleshooting AI Assistant
Below are the steps to create a JVM troubleshooting AI assistant:
- Collect and Curate Training Data: Extract insights from yCrash blogs, technical papers, and past troubleshooting cases. The extracted data then has to be formatted into Q&A pairs mapping problems to their appropriate solutions.
- Fine-tune a Language Model: You can leverage any popular pre-trained LLM (such as GPT-4 or Llama) that is approved by your corporate security team. This model can then be fine-tuned on the JVM troubleshooting dataset created in step 1.
- Implement Retrieval-Augmented Generation (RAG): Store yCrash API responses and historical troubleshooting data in a vector database. This will enable the AI assistant to retrieve relevant insights before generating responses, ensuring it provides precise, context-aware troubleshooting recommendations.
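The retrieval step in point 3 can be sketched as follows. This is a toy illustration: a bag-of-words cosine similarity stands in for a real embedding model, and an in-memory list stands in for the vector database; the stored snippets are invented examples of the kind of yCrash analysis text you would index:

```python
from collections import Counter
import math

# Stand-in for a vector database of stored yCrash analysis snippets (illustrative text)
documents = [
    "228 threads are BLOCKED in java.security.SecureRandom nextBytes causing unresponsiveness",
    "Frequent full GC cycles caused by an undersized heap leading to long pause times",
    "Finalizer thread BLOCKED on a lock leading to memory problems",
]

def vectorize(text):
    # Toy stand-in for an embedding model: a bag-of-words term-frequency vector
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, k=1):
    # Rank stored snippets by similarity to the question; a real system would
    # query a vector database of embedded yCrash responses instead
    q = vectorize(query)
    return sorted(documents, key=lambda d: cosine(q, vectorize(d)), reverse=True)[:k]

print(retrieve("Why are my threads BLOCKED in SecureRandom?"))
```

The retrieved snippet is what gets injected into the LLM prompt, so the assistant answers from real diagnostic data rather than from its general training alone.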
Implementing an AI-Powered JVM Troubleshooting Assistant
By leveraging yCrash APIs and an AI-driven troubleshooting approach, you can create a chatbot or virtual assistant that provides real-time JVM insights. Here is an example scenario:
- An SRE or Developer asks a question:
“Why is my Java application experiencing poor response time?”
- The AI assistant retrieves insights from yCrash API responses (real-time thread dump and GC log analysis) and from yCrash blogs and case studies (historical problem resolutions).
- The AI assistant generates an answer, explaining the root cause and a suggested fix. Below is an example response from the AI assistant:
“Your Java application has 200 BLOCKED threads, most waiting on a database connection, which originates from the com.tier1app.academy.dao.findTotalAnswersCount() method. See whether any changes were made to this SQL query issued from this method in the recent release.”
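The answer-generation step in this scenario boils down to assembling a prompt from the retrieved context and the user's question, then handing it to the LLM. A minimal sketch (the LLM call itself is omitted, and the context snippet is invented to match the example above):

```python
def build_prompt(question, retrieved_snippets):
    # Combine retrieved yCrash analysis context with the user's question so the
    # LLM grounds its answer in real diagnostic data rather than guessing
    context = "\n".join(f"- {s}" for s in retrieved_snippets)
    return (
        "You are a JVM troubleshooting assistant.\n"
        "Diagnostic context from yCrash analysis:\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Explain the root cause and suggest a fix."
    )

prompt = build_prompt(
    "Why is my Java application experiencing poor response time?",
    ["200 threads are BLOCKED waiting on a database connection in com.tier1app.academy.dao.findTotalAnswersCount()"],
)
print(prompt)
```

In a complete assistant, `prompt` would be sent to whichever approved LLM you chose, and its reply would be returned to the SRE or developer.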
Conclusion
We hope this post gives you a high-level view of how to use yCrash APIs and technical content to build an AI assistant for troubleshooting JVM performance issues. If you would like to learn more, consider exploring the yCrash API documentation.

Share your Thoughts!