Today, businesses rely more and more on IT systems to remain competitive.
System outages can be expensive. The business may lose revenue, for example when an online selling platform is down. Even worse, unhappy clients may simply decide to take their custom elsewhere.
Modern systems can be highly complex, distributed across many devices and services. When they crash or perform badly, the pressure can be enormous not only to get them back up and running quickly, but also to find the root cause of the problem and proactively prevent it from recurring.
DevOps engineers need to be prepared to move fast when things go wrong. Tools to facilitate root cause analysis need to be in place, and engineers need to be skilled in using them.
In this article, we’ll look at what a root cause analysis toolbox should contain.
Root Cause Analysis in Java Systems
Let’s consider three types of production problem:
- Poor performance;
- System crashes;
- Unexplained JVM restarts.
Other types of problems, such as incorrect end results, require a code debugging strategy, which is not the focus of this article.
So many factors can affect performance, or cause crashes, that we can’t just throw more resources at the system and hope for the best. We need good information, not only about the behavior of the Java application, but also about the environment, to be able to find a permanent fix for an issue.
Many diagnostic tools are immediately available, both as part of the JDK and as part of the underlying operating system.
It’s always worth honing our skills with these tools, as they not only give us important data quickly, but also give us deep diagnostic information. However, trying to diagnose issues with just OS and JDK tools can be a long and tedious process.
In this article, we’ll look at seven external tools that give us fast and reliable insights into JVM performance problems and system crashes.
7 Essential DevOps Tools for Root Cause Analysis in Java Systems
Each of these seven tools plays its part in system monitoring and troubleshooting. When put together, they give invaluable insights into system behavior. The seventh tool, yCrash, can be used to integrate findings into an efficient root cause analysis.
1. Observability Tools
These keep tabs on the system as a whole and raise an alert if it’s not performing as it should. They extract metrics related to throughput, response times, memory and CPU usage. Ideally, the metrics should be visible in graph form, and the tool should raise an alert if performance degrades beyond a set threshold.
The open-source tools Prometheus and Grafana are an excellent combination for this purpose. We can configure Prometheus to gather metrics from various data sources. Grafana then takes these metrics and builds them into a configurable dashboard, letting us customize the tools to our own needs.
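As a minimal sketch of how a JVM can expose metrics for Prometheus to scrape, the snippet below uses the Prometheus Java simpleclient; it assumes the simpleclient_hotspot and simpleclient_httpserver libraries are on the classpath, and the port number is purely illustrative.

```java
import io.prometheus.client.exporter.HTTPServer;
import io.prometheus.client.hotspot.DefaultExports;

public class MetricsEndpoint {
    public static void main(String[] args) throws Exception {
        // Register the standard JVM collectors: memory, GC, threads, class loading, etc.
        DefaultExports.initialize();

        // Expose the collected metrics over HTTP so that Prometheus can scrape them,
        // e.g. at http://localhost:9091/metrics (the port is an assumption).
        HTTPServer server = new HTTPServer(9091);

        // A real application would carry on with its normal work here;
        // this sketch simply keeps the endpoint alive.
        Thread.currentThread().join();
    }
}
```

Prometheus is then configured with a scrape job pointing at this endpoint, and Grafana uses Prometheus as a data source for its dashboards.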

Fig: Customizable Grafana Graphics
2. Java Profilers
Profilers allow us to connect to a running JVM and take a look at critical performance factors such as memory, CPU and thread usage, garbage collection activity and more. They give us a quick insight into how the JVM is behaving right now, which makes them great for spot checks, but less useful for monitoring trends, since we only see a snapshot of what is happening at a given moment.
VisualVM is a good, lightweight profiler suitable for most organizations. It has several useful plugins to provide extra functionality. Large enterprises may find it worthwhile, however, to opt for paid tools such as YourKit or JProfiler.
Profilers are best run remotely, since they have a large overhead and may slow down production.
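Much of the data a profiler displays comes from the JVM’s own management interfaces. As a rough illustration, and not a substitute for a real profiler, the sketch below samples heap usage, live thread count and system load through the standard platform MXBeans:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

public class QuickSpotCheck {
    public static void main(String[] args) {
        // Heap usage as reported by the memory MXBean.
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        System.out.printf("Heap used: %d MB of %d MB max%n",
                heap.getUsed() / (1024 * 1024), heap.getMax() / (1024 * 1024));

        // Number of live threads, including daemon threads.
        int liveThreads = ManagementFactory.getThreadMXBean().getThreadCount();
        System.out.println("Live threads: " + liveThreads);

        // System load average over the last minute (-1.0 if not available on this platform).
        double load = ManagementFactory.getOperatingSystemMXBean().getSystemLoadAverage();
        System.out.println("System load average: " + load);
    }
}
```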

Fig: VisualVM
3. Garbage Collection Log Analyzer
Garbage collection (GC) activity is a good indicator of whether or not a system is performing optimally. If the GC is overloaded, response times degrade, and there may be long pauses during which the system appears to hang. Analyzing GC trends gives us important key performance indicators (KPIs), such as throughput, latency and footprint. It also tells us whether the heap is over- or under-sized, and whether we have a memory leak. GC logging should always be enabled in production, since it adds very little overhead.
Since the logs are text files, we could analyze them manually, but this would be very time-consuming. Tools such as GCeasy let us visualize GC trends, as well as offering tuning and performance recommendations. GarbageCat is another useful tool, but provides reports in text format rather than graphically.
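On Java 9 and later, GC logging is typically switched on with -Xlog:gc*:file=gc.log (older releases use -XX:+PrintGCDetails -Xloggc:gc.log), and the resulting log file is what these analyzers consume. As a rough sketch of one KPI such reports surface, the snippet below estimates the share of JVM uptime spent in garbage collection, using the standard GC MXBeans:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcOverhead {
    public static void main(String[] args) {
        long totalGcMillis = 0;

        // Sum collection counts and times across all collectors (e.g. young and old generation).
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: %d collections, %d ms total%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
            totalGcMillis += Math.max(0, gc.getCollectionTime()); // -1 means "not reported"
        }

        // Rough GC overhead: time spent collecting as a fraction of JVM uptime.
        long uptimeMillis = ManagementFactory.getRuntimeMXBean().getUptime();
        System.out.printf("Approximate GC overhead: %.2f%%%n",
                100.0 * totalGcMillis / uptimeMillis);
    }
}
```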

Fig: Section of GCeasy Report Showing KPI
4. Heap Dump Analyzer
The heap is the common storage area for objects held in RAM, and it’s managed by the JVM. Most, but not all, memory problems in Java relate to this region of memory. Heap dumps are a snapshot of the heap’s contents at a given moment, and they’re useful when we suspect memory issues. The dumps are large binary files, and we need a heap dump analyzer to interpret them.
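Heap dumps can be captured on demand with jmap -dump:live,format=b,file=heap.hprof <pid> or jcmd <pid> GC.heap_dump <file>, or written automatically when the JVM runs out of memory by adding the -XX:+HeapDumpOnOutOfMemoryError flag. They can also be triggered from inside the application; the sketch below uses the HotSpot diagnostic MXBean, with an illustrative output path:

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public class HeapDumper {
    public static void main(String[] args) throws Exception {
        // Obtain the HotSpot-specific diagnostic bean from the platform MBean server.
        HotSpotDiagnosticMXBean diagnostics =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);

        // Write a dump containing only live (reachable) objects.
        // The resulting .hprof file can be loaded into a heap dump analyzer.
        diagnostics.dumpHeap("/tmp/heap.hprof", true);
        System.out.println("Heap dump written to /tmp/heap.hprof");
    }
}
```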
HeapHero is a powerful tool that lets us explore the contents of the heap dump interactively. It also uses machine learning to diagnose issues and suggest tuning improvements. We can browse up and down the dominator tree to find details of parents and children of each object.
Additionally, it provides a detailed breakdown of memory wastage, a class histogram and much more.
Eclipse MAT is another popular alternative.

Fig: Interactive Largest Object Report by HeapHero
5. Thread Dump Analyzer
Thread dumps are a snapshot of thread activity. They’re particularly useful when performance is unacceptable or the system hangs. The dump contains the status and stack trace of each active thread. This includes both application threads and JVM threads, such as the GC threads.
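A thread dump is usually captured with jstack <pid> or jcmd <pid> Thread.print. It can also be taken from inside the JVM; the rough sketch below dumps every live thread and checks for deadlocks via the thread MXBean:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.Arrays;

public class ThreadDumper {
    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();

        // Capture the state and (truncated) stack trace of every live thread,
        // including details of locked monitors and ownable synchronizers.
        for (ThreadInfo info : threads.dumpAllThreads(true, true)) {
            System.out.print(info);
        }

        // Report any threads that are deadlocked on monitors or synchronizers.
        long[] deadlocked = threads.findDeadlockedThreads();
        if (deadlocked != null) {
            System.out.println("Deadlocked thread IDs: " + Arrays.toString(deadlocked));
        }
    }
}
```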
Tools such as fastThread let us explore the status of threads, details of locks held, stack traces, CPU consumption and more. fastThread also diagnoses potential issues, reports on deadlocks and offers hints for performance improvement. It identifies threads that share an identical stack trace, which may indicate a thread leak, and includes a flame graph showing where most of the processing time is spent.

Fig: Thread Summary by fastThread
6. Distributed Tracing
Modern systems may be spread over various platforms. Remote procedure calls, load balancing and containerized microservices are all common. When this is the case, it’s not enough to investigate a single JVM instance, since the problem may well be elsewhere.
Distributed tracing allows us to monitor requests as they pass through the system, indicating the time spent in each part of the process. We can then isolate which system or service is causing the bottleneck.
There are several good open-source distributed tracing tools available, including Jaeger and Zipkin.
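As a minimal sketch of what instrumenting a request can look like, the snippet below creates a span with the OpenTelemetry API. It assumes an OpenTelemetry SDK has already been configured to export spans to a backend such as Jaeger or Zipkin, and the service, span and attribute names are purely illustrative.

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public class CheckoutService {

    private final Tracer tracer = GlobalOpenTelemetry.getTracer("checkout-service");

    public void processOrder(String orderId) {
        // Start a span for this unit of work; it joins the distributed trace
        // propagated from the calling service.
        Span span = tracer.spanBuilder("process-order").startSpan();
        try (Scope scope = span.makeCurrent()) {
            span.setAttribute("order.id", orderId);
            // ... call the payment and inventory services here ...
        } finally {
            // Ending the span records its duration for the tracing backend.
            span.end();
        }
    }
}
```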

Fig: Zipkin Dashboard
7. Java Root Cause Analyzer
Production issues can be caused by many different things, and we need to take a holistic view when analyzing performance. Causes can include problems with:
- Application code;
- Network;
- Input/Output;
- CPU;
- Memory;
- JVM configuration;
- The host device or container;
- The operating system;
- Other applications running on the same device.
The yCrash tool takes all of these factors into account, analyzing 360° data as shown in the diagram below.

Fig: Data Analyzed by the yCrash Tool
The tool aggregates and analyzes data from many sources, and uses machine learning to search for root causes. It integrates with HeapHero, GCeasy and fastThread to provide detailed analytics.
Data can be gathered in two ways: as continuous low-overhead sampling of live systems, or as a one-off script that we can run as needed.
The continuous sampling approach is illustrated in the diagram below:

Fig: yCrash Architecture
The yCrash server has a dashboard where incidents such as poor performance are reported and can be investigated. This is not only invaluable in the event of a crash, but also allows us to proactively investigate potential issues before they affect users.
The one-off script is open-source, and we can download it from the yCrash-360 page on GitHub. Ideally, we would run it when performance is poor, and also incorporate it via JVM switches to be run in the event of a crash. It saves the 360° data in a zip file, which can either be uploaded to the yCrash server for analysis, or investigated using other tools.
Conclusion
Sooner or later, our production service will either experience performance problems, or it will crash. When this happens, we need to be prepared with a good set of tools so we can carry out root cause analysis.
In this article, we’ve looked at seven types of tools:
- Observability Tools
- Profilers
- GC log analyzers
- Heap dump analyzers
- Thread dump analyzers
- Distributed tracing tools
- Root cause analyzers
Each of these has an important part to play in keeping our systems running smoothly, and Java DevOps engineers need to develop skills with all of these tools.
