Troubleshooting Thread Leaks in Jenkins

When a core production CI/CD pipeline stalls, development velocity drops to zero. Fortunately, Jenkins is usually stable, but it’s a complex, highly concurrent system, so it’s not surprising that issues occasionally surface. When they do, it can be a panic situation, since DevOps teams depend on Jenkins to keep pipelines moving fast.

In this article, we’ll be specifically looking at thread leaks, which occur when threads aren’t terminated correctly and build up over time.

Jenkins thread leaks may cause the system to become unstable or hang. We’re likely to see performance degrading over time, and, in extreme cases, Jenkins may crash with the error OutOfMemoryError: unable to create native thread. The crash occurs when either the operating system’s thread limit is exceeded, or there isn’t enough native memory left to allocate the stack for a new thread.

Jenkins users rely on the application to build, test and deploy software upgrades in a timely manner. Production batch jobs also often run under Jenkins. Any delays or outages are therefore disruptive and costly, and need to be resolved quickly.

Common Symptoms of a Jenkins Thread Leak

If the leak occurs in the Jenkins agent, the most likely symptoms are:

  • Builds taking longer than usual;
  • Hung builds.

On the other hand, a thread leak in the controller may cause:

  • Slow responses;
  • Performance degrading over time;
  • GUI instability.

In both cases, the system may eventually crash with this error: OutOfMemoryError: unable to create native thread.

Restarting Jenkins usually solves the problem for a while, but it’s likely to recur.

Understanding the Root Causes of Jenkins Thread Leaks

Thread leaks occur in Java for several reasons:

  • Threads are created in a loop whose termination condition is never met, so more and more threads are spawned;
  • Threads never terminate. This may be because:
    • The thread itself includes a loop that doesn’t terminate;
    • The thread is blocked, waiting for a resource such as a lock, a database query or a network socket;
    • The thread handles exceptions poorly, for example by swallowing an interrupt and carrying on;
  • Unbounded thread pools may grow when the system is busy, and never shrink;
  • Thread executors are not shut down.
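As a contrast to threads that never terminate, here is a minimal sketch of a worker that can always be stopped. The class name StoppableWorker and its shutdown() method are hypothetical and not taken from Jenkins or any plugin, and the ten-minute sleep simply stands in for real periodic work. The point is that the loop has a reachable exit condition and that an interrupt is treated as a request to stop rather than being swallowed.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical worker that can always be stopped; not taken from Jenkins or any plugin. */
final class StoppableWorker extends Thread {

    // volatile so that a change made by another thread is visible to run()
    private volatile boolean running = true;

    StoppableWorker() {
        super("stoppable-worker");
        setDaemon(true); // a daemon thread never keeps the JVM alive on its own
    }

    @Override
    public void run() {
        while (running) {
            try {
                TimeUnit.MINUTES.sleep(10); // stand-in for real periodic work
            } catch (InterruptedException e) {
                // Treat an interrupt as a request to stop instead of swallowing it.
                Thread.currentThread().interrupt();
                return;
            }
        }
    }

    /** Asks the worker to stop and waits up to timeoutMillis; returns true if it has exited. */
    boolean shutdown(long timeoutMillis) {
        running = false;
        interrupt();
        try {
            join(timeoutMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return !isAlive();
    }
}
```

Calling shutdown() wakes the thread from its sleep immediately, so the caller doesn’t have to wait for the sleep to expire before the thread exits.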

Jenkins often stays online for very long periods. It may be months before the JVM is restarted. This magnifies tiny thread leaks into a real problem over time.

Also, the Jenkins architecture is complex, and relies heavily on multithreading, as you can see from the diagram below.

Fig: Overview of Jenkins Architecture

Within Jenkins, therefore, thread leaks have been known to occur in several areas:

  • Source Code Management. A common problem in this area occurred with older versions of the Subversion plugin.
  • Older Versions of Java. LDAP issues in Java versions before Java 17 caused very large numbers of threads to remain alive. The packages involved were:
    • com.sun.jndi.ldap.*
    • javax.naming.*
    • java.naming
  • Plugins, particularly those that include:
    • Asynchronous Tasks;
    • Custom Thread Pools;
    • Cloud APIs;
    • External system polling;
    • Monitoring;
    • Authentication;
    • Notifications.

How to Diagnose Thread Leaks in Jenkins

To demonstrate the diagnostic process, we took some code that deliberately causes a thread leak and incorporated it into the demo plugin from the Jenkins plugin tutorial. We then used yCrash tools to work through the diagnostic process.

1. Sample Code to Create a Thread Leak

Here is the code. Class ThreadLeakDemo creates a new thread of class ForeverThread every 100 milliseconds, inside a loop that continues until its setFlag() method is used to terminate it.

package com.buggyapp.threadleak;

/**
 * Creates an unbounded number of threads.
 *
 * @author Ram Lakshmanan
 */
public class ThreadLeakDemo {

    // volatile so that a change made by another thread is seen by the loop in start()
    private static volatile boolean flag = true;

    public static void setFlag(boolean newValue) {
        flag = newValue;
    }

    public static void start() {
        System.out.println("Thread App started");
        while (flag) {
            try {
                // Pause briefly between thread creations.
                Thread.sleep(100);
            } catch (InterruptedException e) {
                // Deliberately ignored: this demo keeps spawning threads regardless.
            }
            // Every iteration starts a thread that never finishes: this is the leak.
            new ForeverThread().start();
        }
    }

    public static void stop() {
        System.out.println("Thread leak problem terminated!");
    }
}

Class ForeverThread sleeps for 10 minutes inside a loop that continues until its setFlag() is used to terminate it.

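package com.buggyapp.threadleak;

public class ForeverThread extends Thread {

    // Shared by all ForeverThread instances; volatile so every thread sees updates
    private static volatile boolean flag = true;

    public static void setFlag(boolean newValue) {
        flag = newValue;
    }

    @Override
    public void run() {
        while (flag) {
            try {
                // Sleep for 10 minutes, then check the flag again.
                Thread.sleep(10 * 60 * 1000);
            } catch (InterruptedException e) {
                // Deliberately ignored: the thread stays alive even if interrupted.
            }
        }
    }
}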

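This rapidly creates a huge thread leak: start() spawns roughly ten new threads per second, and none of them exit while the flag remains true.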

2. Quick Checks for Possible Thread Leaks

The first step in any Jenkins troubleshooting is to decide whether the problem lies with the controller or with an agent. If builds are hanging or running slowly, the problem is usually, although not always, in the agent, whereas if the GUI is unresponsive, it’s probably a problem with the controller.

We can check very quickly how many threads are running in a Jenkins instance with the Linux ps -o nlwp <pid> command, where <pid> is the process ID of the Jenkins JVM. The output below shows the thread count returned when the command was run against the instance containing the buggy code.

~$ ps -o nlwp 894
NLWP
2300
~$

Unless the Jenkins instance is unusually large, with many agents and plugins, we wouldn’t expect the thread count to be this high. A standard, healthy Jenkins controller managing moderate workloads typically operates between 150 and 400 active threads. 

It’s a good idea to check this from time to time. This will give you an idea of how many threads your instance has under normal conditions, and whether the number is increasing.

We can get the same information under Windows by opening Task Manager, selecting the Details tab, right-clicking any column header and choosing Select columns. Tick the ‘Threads’ box, and an extra column will appear showing the number of threads for each process. This is illustrated in the screenshot below.

Fig: View Number of Threads in Windows
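The thread count can also be read from inside the JVM itself. The sketch below uses the standard ThreadMXBean from java.lang.management. The class name ThreadCountProbe is hypothetical, and how the code gets run inside the Jenkins JVM (for example from plugin code) is left as an assumption. It reports Java threads only, so the figure is typically lower than the operating system count, which also includes the JVM’s internal native threads.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

/** Hypothetical helper; reports thread counters for the JVM it runs in. */
final class ThreadCountProbe {

    /** Returns a one-line summary of the JVM's thread counters. */
    static String summary() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        return String.format(
                "live=%d daemon=%d peak=%d totalStarted=%d",
                threads.getThreadCount(),
                threads.getDaemonThreadCount(),
                threads.getPeakThreadCount(),
                threads.getTotalStartedThreadCount());
    }

    /** Returns the number of live Java threads (daemon and non-daemon). */
    static int liveThreads() {
        return ManagementFactory.getThreadMXBean().getThreadCount();
    }

    public static void main(String[] args) {
        System.out.println(summary());
    }
}
```

If the live count keeps climbing while totalStarted climbs at the same pace, threads are being created but not finishing, which is the pattern we’d expect from a leak.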

3. Obtaining Diagnostics

The next step is to take a thread dump. This is a snapshot of the state of all threads running in the JVM at a given moment. It’s a text file, so we could analyze it manually, but it’s likely to be very long and this would be time-consuming. It’s much quicker to load it into a thread dump analyzer, such as fastThread.

Alternatively, we could use the free open-source yc-360 script, which captures a full range of diagnostics, including three thread dumps taken at intervals. This is useful for several reasons: 

  • If we need further diagnostics after analyzing the thread dump, we have them on hand;
  • We can see what’s happening in the device or container, and the network and I/O statistics at the time of the extract, so we can correlate external and internal JVM states;
  • Comparing the three thread dumps lets us see trends;
  • We can take the entire bundle and load it into the yCrash tool for root cause analysis.

In this example, after seeing that the thread count within the controller was high, we used yc-360 to capture diagnostics, then loaded the bundle into the yCrash dashboard.
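To make the idea of a thread dump more concrete, here is a rough sketch of the kind of information one records: each thread’s name, state and stack. It is not how yc-360 or fastThread work, and the output format is simplified. In practice, dumps are usually taken with JDK tools such as jstack or jcmd. The class name SimpleThreadDump is hypothetical.

```java
import java.util.Map;

/** Hypothetical helper that prints a simplified thread dump of the current JVM. */
final class SimpleThreadDump {

    /** Builds a plain-text snapshot of every live Java thread and its stack. */
    static String capture() {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> entry : Thread.getAllStackTraces().entrySet()) {
            Thread thread = entry.getKey();
            out.append('"').append(thread.getName()).append('"')
               .append(thread.isDaemon() ? " daemon" : "")
               .append(" state=").append(thread.getState())
               .append('\n');
            // Frames are listed from the most recent call downwards.
            for (StackTraceElement frame : entry.getValue()) {
                out.append("    at ").append(frame).append('\n');
            }
            out.append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(capture());
    }
}
```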

4. Analyzing the Diagnostics

The yCrash tool gave us the following interactive summary:

Fig: yCrash Root Cause Analysis Summary

Looking at this report, we can see that the thread count was high in each of the three dumps, and that it increased from one dump to the next. This almost certainly indicates a thread leak. The last entry shows that 1366 threads have an identical stack trace: a strong indicator that they may have been created in a non-terminating loop. We can click the link at the end to see the stack trace of the suspect threads. This takes us to the following section of the fastThread report, which includes the stack traces:

Fig: Stack Trace of Identical Threads

To a developer, the stack trace is an invaluable clue as to where to look for the problem. It shows the class and package name and the exact line of code that’s just been executed. This points us directly to the problematic package: com.buggyapp.threadleak, which is our sample code.

The package name helps us determine whether the problem is happening within Jenkins, or within a plugin. Since Jenkins and all the official plugins are open source, we can view the actual code if we need to.

The table below shows the package naming conventions and the location of the source code.

Component       Packages                    Repository
Jenkins Core    hudson.model.*              Jenkins repository
                jenkins.model.*             Jenkins repository
Jenkins Agent   hudson.remoting.*           Remoting repository
                org.jenkinsci.remoting.*    Remoting repository
Plugins         Various                     https://github.com/jenkinsci/<plugin-name>-plugin
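The lookup in the table can be sketched as a small helper that takes the class name from a stack frame and reports which component probably owns it. The class name StackOrigin is hypothetical, and it only knows the prefixes listed above, so anything else is reported as plugin or other code and still needs checking by hand.

```java
/** Hypothetical helper that applies the package conventions from the table above. */
final class StackOrigin {

    /** Maps a fully qualified class name to the component that likely owns it. */
    static String classify(String className) {
        if (className.startsWith("hudson.remoting.")
                || className.startsWith("org.jenkinsci.remoting.")) {
            return "Jenkins agent (Remoting repository)";
        }
        if (className.startsWith("hudson.model.")
                || className.startsWith("jenkins.model.")) {
            return "Jenkins core (Jenkins repository)";
        }
        // Not one of the known prefixes: most likely a plugin or third-party library.
        return "Plugin or other code";
    }
}
```

For the sample leak, classify("com.buggyapp.threadleak.ForeverThread") falls through to the last branch, which matches the fact that the code was added through a plugin.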

We can also view other fastThread diagnostics by clicking the ‘Threads’ button on the left of the root cause summary. Here are some sample sections of the report.

Fig: Thread Summary

The thread summary shows us at a glance the number and status of threads in the application. We can click on any section to see details of the threads in each state.

Fig: Last Executed Methods

The section in this screenshot groups the threads by the last method they executed. Again, the fact that 963 threads have the same last method indicates a possible thread leak.
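The grouping itself is simple to reproduce. The sketch below counts threads by the method at the top of each stack. It is an illustration of the idea rather than fastThread’s implementation, and the class name LastMethodGrouping is hypothetical. Using the whole stack as the key instead of the top frame would give the identical-stack-trace grouping seen earlier.

```java
import java.util.Collection;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper that groups thread stacks by their last executed method. */
final class LastMethodGrouping {

    /** Counts threads by the method at the top of each stack (the last one executed). */
    static Map<String, Integer> countByTopFrame(Collection<StackTraceElement[]> stacks) {
        Map<String, Integer> counts = new TreeMap<>();
        for (StackTraceElement[] stack : stacks) {
            if (stack.length == 0) {
                continue; // some threads report no Java frames
            }
            StackTraceElement top = stack[0];
            String key = top.getClassName() + "." + top.getMethodName();
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Apply the grouping to the JVM this code runs in.
        countByTopFrame(Thread.getAllStackTraces().values())
                .forEach((method, count) -> System.out.println(count + "  " + method));
    }
}
```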

For another worked example of diagnosing thread leaks, see Troubleshooting Spring Boot Thread Leaks.

How to Fix Thread Leaks in Jenkins Core and Plugins

The right fix depends on whether you’re managing a Jenkins instance or developing a plugin, but the following best practices can help eliminate thread leaks and prevent them from recurring.

If you’re an administrator:

  • Restarting Jenkins will temporarily fix the problem while you investigate. This won’t be a long-term fix.
  • Make sure all plugins are upgraded to the latest version. Much work has been done to eliminate thread leaks over the years.
  • Try removing recently added plugins to see if it fixes the problem.
  • Upgrading to the latest Jenkins and Java version 17 or above fixes many of the earlier problems.
  • Check whether threads are blocking on the network or on I/O, and if so, add resources as needed.
  • If none of these suggestions help, use the package name to determine the rogue code, and put in a bug report to the developers.

If you’re a plugin developer:

  • Make sure that any loops that create threads always have a valid terminator, even under unexpected conditions.
  • Use timeouts on any blocking operations to prevent threads from hanging indefinitely.
  • Limit the size of thread pools, so they can’t grow indefinitely. The method Executors.newCachedThreadPool() creates an unbounded thread pool, and we should avoid it. Instead, we can use Executors.newFixedThreadPool() to limit its maximum size.
  • Ensure all threads terminate at the right time.
  • Use good exception handling techniques. Always use a finally with try … catch blocks, and use it to clean up unneeded threads and resources.
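Several of these practices can be combined in one place. The following is a minimal sketch, assuming a hypothetical helper called BoundedTaskRunner: it uses a fixed-size pool, puts a timeout on the blocking call, and shuts the pool down in a finally block. A real plugin would more likely keep one long-lived pool and shut it down when the plugin stops. The pool is created per call here only to keep the example self-contained.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper showing a bounded pool, a timeout and guaranteed shutdown. */
final class BoundedTaskRunner {

    /**
     * Runs one task on a small fixed pool, waits at most timeoutMillis for it,
     * and always shuts the pool down. Returns fallback if the task fails or times out.
     */
    static <T> T runWithTimeout(Callable<T> task, long timeoutMillis, T fallback) {
        ExecutorService pool = Executors.newFixedThreadPool(2); // bounded: never more than 2 threads
        try {
            Future<T> future = pool.submit(task);
            try {
                return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                future.cancel(true); // interrupt the task so its thread can exit
                return fallback;
            } catch (ExecutionException e) {
                return fallback; // the task itself threw an exception
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return fallback;
            }
        } finally {
            pool.shutdownNow(); // runs on every path, so worker threads are not left behind
        }
    }
}
```

Note that cancelling only helps if the task responds to interruption. A task that ignores interrupts, like ForeverThread above, would still keep its thread alive.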

Can Jenkins Thread Leaks be Predicted?

We should always monitor our systems regularly to make sure they remain healthy. From time to time, take a thread dump, and compare it to previous ones. If the number of threads is growing between dumps, it indicates a possible thread leak, unless the workload has increased significantly.

This allows us to investigate in a non-panic situation and fix the problem before it affects our workflow.
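The comparison between samples can be automated with a very simple check. The sketch below is a deliberately naive heuristic, not how the yCrash agent works: it flags a series of thread counts when every sample is higher than the previous one and the overall rise exceeds a threshold. The class name ThreadTrend and the threshold are assumptions, and a growing workload can trigger it just as a leak can.

```java
import java.util.List;

/** Hypothetical heuristic for spotting steadily growing thread counts. */
final class ThreadTrend {

    /**
     * Returns true when every sample is higher than the one before it and the
     * total rise is at least minIncrease.
     */
    static boolean looksLikeLeak(List<Integer> threadCounts, int minIncrease) {
        if (threadCounts.size() < 3) {
            return false; // too few samples to call it a trend
        }
        for (int i = 1; i < threadCounts.size(); i++) {
            if (threadCounts.get(i) <= threadCounts.get(i - 1)) {
                return false; // the count dropped or stayed flat at some point
            }
        }
        int rise = threadCounts.get(threadCounts.size() - 1) - threadCounts.get(0);
        return rise >= minIncrease;
    }
}
```

With made-up samples of 310, 650, 1400 and 2300 threads and a threshold of 100, the check returns true, whereas a series that rises and falls, such as 300, 320, 310 and 305, does not trigger it.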

A good option is to have the yCrash agent sample metrics regularly. If it detects any unhealthy changes to the system behavior, it takes a full set of diagnostics and alerts the administrator.

Conclusion

Jenkins has a large number of concurrent tasks and hundreds of available plugins for various purposes. It’s therefore always possible that you will experience a Jenkins thread leak. Even the smallest leak can become a problem because the server runs for a long time without shutting down.

It’s therefore a good idea to monitor system health regularly and investigate developing problems before they crash the server or cause poor performance.
