Jenkins CPU Spike: Causes, Troubleshooting & Proven Fixes


CPU spikes are like psychotic killers. They become vicious for no apparent reason, and often without warning. A Jenkins CPU spike freezes user GUI activity and hangs jobs, just when you need to get your test results quickly. Sometimes the system recovers. At other times, a Jenkins restart is inevitable – and even then, there’s no guarantee the problem won’t recur.

What causes this to happen? How can we fix it? Can we predict CPU spikes before they wreak havoc with our workflow?

Let’s explore these questions.

Common Symptoms of a Jenkins CPU Spike

Typically, the Jenkins system works perfectly, then CPU usage suddenly spikes and it becomes almost unusable:

The GUI becomes unresponsive;
Builds are delayed;
Logins may be slow or even totally frozen;
Jenkins startup is sometimes slow;
The device or container as a whole may become unresponsive.

You may even see this:

Fig: Jenkins Failure

This behavior may be intermittent, with the system recovering spontaneously, only to exhibit the same behavior later. Often, restarting Jenkins or even the entire device may be the only way to get the workflow moving again.

What are the Root Causes of CPU Spikes in Jenkins?

In highly concurrent applications such as Jenkins, the operating system must try to allocate CPU cores fairly between threads. Most threads block naturally from time to time when they are waiting for external events or slower devices. This allows other threads access to the CPU. However, threads carrying out tasks that work with RAM only don’t naturally block, but have to be pre-empted by the operating system.

CPU spikes occur during memory-based tasks such as:

A tight loop that doesn’t terminate;
CPU-heavy tasks such as:
- Compression and decompression;
- Indexing;
- Processing large collections;
- Heavy computations;
- Garbage collection.

In Jenkins, there are two areas we need to look at when CPU usage is excessive.

1. Garbage Collection (GC) Performance

Normally, the GC works unobtrusively in the background, making sure there is always enough memory available for new requests. However, if there is too much contention for memory, GC has to work overtime. Since its tasks are very CPU-heavy, this leads to CPU spikes. During critical GC phases, all other threads are paused in what is known as a JVM stop-the-world event. If the situation doesn’t resolve, there comes a point where the entire JVM is dedicated to GC, and the application is not able to do any work. This may happen for several reasons:

The application has a memory leak, where objects aren’t being released in a timely manner;
The heap size is under-configured;
The application suffers from object churn, where objects are created and released faster than the GC can clean them up;
Too many CPU cores have been allocated to the GC.

2. Jenkins Controller, Agent, or Plugin Performance

If GC is not the problem, we need to look at the Jenkins ecosystem. Issues that have been known to cause system freezes due to CPU spikes are summarized below:

Complex pipelines, particularly with many parallel branches, can result in CPS (Continuation-passing style) overload;
Complex Groovy scripting, which may contain unterminated loops or other CPU-heavy constructs;
Large transfers;
Excessive console output;
Excessive agent reconnects;
Inefficient source code management;
Heavy load on the build queue;
File system scans, especially where triggers depend on file system states;
Excessive authentication attempts;
Compression or decompression activity.

How to Troubleshoot CPU Spikes in Jenkins

In this section, we’ll look at the most important artifacts and tools for diagnosing the root cause of CPU spikes. We’ll then work through two examples:

A deliberately-induced CPU spike caused by a faulty plugin;
GC overload caused by deliberately setting the Jenkins heap size too low.

1. What Data Should You Collect During a Jenkins CPU Spike?

The three most important artifacts for diagnosing CPU spikes are:

a) Operating system live diagnostics showing CPU usage by thread

Both Linux and Windows let us view CPU usage by thread.

In Linux, we can do this using the following command:

top -H -p <PID>

where PID is the process ID of the Jenkins application. This gives us an interactive screen as shown below:

Fig: Top CPU-using Threads in Linux

Here, the PID column is actually the native thread ID. We can easily see which threads are using the most CPU, and match this to a thread dump.

In Windows, Task Manager’s Details tab lists CPU usage per process. Right-clicking a column header and choosing Select columns lets us add a Threads column, but this shows only the number of threads in each process. For CPU usage by individual thread, a tool such as Sysinternals Process Explorer is needed.
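Where no operating system tool is at hand, the JVM can report accumulated CPU time per thread through the standard java.lang.management API. The following is a minimal sketch rather than a replacement for top: it only sees the JVM it runs in, so for Jenkins it would have to be executed inside the Jenkins process (how you do that is an assumption about your setup). The class and method names are hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not part of Jenkins or yCrash.
final class ThreadCpuReport {

    // Returns "<cpu ms>  <thread name>" lines for the busiest live threads,
    // highest accumulated CPU time first.
    static List<String> topThreads(int limit) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        List<long[]> rows = new ArrayList<>(); // each row: {threadId, cpuNanos}
        if (mx.isThreadCpuTimeSupported() && mx.isThreadCpuTimeEnabled()) {
            for (long id : mx.getAllThreadIds()) {
                long nanos = mx.getThreadCpuTime(id); // -1 if the thread has died
                if (nanos >= 0) {
                    rows.add(new long[] {id, nanos});
                }
            }
        }
        rows.sort(Comparator.comparingLong((long[] r) -> r[1]).reversed());

        List<String> out = new ArrayList<>();
        for (long[] r : rows.subList(0, Math.min(limit, rows.size()))) {
            ThreadInfo info = mx.getThreadInfo(r[0]); // null if no longer alive
            String name = (info == null) ? "<terminated>" : info.getThreadName();
            out.add((r[1] / 1_000_000) + " ms  " + name);
        }
        return out;
    }

    public static void main(String[] args) {
        topThreads(10).forEach(System.out::println);
    }
}
```

The figures are totals since each thread started, so two readings taken a few seconds apart are needed to see which threads are consuming CPU right now.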

b) Thread dump in Jenkins

The thread dump is our most important artifact. It’s a snapshot of the current status of all the threads in the JVM. There are several ways to obtain one, including the JDK’s jstack <PID> and jcmd <PID> Thread.print commands.

The dump is a text file that contains an entry for every thread, including the total CPU time, the thread state, the stack trace and the native thread ID (labelled nid), which can be matched back to the output of the top command.

In a system the size of Jenkins, it will be very long, and analyzing it manually would be both time-consuming and tedious. Also, we’ll need to match it back to the CPU-consuming threads from top. To make it more difficult, top shows the native thread ID in decimal, whereas a thread dump shows it in hexadecimal. Luckily, there are several good thread dump analyzers available. In our worked examples, we’ll be using fastThread. This tool aggregates the thread dump into a meaningful report, including various graphs and charts. It also conveniently shows the native thread ID in decimal to match the top command.
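When matching by hand, the decimal-to-hexadecimal conversion is a one-liner. The sketch below illustrates it; the class and method names are hypothetical, and the thread ID in main is an invented example value, not one taken from the output in this article.

```java
// Hypothetical helper: converts between the decimal thread ID shown by
// top -H and the hexadecimal nid described above for thread dumps.
final class NidConverter {

    // Decimal thread ID from top -> "0x..." form used for nid
    static String toNid(long decimalThreadId) {
        return "0x" + Long.toHexString(decimalThreadId);
    }

    // "0x..." nid from a thread dump -> decimal thread ID as shown by top
    static long fromNid(String nid) {
        String hex = nid.startsWith("0x") ? nid.substring(2) : nid;
        return Long.parseLong(hex, 16);
    }

    public static void main(String[] args) {
        long tid = 4660; // invented example value
        System.out.println(tid + " -> nid=" + toNid(tid));     // 4660 -> nid=0x1234
        System.out.println("0x1234 -> " + fromNid("0x1234"));  // 0x1234 -> 4660
    }
}
```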

c) Jenkins GC logs

In live systems, GC logs should always be enabled. They’re invaluable for evaluating GC performance and establishing whether CPU spikes are caused by GC. Using a GC log analyzer such as GCeasy, we can instantly see how much CPU the GC is consuming.
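For a quick first impression before opening the GC log, the JVM’s standard management beans expose the accumulated collection time. The sketch below is a rough indicator only, with hypothetical class and method names: it reports elapsed collection time as a share of uptime, not the CPU consumed by GC threads, and it only sees the JVM it runs in.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper, not part of Jenkins, GCeasy or yCrash.
final class GcOverhead {

    // Percentage of elapsed time spent in GC, given total GC time and uptime.
    static double percent(long gcMillis, long uptimeMillis) {
        return uptimeMillis <= 0 ? 0.0 : 100.0 * gcMillis / uptimeMillis;
    }

    // Sums the approximate accumulated collection time over all collectors.
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 if undefined for this collector
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        long uptime = ManagementFactory.getRuntimeMXBean().getUptime();
        long gcTime = totalGcMillis();
        System.out.printf("GC time: %d ms of %d ms uptime (%.2f%%)%n",
                gcTime, uptime, percent(gcTime, uptime));
    }
}
```

A value that climbs steadily between readings suggests GC is worth investigating; the GC log remains the authoritative source.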

Since the three artifacts need to be used to analyze the exact time frame when the spike occurred, it’s useful to be able to gather them all at once with a single tool, along with other useful information about the JVM and its environment. The ideal tool for this is the free open-source yc-360 script.

Once downloaded from GitHub, you can unzip the script and run it as needed. The command line options are documented here.

The script creates a range of artifacts in a zip file. You can either analyze them with your own tools or upload the bundle to the yCrash dashboard for automated root cause analysis via the ‘Upload Bundle’ link at the top right.

2. Case Study: Diagnosing a Jenkins CPU Spike Caused by a Plugin

To illustrate how to troubleshoot CPU spikes within Jenkins, we took the code sample below, which contains threads in a tight loop and causes CPU usage to shoot up dramatically.

The class CPUSpikeDemo launches a new CPUSpikerThread every two minutes until it has started the configured number of threads (six by default, or the value of the buggyApp.CPUCycles system property).

			
package com.buggyapp.cpuspike;

import com.buggyapp.util.StringUtil;

/**
 * Launches CPU-spiking threads at two-minute intervals.
 *
 * @author Ram Lakshmanan
 */
public class CPUSpikeDemo {

	// System property that overrides the number of threads to launch
	public static final String NUMBER_OF_CPU_CYCLES = "buggyApp.CPUCycles";

	public static void start() throws InterruptedException {

		int noOfCycles = 6; // default number of threads

		if (StringUtil.isValid(System.getProperty(NUMBER_OF_CPU_CYCLES))) {
			try {
				noOfCycles = Integer.parseInt(System.getProperty(NUMBER_OF_CPU_CYCLES));
			} catch (NumberFormatException e) {
				System.out.println("Failed to parse buggyApp.CPUCycles");
			}
		}

		int counter = 0;

		// Start one spinning thread per cycle, two minutes apart
		while (counter < noOfCycles) {
			new CPUSpikerThread().start();
			Thread.sleep(2 * 60 * 1000);
			counter++;
		}
		System.out.println(noOfCycles + " threads launched!");
	}

	public static void stop() {

		// Clear the shared flag so that every running CPUSpikerThread leaves its loop.
		// (Calling stop() on a new, unstarted thread would not affect the running ones.)
		CPUSpikerThread.setFlag(false);
		System.out.println("CPU spike terminated!");
	}
}

		

The class CPUSpikerThread repeatedly calls a method that does a simple calculation, resulting in a tight loop that causes CPU usage to spike.

			
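package com.buggyapp.cpuspike;

/**
 * Spins in a tight loop until the shared flag is cleared.
 *
 * @author Ram Lakshmanan
 */
public class CPUSpikerThread extends Thread {

	// volatile so that a change made by setFlag() is visible to running threads
	private static volatile boolean flag = true;

	public static void setFlag(boolean newValue) {
		flag = newValue;
	}

	@Override
	public void run() {

		// Tight loop: never blocks, so it keeps a CPU core busy
		while (flag) {
			doSomething();
		}
	}

	public static void doSomething() {
		// Does a trivial calculation
		int a = 2 + 2;
	}
}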

		

We included the package com.buggyapp as a third-party library in an amended version of the Jenkins plugin tutorial’s demo application. When activated in Jenkins, it caused a CPU spike.

The output from the top -H -p command looked like this:

			
top - 11:20:50 up  1:43,  4 users,  load average: 12.11, 11.70, 8.08
Threads:  58 total,  12 running,  46 sleeping,   0 stopped,   0 zombie
%Cpu(s): 87.3 us, 12.7 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :   3357.5 total,    238.3 free,   1134.8 used,   1984.4 buff/cache
MiB Swap:   2048.0 total,   2042.2 free,      5.8 used.   1928.9 avail Mem 
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
jenkins   20   0 4644172 593472  31760 R  25.0  17.3  10:54.58 Thread-+
jenkins   20   0 4644172 593472  31760 R  25.0  17.3   7:12.33 Thread-+
jenkins   20   0 4644172 593472  31760 R  25.0  17.3   4:42.78 Thread-+
jenkins   20   0 4644172 593472  31760 R  25.0  17.3   4:01.22 Thread-+
jenkins   20   0 4644172 593472  31760 R  18.8  17.3  11:10.39 Thread-+
jenkins   20   0 4644172 593472  31760 R  18.8  17.3   9:11.48 Thread-+
jenkins   20   0 4644172 593472  31760 R  18.8  17.3   7:01.86 Thread-+
jenkins   20   0 4644172 593472  31760 R  18.8  17.3   5:53.48 Thread-+
jenkins   20   0 4644172 593472  31760 R  18.8  17.3   4:51.11 Thread-+
jenkins   20   0 4644172 593472  31760 R  12.5  17.3   8:53.55 Thread-+
jenkins   20   0 4644172 593472  31760 R  12.5  17.3   5:42.96 Thread-+
jenkins   20   0 4644172 593472  31760 R  12.5  17.3   3:55.93 Thread-+
jenkins   20   0 4644172 593472  31760 S   0.0  17.3   0:00.01 java
jenkins   20   0 4644172 593472  31760 S   0.0  17.3   0:04.03 java
jenkins   20   0 4644172 593472  31760 S   0.0  17.3   0:00.48 GC Thre+
jenkins   20   0 4644172 593472  31760 S   0.0  17.3   0:00.00 G1 Main+
jenkins   20   0 4644172 593472  31760 S   0.0  17.3   0:03.02 G1 Conc+
jenkins   20   0 4644172 593472  31760 S   0.0  17.3   0:01.57 G1 Refi+
jenkins   20   0 4644172 593472  31760 S   0.0  17.3   0:00.08 G1 Serv+
jenkins   20   0 4644172 593472  31760 S   0.0  17.3   0:01.57 VM Peri+
jenkins   20   0 4644172 593472  31760 S   0.0  17.3   0:00.27 VM Thre+
jenkins   20   0 4644172 593472  31760 S   0.0  17.3   0:00.00 Referen+
jenkins   20   0 4644172 593472  31760 S   0.0  17.3   0:00.00 Finaliz+
jenkins   20   0 4644172 593472  31760 S   0.0  17.3   0:00.00 Signal +
jenkins   20   0 4644172 593472  31760 S   0.0  17.3   0:00.05 Service+
jenkins   20   0 4644172 593472  31760 S   0.0  17.3   0:00.29 Monitor+
jenkins   20   0 4644172 593472  31760 S   0.0  17.3   0:32.09 C2 Comp+
jenkins   20   0 4644172 593472  31760 S   0.0  17.3   0:10.26 C1 Comp+
jenkins   20   0 4644172 593472  31760 S   0.0  17.3   0:00.00 Notific+
jenkins   20   0 4644172 593472  31760 S   0.0  17.3   0:00.00 Common-+
jenkins   20   0 4644172 593472  31760 S   0.0  17.3   0:00.44 GC Thre+
jenkins   20   0 4644172 593472  31760 S   0.0  17.3   0:00.59 GC Thre+
jenkins   20   0 4644172 593472  31760 S   0.0  17.3   0:00.49 GC Thre+
jenkins   20   0 4644172 593472  31760 S   0.0  17.3   0:00.19 Jetty (+
jenkins   20   0 4644172 593472  31760 S   0.0  17.3   0:00.01 Jetty (+
jenkins   20   0 4644172 593472  31760 S   0.0  17.3   0:02.03 Jetty (+
jenkins   20   0 4644172 593472  31760 S   0.0  17.3   0:00.36 Jetty (+
jenkins   20   0 4644172 593472  31760 S   0.0  17.3   0:04.67 Jetty (+
jenkins   20   0 4644172 593472  31760 S   0.0  17.3   0:00.29 Schedul+
jenkins   20   0 4644172 593472  31760 S   0.0  17.3   0:00.00 Java2D +
jenkins   20   0 4644172 593472  31760 S   0.0  17.3   0:00.05 Launche+
jenkins   20   0 4644172 593472  31760 S   0.0  17.3   0:00.00 Jenkins+
jenkins   20   0 4644172 593472  31760 S   0.0  17.3   0:01.49 jenkins+
jenkins   20   0 4644172 593472  31760 S   0.0  17.3   0:00.01 JNA Cle+
jenkins   20   0 4644172 593472  31760 S   0.0  17.3   0:00.10 jenkins+
jenkins   20   0 4644172 593472  31760 S   0.0  17.3   0:00.12 jenkins+
jenkins   20   0 4644172 593472  31760 S   0.0  17.3   0:00.00 hudson.+
jenkins   20   0 4644172 593472  31760 S   0.0  17.3   0:00.07 jenkins+
jenkins   20   0 4644172 593472  31760 S   0.0  17.3   0:00.15 jenkins+
jenkins   20   0 4644172 593472  31760 S   0.0  17.3   0:00.08 jenkins+
jenkins   20   0 4644172 593472  31760 S   0.0  17.3   0:00.10 jenkins+
jenkins   20   0 4644172 593472  31760 S   0.0  17.3   0:00.09 jenkins+
jenkins   20   0 4644172 593472  31760 S   0.0  17.3   0:00.09 jenkins+
jenkins   20   0 4644172 593472  31760 S   0.0  17.3   0:00.11 jenkins+
jenkins   20   0 4644172 593472  31760 S   0.0  17.3   0:00.10 Jetty (+
jenkins   20   0 4644172 593472  31760 S   0.0  17.3   0:00.34 Jetty (+
jenkins   20   0 4644172 593472  31760 S   0.0  17.3   0:00.19 Jetty (+
jenkins   20   0 4644172 593472  31760 S   0.0  17.3   0:00.00 Attach +

		

We then used the yc-360 script to gather the full range of artifacts, and we uploaded the bundle to yCrash for root cause analysis. This gave us the following summary:

Fig: yCrash Root Cause Analysis

The yCrash tool has detected abnormally high CPU usage in the device, and identifies the most likely cause as the 12 looping threads within the CPUSpikerThread class. This is exactly what we would expect. The summary contains links to a fastThread report where we can see details of these threads. Clicking the first link gives us the stack traces.

Fig: Stack Trace of Looping Threads

Clicking the second link, ‘time-consuming threads’, gives this report:

Fig: CPU Consuming Threads

If we had used fastThread directly instead of yCrash, we could have matched the runnable threads to the output of the top command by native thread ID, using this section of the fastThread report:

Fig: fastThread Showing Threads Runnable Between Dumps

We would still have isolated the culprit threads, but using yCrash is quicker.

Once we have the stack trace of the problematic thread, we can use the package names of the classes it shows to identify which area of Jenkins is likely to be the source of the problem.

The table below shows how to use package names to indicate the areas to be investigated.

Area	Packages
Jenkins Core	hudson.model.*, jenkins.model.*
Jenkins Agent	hudson.remoting.*, org.jenkinsci.remoting.*
Plugins	Names containing .plugin (for example, org.jenkinsci.plugins.*)
3rd Party Libs	Various

In our case, the package indicates that the problem is in a third-party library, since it doesn’t match any of the Jenkins package naming patterns.
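The lookup in the table above can be sketched as a small function. This is only an illustration: the class and method names are hypothetical, the prefixes are the ones listed in the table, and the ".plugin" substring check is a loose heuristic that may misclassify some packages.

```java
// Hypothetical helper: maps a class name from a stack trace to the area
// of Jenkins suggested by the package table.
final class StackFrameClassifier {

    static String classify(String className) {
        if (className.startsWith("hudson.remoting.")
                || className.startsWith("org.jenkinsci.remoting.")) {
            return "Jenkins Agent";
        }
        if (className.contains(".plugin")) { // loose heuristic, see lead-in
            return "Plugins";
        }
        if (className.startsWith("hudson.model.")
                || className.startsWith("jenkins.model.")) {
            return "Jenkins Core";
        }
        return "3rd Party Libs"; // matches none of the Jenkins patterns
    }

    public static void main(String[] args) {
        // Class from the case study above
        System.out.println(classify("com.buggyapp.cpuspike.CPUSpikerThread")); // 3rd Party Libs
        System.out.println(classify("hudson.model.Queue"));                    // Jenkins Core
    }
}
```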

Documentation by package is available for Jenkins and plugins on the Jenkins Javadoc site.

Most open source third-party libraries also have Javadocs available. Search for ‘Javadoc’ and the package name.

3. Case Study: Garbage Collection Causing High CPU Usage in Jenkins

For this exercise, we set the heap size really low (64MB) in the JVM. To modify the heap size, we changed the JAVA_OPTS parameter in the Jenkins configuration file to include:

-Xms64m -Xmx64m

This is far too low for Jenkins, and as soon as we request the system to do any work, it freezes up because the GC is causing a serious CPU spike. We were unable to even log in.

We ran the yc-360 script and submitted the bundle to yCrash. This gave us the following summary:

Fig: yCrash Summary Showing GC Issues

Not surprisingly, we see that full GC cycles are running consecutively as the JVM tries and fails to clear enough memory. Clicking on the ‘details’ link takes us to a GC log report prepared by the GCeasy tool. First, we see some recommendations:

Fig: GC Recommendations by GCeasy

If we look further down the report, we see statistics of JVM memory usage:

Fig: JVM Memory Usage by GCeasy

When peak usage is this close to the allocated size, the heap is either under-configured or being filled by a memory leak. The heap usage graph further down the report lets us tell the two apart.

Also in the report, we see key performance indicators:

Fig: Key Performance Indicators

Latency, the time that the JVM has paused all application threads during stop-the-world GC events, is very high in this chart.

We can also see a graph of heap usage over time, with full GC events shown by red markers.

Fig: Heap Usage Over Time

This graph is important, because it lets us distinguish between under-configuration and a memory leak when we are experiencing GC CPU spikes. Here, the memory usage is constantly high, and GC is not able to bring it down to an acceptable level. Whenever there is activity in Jenkins, the GC frantically tries to reclaim space, resulting in back-to-back full GC events.

If it had been a memory leak, we would have seen a different pattern, something like this:

Fig: Memory Leak GC Pattern

Here, memory usage is initially acceptable, and the GC is able to clear a certain amount of memory. However, it’s never able to bring it back down to the same level, and the bottom line keeps increasing until memory is full and GC events run continuously.
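The difference between the two patterns can be expressed as a simple heuristic over the heap still in use after each full GC. The sketch below is a simplification for illustration, not how GCeasy or yCrash analyze logs: the class name, the enum, and the 90% threshold are all assumptions.

```java
// Hypothetical heuristic for the two heap patterns described above.
final class HeapPatternHeuristic {

    enum Pattern { RISING_FLOOR, CONSTANTLY_HIGH, NO_CLEAR_PATTERN }

    // postGcUsedMb: heap still in use right after each full GC, oldest first.
    // maxHeapMb:    configured maximum heap size (-Xmx).
    static Pattern classify(long[] postGcUsedMb, long maxHeapMb) {
        if (postGcUsedMb.length < 3) {
            return Pattern.NO_CLEAR_PATTERN; // too few samples to judge
        }
        boolean alwaysHigh = true;
        boolean rising = true;
        for (int i = 0; i < postGcUsedMb.length; i++) {
            if (postGcUsedMb[i] < 0.9 * maxHeapMb) { // 90% is an arbitrary threshold
                alwaysHigh = false;
            }
            if (i > 0 && postGcUsedMb[i] <= postGcUsedMb[i - 1]) {
                rising = false;
            }
        }
        if (alwaysHigh) {
            return Pattern.CONSTANTLY_HIGH; // high from the start: under-configuration
        }
        if (rising) {
            return Pattern.RISING_FLOOR;    // floor creeps upward: possible leak
        }
        return Pattern.NO_CLEAR_PATTERN;
    }
}
```

With invented sample values, classify(new long[] {60, 61, 62, 63}, 64) reports CONSTANTLY_HIGH, while classify(new long[] {100, 200, 300, 400}, 1024) reports RISING_FLOOR.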

How to Fix a CPU Spike in Jenkins

The fix will be entirely different between Jenkins-related spikes and GC-related spikes. We’ll deal with them separately.

1. Resolving High CPU Usage in Jenkins Core and Plugins

Before looking deeper for a fix, it’s always a good idea to make sure all plugins are up to date, since fixes for known problems are regularly released. It’s also worth upgrading to the latest version of Jenkins for the same reason.

Make sure builds aren’t being run on the Jenkins controller. All builds should be directed to an agent.

Always check the Jenkins log for error messages, since unexpected errors can be the cause of high CPU usage. If you haven’t configured Jenkins to make stdout messages available from the controller console, you can still see them:

In Linux, use sudo journalctl -u jenkins (the service name is case-sensitive). If necessary, you can filter the messages using grep.
In Windows, you can view log files under the $JENKINS_HOME directory.

The next step is to use the package name from the stack trace to decide on a suitable fix. The table below shows packages that have been recorded in connection with CPU spikes, and suggested fixes.

Package pattern	Cause	Suggested Fix
org.jenkinsci.plugins.workflow	Complex pipelines; many parallel branches; CPS overload	Simplify pipelines
org.codehaus.groovy.*	Complex Groovy scripting	Simplify scripting
hudson.remoting.*	Large transfers; excessive console output; heavy channel serialization; excessive agent reconnects	Monitor network; reduce console output; carry out large transfers outside peak hours
git	Source control management	Use webhooks or reduce polling frequency
hudson.model.Queue.*, jenkins.model.*	Heavy load on build queue	Add more agents
jenkins.branch.*	SCM	Increase scan intervals or switch to webhooks
hudson.FilePath.*	File system scans	Clean up file systems; use local file systems; reduce area to be scanned
hudson.security.*	Authentication	Update plugins; reduce authentication requests
java.util.zip.*	Compression/decompression	Reduce transfer volumes; schedule large backups during off-peak hours
org.jenkinsci.plugins.xunit.*, hudson.plugins.testng.*	Heavy test result processing (e.g. JUnit merging)	Generate a few large files rather than many small ones; narrow the file system areas to be scanned; merge test results on the agent
hudson.model.Run.*, hudson.model.Job.*	Bloated build histories	Clean up old builds

2. Resolving GC-Induced CPU Spikes in Jenkins

As discussed in the previous section, heap usage patterns will tell us whether the problem is caused by under-configuration or a memory leak.

Memory leaks are almost always caused by plugin problems, so if updating plugins doesn’t help, try removing recently-installed plugins and see if it resolves the problem. Occasionally, memory leaks can also be caused by bugs in Groovy scripts. If the problem occurs when a certain script is running, it may be the source of the problem, and you’ll need to fix the script.

If the diagnostics show that Jenkins is using a very high percentage of the allocated heap size, you’ll need to increase it by using the -Xms and -Xmx JVM arguments. You can do this by adding these parameters to the $JAVA_OPTS entry in the Jenkins configuration file.
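As a quick check of how close the heap is to its limit, the JVM’s standard MemoryMXBean reports current and maximum heap size. This is a minimal sketch with hypothetical class and method names; it reads the heap of the JVM it runs in, so for Jenkins it would have to run inside the Jenkins process.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Hypothetical helper, not part of Jenkins or yCrash.
final class HeapUsageCheck {

    // Percentage of the maximum heap currently in use; -1 if max is undefined.
    static double usedPercent(long usedBytes, long maxBytes) {
        return maxBytes <= 0 ? -1.0 : 100.0 * usedBytes / maxBytes;
    }

    public static void main(String[] args) {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        System.out.printf("Heap used: %d MB of %d MB max (%.1f%%)%n",
                heap.getUsed() >> 20, heap.getMax() >> 20,
                usedPercent(heap.getUsed(), heap.getMax()));
    }
}
```

A single reading says little, since usage naturally rises between collections; what matters is whether usage stays high immediately after GC.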

When these fixes don’t help, you may need to tune the garbage collector. On recent Java versions, Jenkins typically runs with the G1 GC algorithm, which is the JVM’s usual default, and you can find details of how to tune it in this article: G1 GC Tuning Tips.

Can Jenkins CPU Spikes be Predicted?

CPU spikes aren’t always predictable, since they often happen in response to an unusual combination of circumstances. However, in most cases, there are signs we can pick up through careful system monitoring. Prevention is always better than cure, so if we can spot the early signs, we can fix the problem before it disrupts our workflow.

For potential spikes related to Jenkins core and plugins, we can monitor for an overall increase in CPU usage by checking regularly using top. We can also monitor thread dumps. Always take three thread dumps with a brief interval between them. If we load all three at once into fastThread, it does a comparison. Check for the following:

An increase in the number of threads in the RUNNABLE state;
Threads that remain runnable with an identical stack trace across the three dumps;
Threads that are frequently active in compression/decompression packages.
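The second check in the list, threads that stay runnable with an identical stack trace, can be sketched with the JVM’s own ThreadMXBean. This is an illustration under stated assumptions, not how fastThread works: the class and method names are hypothetical, the 10-second interval is an arbitrary choice, and the code only sees the JVM it runs in.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Jenkins, fastThread or yCrash.
final class StuckThreadFinder {

    // Takes one in-JVM snapshot: thread ID -> info for every RUNNABLE thread.
    static Map<Long, ThreadInfo> runnableSnapshot() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        Map<Long, ThreadInfo> snapshot = new HashMap<>();
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            if (info != null && info.getThreadState() == Thread.State.RUNNABLE) {
                snapshot.put(info.getThreadId(), info);
            }
        }
        return snapshot;
    }

    // Names of threads that are RUNNABLE with an identical, non-empty
    // stack trace in every snapshot.
    static List<String> stuckThreads(List<Map<Long, ThreadInfo>> snapshots) {
        List<String> stuck = new ArrayList<>();
        if (snapshots.isEmpty()) {
            return stuck;
        }
        for (Map.Entry<Long, ThreadInfo> e : snapshots.get(0).entrySet()) {
            StackTraceElement[] first = e.getValue().getStackTrace();
            boolean same = first.length > 0;
            for (Map<Long, ThreadInfo> later : snapshots.subList(1, snapshots.size())) {
                ThreadInfo other = later.get(e.getKey());
                same &= other != null && Arrays.equals(first, other.getStackTrace());
            }
            if (same) {
                stuck.add(e.getValue().getThreadName());
            }
        }
        return stuck;
    }

    public static void main(String[] args) throws InterruptedException {
        List<Map<Long, ThreadInfo>> snapshots = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
            snapshots.add(runnableSnapshot());
            if (i < 2) {
                Thread.sleep(10_000); // assumed 10-second interval between dumps
            }
        }
        stuckThreads(snapshots).forEach(System.out::println);
    }
}
```

The result is a list of candidates rather than a verdict: threads waiting in native I/O calls are also reported as RUNNABLE with an unchanged stack, so each hit still needs to be checked against CPU usage.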

To make sure GC is working efficiently, we should regularly monitor GC logs. Check for:

Reduced throughput or increased latency;
Increased object creation rate;
Unhealthy heap usage patterns;
Full GC events happening too frequently.

A good option is to run the yCrash sampling agent continually in the background. It has a very low overhead and monitors micrometrics to pick up potential performance problems proactively. If it picks up any pending issues, it takes full diagnostics, raises an alert, and produces a comprehensive report.

Conclusion

Jenkins CPU spikes can cause anything from occasional annoyance to making the system virtually unusable. They can either be caused by issues within Jenkins and its plugins or by inefficient garbage collection.

The three invaluable diagnostic artifacts are the operating system’s analysis of CPU usage by thread (top -H -p <pid> on Linux, or a Windows equivalent), thread dumps, and GC logs.

Since all three are essential and should be taken at the same time during problem periods, the best option is to use the free yc-360 script to capture a full range of diagnostics and upload the bundle for root cause analysis.
