The Dell Boomi core servers experienced a performance outage at 4 a.m. The following symptoms were observed:
- Dell Boomi wasn’t responding to any incoming requests
- CPU consumption was skyrocketing to 100%
The entire Dell Boomi cluster had to be restarted to recover from the unresponsiveness. In this post, we share the approach we took to identify and resolve the root cause of the problem.
Diagnosing Dell Boomi Issues with yCrash
To pinpoint the root cause of the Dell Boomi performance outage, we turned to yCrash, a specialized Java root cause analysis solution. yCrash is our trusted ally when it comes to monitoring the micro-metrics of the application, enabling us to forecast and address performance hiccups swiftly.
When yCrash detects these hiccups in the environment, it goes to work by capturing a comprehensive range of 360° artifacts from the application. These artifacts include critical elements such as garbage collection logs, thread dumps, heap dumps, application logs, kernel logs, and various system-level statistics. These statistics encompass vital information such as CPU usage (top), multithread process statistics (top -H), network statistics (netstat), virtual memory statistics (vmstat), I/O statistics (iostat), and disk usage data.
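To make the idea of an artifact snapshot concrete, here is a minimal in-process sketch that records heap usage and a thread dump using the standard JDK management beans. This is not how yCrash collects its artifacts; the class name `JvmSnapshot` and the output format are assumptions made for illustration only.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;
import java.lang.management.ThreadInfo;

// Hypothetical helper: captures a small snapshot of the current JVM.
class JvmSnapshot {

    // Returns heap usage followed by a summary of every live thread.
    static String capture() {
        StringBuilder out = new StringBuilder();

        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        out.append("heapUsedBytes=").append(heap.getUsed()).append('\n');
        // getMax() returns -1 when the maximum heap size is undefined.
        out.append("heapMaxBytes=").append(heap.getMax()).append('\n');

        // false, false: skip lock details, which keeps the call cheap and portable.
        for (ThreadInfo thread : ManagementFactory.getThreadMXBean().dumpAllThreads(false, false)) {
            // ThreadInfo.toString() prints the thread state and a truncated stack trace.
            out.append(thread);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(capture());
    }
}
```

A snapshot like this only sees the JVM it runs in. System-level statistics such as top, netstat, vmstat and iostat still have to be collected from the operating system.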
Once yCrash has gathered all these artifacts, it processes them meticulously. The result is an instant root cause analysis report presented on the dashboard. This report sheds light on the underlying issues that contributed to the Dell Boomi performance outage, paving the way for a precise and effective resolution.
Dell Boomi’s Memory Leak
The root cause of the Dell Boomi performance outage came into sharp focus when we examined the root cause analysis report generated by yCrash. Let’s dive into the details revealed by this analysis.
In Figure 1, yCrash’s Garbage Collection log analysis summary provided a crucial insight: the Dell Boomi core servers were showing signs of a memory leak, manifesting as long garbage collection pauses. This narrowed the investigation to memory behavior within the Dell Boomi environment.
A pattern of Full GC events running consecutively without reclaiming memory indicates one of the following:
- The application is suffering from a memory leak.
- The application is creating more objects than the allocated memory capacity can hold.
For more in-depth insights into this type of Garbage Collection behavior, you may refer to this post.
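The pattern described above can be checked mechanically once the Full GC events have been parsed from a GC log. The sketch below is only an illustration: the `FullGcEvent` type, the method name, and the thresholds (three consecutive events, less than 5% of the heap reclaimed) are assumptions, not values taken from yCrash or from this incident.

```java
import java.util.List;

// Hypothetical record of one Full GC event: heap occupancy before and after, in MB.
final class FullGcEvent {
    final long heapBeforeMb;
    final long heapAfterMb;

    FullGcEvent(long heapBeforeMb, long heapAfterMb) {
        this.heapBeforeMb = heapBeforeMb;
        this.heapAfterMb = heapAfterMb;
    }
}

class FullGcPatternCheck {

    // Returns true when at least minRun consecutive Full GC events each reclaimed
    // less than minReclaimedFraction of the heap that was in use before the collection.
    static boolean hasUnproductiveRun(List<FullGcEvent> events, int minRun, double minReclaimedFraction) {
        int run = 0;
        for (FullGcEvent e : events) {
            // An event with an empty heap before collection reclaims nothing by definition.
            double reclaimed = e.heapBeforeMb == 0
                    ? 0.0
                    : (double) (e.heapBeforeMb - e.heapAfterMb) / e.heapBeforeMb;
            run = reclaimed < minReclaimedFraction ? run + 1 : 0;
            if (run >= minRun) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Illustrative numbers only, not data from the outage.
        List<FullGcEvent> healthy = List.of(new FullGcEvent(4000, 1200), new FullGcEvent(3900, 1300));
        List<FullGcEvent> stuck = List.of(
                new FullGcEvent(4000, 3950), new FullGcEvent(4000, 3970), new FullGcEvent(4000, 3990));
        System.out.println(hasUnproductiveRun(healthy, 3, 0.05)); // false
        System.out.println(hasUnproductiveRun(stuck, 3, 0.05));   // true
    }
}
```

A check like this cannot tell the two causes apart. It only flags that the heap is not being reclaimed, which is why the next step is to look at what is filling it.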
What triggered the Memory Problem in Dell Boomi?
In our pursuit to unravel the Dell Boomi performance outage, we first ruled out the possibility of a memory leak, because no code changes or Dell Boomi upgrades had been implemented recently. Therefore, our attention shifted to the second possibility: the application generating excessive objects beyond its allocated memory capacity. Such a scenario typically occurs when there’s a surge in traffic volume. The heap report in yCrash supported this possibility: it pointed to Scheduled Job objects being created in massive numbers.
Our investigation led us to scrutinize whether there was a sudden surge in traffic volume around the time of the outage, particularly at 4 a.m. However, the data revealed that the traffic volume at that hour was minimal, as it was early morning. It was at this juncture that the proverbial light bulb turned on. We delved further and unearthed a critical piece of the puzzle.
Figure 3 paints a revealing picture: a series of scheduled batch jobs were initiated from the Dell Boomi environment at 2 a.m. Ordinarily, these jobs are expected to conclude within a matter of minutes. However, on this specific date, they ran for over two hours, a duration that raised our suspicions considerably.
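A simple duration check would have flagged these jobs long before the 4 a.m. outage. The sketch below is an illustration under assumptions: the class and method names are hypothetical, the date is a placeholder, and the ten-minute limit stands in for "a matter of minutes".

```java
import java.time.Duration;
import java.time.LocalDateTime;

// Hypothetical watchdog check for scheduled jobs.
class JobDurationCheck {

    // Returns true when the job has been running longer than expectedMax.
    static boolean isOverrunning(LocalDateTime startedAt, LocalDateTime now, Duration expectedMax) {
        return Duration.between(startedAt, now).compareTo(expectedMax) > 0;
    }

    public static void main(String[] args) {
        // Placeholder date; the times mirror the 2 a.m. start and 4 a.m. outage.
        LocalDateTime startedAt = LocalDateTime.of(2024, 1, 1, 2, 0);
        LocalDateTime outageTime = LocalDateTime.of(2024, 1, 1, 4, 0);
        Duration expectedMax = Duration.ofMinutes(10); // assumed limit

        System.out.println(isOverrunning(startedAt, outageTime, expectedMax)); // true
    }
}
```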
The prolonged execution of these scheduled jobs became a significant point of interest in our investigation, leading us to a deeper understanding of the root cause of Dell Boomi’s memory problems.
Long-running Dell Boomi Scheduled Jobs
On the day of the outage, the scheduled jobs loaded several million records from the database for processing, approximately ten times the typical daily load. More records meant more work for the Dell Boomi servers, and more work meant more objects created on the Dell Boomi core servers. Consequently, the application came under tremendous memory pressure and became unresponsive.
The path to resolution began with a thorough investigation. It was soon discovered that a glitch in the Dell Boomi Scheduled Job’s SQL query was responsible for loading an excessive number of records from the database. The query, on this particular day, pulled in an unnecessary volume of records, far surpassing what was required.
Once this SQL query was corrected, it loaded the appropriate number of records, thus limiting the workload on the Dell Boomi servers. The result was a substantial reduction in memory pressure on the application. With this issue resolved, the application regained its normal performance and returned to operating smoothly.
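Correcting the query was the actual fix. As a complementary safeguard, a job could compare the number of rows it is about to load against its typical daily volume and refuse to proceed when the count is far out of range. The sketch below is hypothetical: the class name, the threshold factor, and the row counts are illustrative assumptions, and obtaining the row count (for example with a separate count query) is left out.

```java
// Hypothetical pre-flight guard for a scheduled batch job.
class LoadGuard {

    // Returns true when rowCount exceeds typicalRowCount multiplied by maxFactor.
    static boolean exceedsBaseline(long rowCount, long typicalRowCount, double maxFactor) {
        if (typicalRowCount <= 0 || maxFactor <= 0) {
            throw new IllegalArgumentException("typicalRowCount and maxFactor must be positive");
        }
        return rowCount > typicalRowCount * maxFactor;
    }

    public static void main(String[] args) {
        long typicalDailyRows = 500_000L; // assumed baseline, not a figure from the incident
        double maxFactor = 3.0;           // assumed tolerance before the job is halted

        // A load ten times the baseline is rejected; a load 10% above it is accepted.
        System.out.println(exceedsBaseline(5_000_000L, typicalDailyRows, maxFactor)); // true
        System.out.println(exceedsBaseline(550_000L, typicalDailyRows, maxFactor));   // false
    }
}
```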
In the world of Dell Boomi, precision is paramount. The recent performance outage, analyzed with yCrash, turned out to be caused not by a memory leak but by scheduled batch jobs that loaded far more records than intended and put the application under heavy memory pressure. By identifying and rectifying the issue at its source, the faulty SQL query, we restored Dell Boomi’s stability. This experience emphasizes the importance of vigilance and the right tools for maintaining a robust IT environment.