Troubleshooting production problems is complex in general. It is far more complex when our application runs inside a customer’s infrastructure, whether due to business requirements or strict security policies. We can’t sign in to their systems. We can’t attach a debugger. And we depend on the customer’s support staff to capture the diagnostic data we need for troubleshooting, such as application logs and configuration files. This post outlines the key challenges of diagnosing issues in on-premise environments and effective solutions to address them.
Video
In this webinar video, our expert speaker walks through the critical elements of diagnosing performance problems in environments you don’t control: from understanding which artifacts are essential for troubleshooting remotely, to building automation workflows that make these scenarios less painful.
Challenges in Troubleshooting Customer On-Premise Applications
Below are the key challenges in troubleshooting applications that run on customer premises.
1. Missing Troubleshooting Artifacts: Production issues can stem from a variety of causes: memory leaks, network latency, backend slowdowns, storage issues, garbage collection pauses, and more. Diagnosing them requires various artifacts such as application logs, GC logs, thread dumps, and heap dumps. However, customers often fail to provide the complete set of artifacts. Even when we share clear instructions and command-line steps, the customer’s support staff may not follow them correctly. In some cases, our commands may not work in their environment because of configuration differences or OS restrictions. These gaps make it significantly harder to pinpoint the root cause.
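For reference, the JVM-side artifacts mentioned above are typically captured with standard JDK tooling. The commands below are a sketch, not the yc-360 mechanism; `$PID` stands for the target JVM’s process id (obtainable via `jps` or `ps`):

```shell
# Enable GC logging at startup (JDK 9+ unified logging)
java -Xlog:gc*:file=gc.log:time,uptime -jar app.jar
# JDK 8 equivalent: -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:gc.log

# Thread dump from a running JVM
jstack $PID > threaddump.txt   # or: kill -3 $PID (dump goes to the JVM's stdout)

# Heap dump (can pause the JVM; file size is roughly proportional to live heap)
jmap -dump:live,format=b,file=heap.hprof $PID
```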
2. Reliance on Customer Staff for Timely Data Collection: Since we don’t have access to the customer’s environment, we depend on their support staff to capture the troubleshooting artifacts. But timing is everything. Application logs, thread dumps, heap dumps, and GC logs must be collected while the issue is occurring; capturing them too early or after things have stabilized provides little value. A delay of even a few minutes can cause critical signals to be lost, dragging out the investigation and extending the outage.
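To make the timing concrete, here is a minimal sketch of scripting the capture so snapshots are taken during the incident window rather than after recovery. The variable names, defaults, and output layout are illustrative, not part of yc-360:

```shell
#!/bin/sh
# Minimal sketch (illustrative names/defaults, not part of yc-360):
# take a short series of system snapshots during the incident window,
# so the data reflects the system while the problem is actually live.
SNAPSHOTS=${SNAPSHOTS:-3}          # how many captures to take
INTERVAL=${INTERVAL:-1}            # seconds between captures
OUTDIR=${OUTDIR:-diag-snapshots}   # normally named per incident/timestamp
mkdir -p "$OUTDIR"
i=1
while [ "$i" -le "$SNAPSHOTS" ]; do
    top -bn1 > "$OUTDIR/top-$i.txt"   # overall CPU/memory usage
    ps aux   > "$OUTDIR/ps-$i.txt"    # process list at this instant
    df -h    > "$OUTDIR/df-$i.txt"    # disk usage
    [ "$i" -lt "$SNAPSHOTS" ] && sleep "$INTERVAL"
    i=$((i + 1))
done
echo "captured $SNAPSHOTS snapshots in $OUTDIR"
```

A customer’s support staff can kick off such a loop the moment the issue starts, instead of pasting individual commands at exactly the right time.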
3. Refusal to Share Artifacts Due to Security Concerns: Some customers may be unwilling or unable to share key troubleshooting artifacts such as application logs, heap dumps, thread dumps, or configuration files because they contain sensitive information such as PII data and IP addresses. Internal policies and corporate compliance might restrict customers from sharing these artifacts outside their network. Even when problems are severe, this refusal limits your ability to investigate and resolve issues effectively. You are left working with partial information, which increases diagnosis time and the risk of incorrect assumptions.
4. Environmental Instability and Hidden Customizations: Sometimes the root cause isn’t our application but the customer’s environment. Network issues in the customer’s environment can interfere with our application’s backend connectivity. The customer might run an internal cron job that consumes an enormous amount of CPU cycles on the server, which in turn affects our application’s availability. Yet our application often gets blamed first. These situations are hard to validate without the right system-level artifacts.
To make matters worse, customers may apply custom configurations or environment-specific changes without informing us. These hidden modifications, such as tuning JVM flags, altering thread pool sizes, or introducing new dependencies, can cause unexpected behavior that is difficult to replicate in a standard environment. Without full visibility, identifying the root cause becomes a guessing game.
5. Misleading Communication About Problems: Customers may miscommunicate the issue. For instance, they might describe an OutOfMemoryError as a crash even though the application is still running, or report high CPU usage when the problem is actually a memory leak. Just as doctors struggle to diagnose correctly when a patient describes the wrong symptoms, you are forced to spend extra cycles chasing misleading clues.
6. Unable to Reproduce the Problem Internally: Because the problem is tied to customer-specific data, load patterns, or infrastructure quirks, you may not be able to reproduce the issue in your own environment. This makes debugging speculative and slow, relying heavily on guesswork and back-and-forth interaction.
Effective Troubleshooting Strategy
As leaders, we sometimes need to make decisions without complete knowledge. In such cases, we gather sufficient data, digest it, and make an informed decision. We need to apply the same principle when troubleshooting production problems: gather 360° troubleshooting artifacts from the application running on the customer’s premises, then diagnose the problem. Below is the complete list of artifacts required to troubleshoot production problems:
| Artifact | What It Captures |
| --- | --- |
| Application Log | Logs generated by your application—useful for identifying functional failures |
| GC Log | Garbage collection activity—helps detect memory overuse, frequent GCs, pauses |
| Thread Dump | Snapshot of all threads in the JVM—key to spotting deadlocks, BLOCKED/stuck threads |
| Heap Dump | Memory snapshot of JVM objects—used to identify memory leaks or heavy objects |
| Heap Substitute | Lightweight version of heap dump when full heap dump isn’t available |
| top | Overall CPU/memory usage of system processes |
| ps | Snapshot of currently running processes |
| top -H | Thread-level CPU usage—helps isolate CPU-intensive threads |
| Disk Usage (df -h) | Available/used disk space—useful when app errors stem from full disks |
| dmesg | Kernel logs—catches low-level system issues like hardware errors or OOM kills |
| netstat | Network connections, open ports, and listening sockets |
| ping | Network latency to external or internal endpoints |
| vmstat | Virtual memory, I/O, and CPU scheduling stats |
| iostat | Disk I/O performance metrics |
| Kernel Parameters | System tuning configurations (like swappiness, max open files) |
| Extended Data | Any custom scripts or data you configure yc-360 script to collect |
| Metadata | Basic system/app metadata (JVM version, hostname, uptime, etc.) |
How to Capture These 360° Troubleshooting Artifacts?
These 360° troubleshooting artifacts can be captured using the lightweight, open-source yc-360 Script. This platform-agnostic script captures all of the above-mentioned troubleshooting artifacts and bundles them into a zip file.
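To illustrate what goes into such a bundle at the OS level, here is a minimal hand-rolled sketch. This is not the yc-360 implementation; the real script also captures the JVM-side artifacts and handles platform differences, and the output names below are made up:

```shell
#!/bin/sh
# Minimal hand-rolled sketch (NOT the yc-360 implementation): gather
# the OS-level artifacts from the table above and bundle them into a
# single shareable archive. Failures (missing tools, no root) are
# tolerated so the bundle still gets produced.
OUT=${OUT:-artifacts-bundle}      # normally hostname + timestamp
mkdir -p "$OUT"
top -bn1    > "$OUT/top.txt"           2>/dev/null   # CPU/memory overview
top -bn1 -H > "$OUT/top-threads.txt"   2>/dev/null   # per-thread CPU
ps aux      > "$OUT/ps.txt"            2>/dev/null   # process snapshot
df -h       > "$OUT/df.txt"            2>/dev/null   # disk usage
vmstat 1 2  > "$OUT/vmstat.txt"        2>/dev/null   # memory/I-O/CPU stats
dmesg       > "$OUT/dmesg.txt"         2>/dev/null   # kernel log (may need root)
(netstat -an || ss -an) > "$OUT/net.txt" 2>/dev/null # connections and ports
sysctl -a   > "$OUT/kernel-params.txt" 2>/dev/null   # kernel parameters
tar -czf "$OUT.tar.gz" "$OUT"
echo "bundle written to $OUT.tar.gz"
```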
You can trigger the script using the instructions given here.
Key Features of ‘yc-360 Script’
Salient features of ‘yc-360 Script’ are presented in this section:
1. Platform Agnostic: yc-360 Script runs in a wide range of environments: bare metal, virtual machines, containers, Kubernetes, OpenShift, and more. No matter which platform the customer runs our application on, the script works seamlessly.
2. Pristine Data Capture: Certain Docker distributions ship with only a bare-minimum set of tools, so commands such as netstat and top are unavailable. In such cases, the script extracts the required information directly from the kernel and reports it in a pristine format.
3. Extended Data Capturing Support: As shown in the table above, yc-360 Script captures 16 artifacts from the customer environment. Beyond those, we might want to capture additional artifacts specific to our application, such as application configuration settings or version information. The script facilitates capturing such additional artifacts through its Extended Data feature.
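As a hypothetical illustration of the idea behind Extended Data (the file names and contents below are made up; the real yc-360 feature is configured through the script itself), application-specific files can be collected alongside the standard artifacts:

```shell
#!/bin/sh
# Hypothetical illustration only: collect application-specific files
# next to the standard artifacts. Names and paths are fabricated.
OUT=${OUT:-extended-artifacts}
mkdir -p "$OUT"
# Stand-in for an application configuration file we also want captured
printf 'pool.size=32\ncache.enabled=true\n' > app.properties
cp app.properties "$OUT/"                            # app configuration
java -version > "$OUT/java-version.txt" 2>&1 || true # runtime version, if java is present
echo "extended artifacts collected in $OUT"
```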
4. Security and Compliance: The troubleshooting artifacts captured by the script can contain sensitive information such as IP addresses and PII data, which customers may be hesitant to share. For that circumstance, the script can redact such sensitive information, making the artifacts more shareable. Learn more about security features.
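As an illustration of what redaction means in practice, the stand-in below masks IPv4 addresses and email-like identifiers with `sed`. The actual patterns and mechanism yc-360 uses are its own; `sample.log` here is a fabricated log:

```shell
#!/bin/sh
# Stand-in sketch of log redaction (not the yc-360 implementation):
# mask IPv4 addresses and email-like PII before a log leaves the network.
cat > sample.log <<'EOF'
2024-05-01 10:14:02 INFO  login ok user=alice@example.com from 10.12.4.77
2024-05-01 10:14:05 ERROR timeout contacting 192.168.1.20:8443
EOF
sed -E \
    -e 's/[0-9]{1,3}(\.[0-9]{1,3}){3}/x.x.x.x/g' \
    -e 's/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+/<redacted-email>/g' \
    sample.log > sample.redacted.log
cat sample.redacted.log
```

After redaction, the log still shows the sequence of events (timestamps, levels, messages) while the addresses and identities are masked.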
5. Almost Zero Overhead: The script adds very minimal overhead to the environment.
- CPU Consumption: Averages around 0.05%, with occasional spikes up to 3% during data collection.
- Memory Consumption: Consistently between 0.1% and 0.2%.
These metrics indicate that yc-360 script can safely run in production environments without causing noticeable overhead. For more information, refer to the yc-360 script Overhead Performance blog post.
6. Completes in Under 30 Seconds: In most environments, the yc-360 script completes execution in under 30 seconds. However, collecting heap dumps may take longer depending on heap size and system performance; thus, by default, the script doesn’t capture heap dumps.
What Are the Advantages of Capturing 360° Troubleshooting Artifacts?
There are several advantages to pursuing this strategy. Here are some of the key ones:
1. Simplifies the Root Cause Identification Process: Since we have all the artifacts necessary to troubleshoot the problem, identifying the root cause becomes much easier.
Once you have the 360° troubleshooting artifacts bundled into a zip file, you can analyze them manually with individual tools for each artifact type. Alternatively, you can use yCrash’s bundle analysis feature, which analyzes all of these artifacts and generates a root cause analysis incident report instantly.
2. Eliminates Wrong Symptoms Reported by Customers: Even if a customer reports symptoms incorrectly, you have the proper artifacts to validate the actual behavior. Just as a doctor draws blood and urine samples to confirm (or rule out) the symptoms a patient reports, these artifacts provide real insight into what is happening at the customer site.
3. Minimizes Friction and Back-and-Forth Communication: Since the artifact-capture process is automated, we don’t have to share every command and instruction required to capture each artifact. This heavily minimizes the back-and-forth communication we would typically have with the customer, saves an enormous amount of time, and helps us focus on identifying the root cause of the problem.
You can mandate that the customer’s support staff submit the 360° troubleshooting artifacts when they open tickets with you.
Real World Case Studies
Here are a few real-world case studies showing how 360° artifacts helped identify complex production problems in major enterprises:
- Insurance Platform Increased Throughput by 30%: An insurance company uncovered allocation bottlenecks through GCeasy, leading to significant throughput improvements.
- North American Trading Platform Resolved Severe CPU Spikes: A major trading platform diagnosed concurrency issues causing CPU spikes and reduced diagnostic time significantly, avoiding revenue-impacting outages during market hours.
- E-commerce Platform on AWS Resolved 502 Gateway Errors: An AWS-hosted service was experiencing 502 errors; thread dump analysis revealed blocked threads, leading to a fast resolution and reduced customer impact.
Final Thoughts
The free, open-source yc-360 Script simplifies the process of troubleshooting applications on customer premises. It eliminates significant hassle in capturing the necessary troubleshooting artifacts, minimizes back-and-forth communication between us and our customers, and helps us focus on diagnosing the problem.

