
Troubleshooting .NET Production problems using AI

Many of the world's mission-critical applications are powered by .NET; it is one of the largest platforms in the enterprise space. However, troubleshooting .NET problems effectively and quickly remains a challenge. In this post, let's discuss effective techniques for troubleshooting .NET production problems.

Video

Watch the recording below to explore practical strategies, real-world scenarios, and actionable insights on using AI to troubleshoot .NET production issues faster and more effectively.

Present Challenges in .NET Troubleshooting

Today we have several observability tools in the market that monitor the health of .NET applications. They are good at detecting problems, capturing the 3 pillars of observability (i.e., logs, metrics & traces), and reporting them. These observability tools are good at telling you what problem happened. Example:

a. CPU spiked by x%

b. Memory shot up by y%

c. Response time degraded by z seconds

However, they are not good at reporting the root cause behind many of these production problems.

16 Artifacts to Troubleshoot .NET Applications

Observability tools are unable to tell such precise root cause because of the following reasons:

  1. Surface-Level Metrics: Current observability tools primarily capture the following 3 artifacts: logs, metrics & traces (which are called the 3 pillars of observability). These are wonderful artifacts, but they are surface-level; they are not sufficient to identify the root cause of production problems like the ones shown above. This is where yCrash differentiates itself: it captures 16 troubleshooting artifacts (along with logs & metrics) to identify the root cause of the production problem.

Fig: 16 artifacts captured by yCrash for troubleshooting .NET applications

Here is a quick overview of these artifacts:

 i) Runtime Artifacts:

 ii) System Level Artifacts:

2. Not Bridging Runtime & System-Level Artifacts: APM tools have infrastructure/machine agents that collect CPU, memory, disk, and network metrics; however, they lack granularity, correlation, and depth. For example, an APM tool will tell you that the overall system's CPU utilization is at 80%. yCrash, on the other hand, analyzes a synchronized snapshot: it captures 'top -H' (thread-level CPU) output, maps it to a specific thread in the application, and reports that thread as the bottleneck. So as an end user, you get to know which thread is causing the CPU spike and what lines of code that thread is working on.
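To make the thread-level correlation concrete, here is a minimal sketch of mapping `top -H` output to the hottest thread. The sample text and column positions are illustrative assumptions, not actual yCrash agent output:

```python
# Sketch: finding the hottest thread from `top -H` style output.
# SAMPLE_TOP_H is illustrative sample text, not real yCrash data.

SAMPLE_TOP_H = """\
  PID USER      PR  NI    VIRT    RES  %CPU %MEM     TIME+ COMMAND
 4021 app       20   0    9.8g   1.2g  78.3  8.1  12:45.01 dotnet
 4022 app       20   0    9.8g   1.2g   1.2  8.1   0:03.22 dotnet
 4030 app       20   0    9.8g   1.2g   0.4  8.1   0:01.10 dotnet
"""

def hottest_thread(top_h_text: str):
    """Return (tid, cpu_pct) of the thread consuming the most CPU."""
    best = None
    for line in top_h_text.splitlines()[1:]:   # skip the header row
        fields = line.split()
        if len(fields) < 9:
            continue
        tid, cpu = int(fields[0]), float(fields[6])
        if best is None or cpu > best[1]:
            best = (tid, cpu)
    return best

print(hottest_thread(SAMPLE_TOP_H))   # -> (4021, 78.3)
```

Once the hot thread ID is known, it can be cross-referenced against a thread dump taken at the same moment to see exactly which code path is burning CPU, which is the "synchronized snapshot" idea described above.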

Forecasting Outage

Observability is about "What happened?". However, it would be even more valuable to know "What is going to happen?"

Present-day APM tools primarily monitor these 3 macro-metrics: CPU, memory & response time. These are essential metrics; however, they are reactive indicators, i.e., they trigger alerts only after a problem has surfaced in the production environment. That is good, but not good enough. We want to be able to predict outages. If we can predict outages, negative customer impact can be minimized.

Outages can be predicted if 9 different micro-metrics of the application stack are monitored. In this post, let's discuss one single micro-metric, "GC Behavior", to give you a flavor.

Garbage collection heavily influences an application's performance, and studying garbage collection behavior helps you forecast memory-related bottlenecks.

Fig: Garbage Collection Behavior of a Healthy Application

The above graph shows the GC behavior of a healthy application. You can see a beautiful saw-tooth GC pattern. You can notice that when the heap usage reaches ~5.8GB, a 'Full GC' event (red triangle) gets triggered. When the 'Full GC' event runs, memory utilization drops all the way to the bottom, i.e., ~200MB. Please see the dotted black line in the graph, which connects all the bottom points. You can notice that this dotted black line is going at 0°. It indicates that the application is in a healthy state & not suffering from any sort of memory problem.

Fig: Application Suffering from Acute Memory Leak in Performance Lab

Above is the garbage collection behavior of an application suffering from an acute memory leak. With this pattern, heap usage climbs slowly, eventually resulting in an OutOfMemoryException.

In the above figure, you can notice that the 'Full GC' event (red triangle) gets triggered when heap usage reaches ~8GB. In the graph, you can also observe that the amount of heap the Full GC events are able to recover declines over time, i.e.,

a. When the first Full GC event ran, heap usage dropped to 3.9GB 

b. When the second Full GC event ran, heap usage dropped only to 4.5GB

c. When the third Full GC event ran, heap usage dropped only to 5GB

d. When the final Full GC event ran, heap usage dropped only to 6.5GB

Please see the dotted black line in the graph, which connects all the bottom points. You can notice that this black line is climbing at 15°. This indicates that the application is suffering from an acute memory leak. If this application runs for a prolonged period, it will experience an OutOfMemoryException. However, in our performance labs, we don't run the application for long periods.
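The "angle of the dotted black line" can be expressed numerically as the slope of the post-Full-GC heap floors. Here is a minimal sketch, using the heap floor values quoted above; the 0.05 GB-per-GC threshold is an assumed cutoff for illustration:

```python
# Sketch: flagging a memory leak from the heap floor left after each Full GC.
# A flat floor (~0 slope) is healthy; a rising floor suggests a leak.

def slope(ys):
    """Least-squares slope of ys against their indices 0..n-1."""
    n = len(ys)
    xs = range(n)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

healthy_floors = [0.2, 0.2, 0.2, 0.2]   # flat baseline, the "0 degree" line
leaking_floors = [3.9, 4.5, 5.0, 6.5]   # climbing baseline from the figure (GB)

for name, floors in [("healthy", healthy_floors), ("leaking", leaking_floors)]:
    s = slope(floors)
    verdict = "LEAK SUSPECTED" if s > 0.05 else "OK"
    print(f"{name}: slope {s:+.2f} GB per Full GC -> {verdict}")
```

For the leaking series above, the slope works out to roughly +0.83GB per Full GC cycle, which is exactly the kind of trend a forecasting engine can alert on long before the heap is exhausted.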

When this application is released into production, you will see the following behavior:

Fig: Application Suffering from OutOfMemoryException in Production

In the above graph, towards the right side, you can notice that Full GC events run continuously, yet memory usage doesn't drop. It's a clear indication that the application is suffering from a memory leak. By the time this pattern appears, customers have already been impacted and it's too late to catch the problem.
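This end-stage pattern, back-to-back Full GCs reclaiming almost nothing, can also be detected mechanically. Below is a minimal sketch; the before/after heap pairs, the 5% recovery threshold, and the 3-event streak are illustrative assumptions:

```python
# Sketch: detecting the "Full GC runs but memory doesn't drop" pattern.
# Each pair is (heap_before_gc, heap_after_gc) in GB; numbers are illustrative.

def near_oom(gc_events, min_recovery=0.05, streak=3):
    """True if `streak` consecutive Full GCs each reclaim less than
    `min_recovery` (as a fraction of the pre-GC heap)."""
    run = 0
    for before, after in gc_events:
        if (before - after) / before < min_recovery:
            run += 1
            if run >= streak:
                return True
        else:
            run = 0
    return False

healthy   = [(5.8, 0.2), (5.8, 0.2), (5.8, 0.2)]    # big drops each time
thrashing = [(8.0, 7.9), (8.0, 7.9), (8.0, 7.95)]   # GC running, nothing freed

print(near_oom(healthy))    # -> False
print(near_oom(thrashing))  # -> True
```

The point of the forecasting approach described earlier is to catch the rising-floor trend well before this thrashing stage, when customers are already impacted.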

yCrash: The Deterministic Root Cause Analysis

yCrash cleverly leverages these 9 micro-metrics to forecast outages in .NET applications. If an outage is forecasted, it captures the 16 artifacts mentioned earlier to identify the root cause of the production problem. Here is the workflow of yCrash's holistic troubleshooting solution:

1) Capture Micro-Metrics: Unlike other monitoring tool agents, the yCrash agent runs in a 'non-intrusive mode' (i.e., it runs outside the .NET runtime). Every 3 minutes (by default), it captures the following artifacts from your application (which are the source of the micro-metrics):

2) Transmit to yCrash Server: The yCrash agent transmits the micro-metrics captured in step #1 to the yCrash server for analysis over the secure HTTPS protocol.

3) ML algorithms & Pattern recognition: yCrash Server employs advanced Machine Learning algorithms and pattern recognition technologies to analyze Micro-Metrics comprehensively. This analysis aims to detect potential issues or anomalies brewing within the application.

4) Proactive Forecasting: If a problem is forecasted based on the analysis, the yCrash server instructs the agent to capture the 360° troubleshooting artifacts essential to identifying the root cause. These data are then meticulously analyzed and reported as detailed incident reports in the yCrash dashboard.
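The four-step workflow above can be sketched as a simple capture-analyze-escalate loop. Everything below is an assumption for illustration: the function names, the fake in-process "server", and the sample metric values are invented, and this is in no way the actual yCrash agent implementation:

```python
# Sketch of the workflow described above: capture micro-metrics on an
# interval, send them for analysis, and escalate to a full artifact
# capture when a problem is forecasted. All names are hypothetical.

import json

CAPTURE_INTERVAL_MINUTES = 3   # the default interval mentioned in the article

def capture_micro_metrics():
    # A real agent would read GC stats, thread counts, etc. from the host.
    return {"heap_gb": 6.5, "full_gc_count": 4}

def fake_server_analyze(snapshot):
    # Stand-in for the server-side ML / pattern-recognition step (step 3).
    return {"problem_forecast": snapshot["heap_gb"] > 6.0}

def agent_cycle():
    snapshot = capture_micro_metrics()                   # step 1: capture
    payload = json.dumps(snapshot)                       # step 2: serialize (real agent would POST over HTTPS)
    verdict = fake_server_analyze(json.loads(payload))   # step 3: analyze
    if verdict["problem_forecast"]:                      # step 4: escalate
        return "capture 360-degree artifacts"
    return "sleep until next cycle"

print(agent_cycle())   # -> capture 360-degree artifacts
```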

Conclusion

Hopefully the techniques that yCrash employs, i.e., micro-metrics monitoring to forecast outages and the 360° artifacts it captures to analyze the root cause, will help your organization resolve .NET production problems effectively.
