
Troubleshooting .NET Production problems using AI

Many of the world's mission-critical applications are powered by .NET; it is one of the largest platforms in the enterprise space. However, troubleshooting .NET problems effectively and quickly remains a challenge. In this post, let's discuss effective techniques for troubleshooting .NET production problems.

Video

Watch the recording below to explore practical strategies, real-world scenarios, and actionable insights on using AI to troubleshoot .NET production issues faster and more effectively.

Present Challenges in .NET Troubleshooting

Today we have several observability tools in the market that monitor the health of .NET applications. They are good at detecting problems, capturing the 3 pillars of observability (i.e., logs, metrics & traces), and reporting them. These observability tools are good at telling you what problem happened. Example:

a. CPU spiked by x%

b. Memory shot up by y%

c. Response time degraded by z seconds

However, they are not good at reporting the root cause behind many of these production problems.

16 Artifacts to Troubleshoot .NET Applications

Observability tools are unable to tell such precise root cause because of the following reasons:

  1. Surface-Level Metrics: Current observability tools primarily capture the following 3 artifacts: logs, metrics & traces (which are called the 3 pillars of observability). These are wonderful artifacts, but they are surface-level; they are not sufficient to identify the root cause of production problems like the ones shown above. This is where yCrash differentiates itself: it captures 16 troubleshooting artifacts (along with logs & metrics) to identify the root cause of the production problem.

Fig: 16 artifacts captured by yCrash for troubleshooting .NET applications

Here is a quick overview of these artifacts:

 i) Runtime Artifacts:

 ii) System Level Artifacts:

2. Not Bridging Runtime & System-Level Artifacts: APM tools have infrastructure/machine agents that collect CPU, memory, disk, and network metrics; however, they lack granularity, correlation, and depth. For example, an APM tool will tell you that the overall system's CPU utilization is at 80%. yCrash, on the other hand, analyzes a synchronized snapshot: it captures 'top -H' (thread-level CPU) output, maps it to a specific thread in the application, and reports that thread as the bottleneck. So as an end user, you get to know which thread is causing the CPU spike and what lines of code that thread is working on.
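To make the thread-level correlation concrete, here is a minimal sketch of mapping `top -H` output to the hottest thread. The sample text and column positions are illustrative assumptions, not actual yCrash agent output:

```python
# Sketch: finding the hottest thread from `top -H` style output.
# SAMPLE_TOP_H is illustrative sample text, not real yCrash data.

SAMPLE_TOP_H = """\
  PID USER      PR  NI    VIRT    RES  %CPU %MEM     TIME+ COMMAND
 4021 app       20   0    9.8g   1.2g  78.3  8.1  12:45.01 dotnet
 4022 app       20   0    9.8g   1.2g   1.2  8.1   0:03.22 dotnet
 4030 app       20   0    9.8g   1.2g   0.4  8.1   0:01.10 dotnet
"""

def hottest_thread(top_h_text: str):
    """Return (tid, cpu_pct) of the thread consuming the most CPU."""
    best = None
    for line in top_h_text.splitlines()[1:]:   # skip the header row
        fields = line.split()
        if len(fields) < 9:
            continue
        tid, cpu = int(fields[0]), float(fields[6])
        if best is None or cpu > best[1]:
            best = (tid, cpu)
    return best

print(hottest_thread(SAMPLE_TOP_H))   # -> (4021, 78.3)
```

Once the hot thread ID is known, it can be cross-referenced against a thread dump taken at the same moment to see exactly which code path is burning CPU, which is the "synchronized snapshot" idea described above.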

Forecasting Outage

Observability is about "What happened?". However, it would be even more valuable to know "What is going to happen?"

Present-day APM tools primarily monitor these 3 macro-metrics: CPU, memory & response time. These are essential metrics; however, they are reactive indicators, i.e., they trigger alerts only after a problem has surfaced in the production environment. That is good, but not good enough. We want to be able to predict outages. If we can predict outages, negative customer impact can be minimized.

Outages can be predicted if 9 different micro-metrics of the application stack are monitored. In this post, let's discuss one single micro-metric, "GC Behavior", to give you a flavor.

Garbage collection heavily influences an application's performance, and studying garbage collection behavior helps you forecast memory-related bottlenecks.

Fig: Garbage Collection Behavior of a Healthy Application

The above graph shows the GC behavior of a healthy application. You can see a beautiful saw-tooth GC pattern. You can notice that when the heap usage reaches ~5.8GB, a 'Full GC' event (red triangle) gets triggered. When the 'Full GC' event runs, memory utilization drops all the way to the bottom, i.e., ~200MB. Please see the dotted black line in the graph, which connects all the bottom points. You can notice that this dotted black line is going at 0°. It indicates that the application is in a healthy state & not suffering from any sort of memory problem.

Fig: Application Suffering from Acute Memory Leak in Performance Lab

Above is the garbage collection behavior of an application suffering from an acute memory leak. With this pattern, heap usage climbs slowly, eventually resulting in an OutOfMemoryException.

In the above figure, you can notice that the 'Full GC' event (red triangle) gets triggered when heap usage reaches ~8GB. In the graph, you can also observe that the amount of heap the Full GC events are able to recover declines over time, i.e.,

a. When the first Full GC event ran, heap usage dropped to 3.9GB 

b. When the second Full GC event ran, heap usage dropped only to 4.5GB

c. When the third Full GC event ran, heap usage dropped only to 5GB

d. When the final Full GC event ran, heap usage dropped only to 6.5GB

Please see the dotted black line in the graph, which connects all the bottom points. You can notice that this black line is climbing at 15°. This indicates that the application is suffering from an acute memory leak. If this application runs for a prolonged period, it will experience an OutOfMemoryException. However, in our performance labs, we don't run the application for long periods.
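The "angle of the dotted black line" can be expressed numerically as the slope of the post-Full-GC heap floors. Here is a minimal sketch, using the heap floor values quoted above; the 0.05 GB-per-GC threshold is an assumed cutoff for illustration:

```python
# Sketch: flagging a memory leak from the heap floor left after each Full GC.
# A flat floor (~0 slope) is healthy; a rising floor suggests a leak.

def slope(ys):
    """Least-squares slope of ys against their indices 0..n-1."""
    n = len(ys)
    xs = range(n)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

healthy_floors = [0.2, 0.2, 0.2, 0.2]   # flat baseline, the "0 degree" line
leaking_floors = [3.9, 4.5, 5.0, 6.5]   # climbing baseline from the figure (GB)

for name, floors in [("healthy", healthy_floors), ("leaking", leaking_floors)]:
    s = slope(floors)
    verdict = "LEAK SUSPECTED" if s > 0.05 else "OK"
    print(f"{name}: slope {s:+.2f} GB per Full GC -> {verdict}")
```

For the leaking series above, the slope works out to roughly +0.83GB per Full GC cycle, which is exactly the kind of trend a forecasting engine can alert on long before the heap is exhausted.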

When this application is released into production, you will see the following behavior:

Fig: Application Suffering from OutOfMemoryException in Production

In the above graph, towards the right side, you can notice that Full GC events run continuously, yet memory usage doesn't drop. It's a clear indication that the application is suffering from a memory leak. By the time this pattern appears, customers have already been impacted and it's too late to catch the problem.
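This end-stage pattern, back-to-back Full GCs reclaiming almost nothing, can also be detected mechanically. Below is a minimal sketch; the before/after heap pairs, the 5% recovery threshold, and the 3-event streak are illustrative assumptions:

```python
# Sketch: detecting the "Full GC runs but memory doesn't drop" pattern.
# Each pair is (heap_before_gc, heap_after_gc) in GB; numbers are illustrative.

def near_oom(gc_events, min_recovery=0.05, streak=3):
    """True if `streak` consecutive Full GCs each reclaim less than
    `min_recovery` (as a fraction of the pre-GC heap)."""
    run = 0
    for before, after in gc_events:
        if (before - after) / before < min_recovery:
            run += 1
            if run >= streak:
                return True
        else:
            run = 0
    return False

healthy   = [(5.8, 0.2), (5.8, 0.2), (5.8, 0.2)]    # big drops each time
thrashing = [(8.0, 7.9), (8.0, 7.9), (8.0, 7.95)]   # GC running, nothing freed

print(near_oom(healthy))    # -> False
print(near_oom(thrashing))  # -> True
```

The point of the forecasting approach described earlier is to catch the rising-floor trend well before this thrashing stage, when customers are already impacted.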

yCrash: The Deterministic Root Cause Analysis

yCrash cleverly leverages these 9 micro-metrics to forecast outages in .NET applications. If an outage is forecasted, it captures the 16 artifacts mentioned earlier to identify the root cause of the production problem. Here is the workflow of yCrash's holistic troubleshooting solution:

1) Capture Micro-Metrics: Unlike other monitoring tool agents, the yCrash agent runs in a 'non-intrusive mode' (i.e., it runs outside the .NET runtime). Every 3 minutes (by default), it captures the following artifacts from your application (which are the source of the micro-metrics):

2) Transmit to yCrash Server: The yCrash agent transmits the micro-metrics captured in step #1 to the yCrash server for analysis over the secure HTTPS protocol.

3) ML algorithms & Pattern recognition: yCrash Server employs advanced Machine Learning algorithms and pattern recognition technologies to analyze Micro-Metrics comprehensively. This analysis aims to detect potential issues or anomalies brewing within the application.

4) Proactive Forecasting: If a problem is forecasted based on the analysis, the yCrash server instructs the agent to capture the 360° troubleshooting artifacts essential to identifying the root cause. These data are then meticulously analyzed and reported as detailed incident reports in the yCrash dashboard.
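The four-step workflow above can be sketched as a simple capture-analyze-escalate loop. Everything below is an assumption for illustration: the function names, the fake in-process "server", and the sample metric values are invented, and this is in no way the actual yCrash agent implementation:

```python
# Sketch of the workflow described above: capture micro-metrics on an
# interval, send them for analysis, and escalate to a full artifact
# capture when a problem is forecasted. All names are hypothetical.

import json

CAPTURE_INTERVAL_MINUTES = 3   # the default interval mentioned in the article

def capture_micro_metrics():
    # A real agent would read GC stats, thread counts, etc. from the host.
    return {"heap_gb": 6.5, "full_gc_count": 4}

def fake_server_analyze(snapshot):
    # Stand-in for the server-side ML / pattern-recognition step (step 3).
    return {"problem_forecast": snapshot["heap_gb"] > 6.0}

def agent_cycle():
    snapshot = capture_micro_metrics()                   # step 1: capture
    payload = json.dumps(snapshot)                       # step 2: serialize (real agent would POST over HTTPS)
    verdict = fake_server_analyze(json.loads(payload))   # step 3: analyze
    if verdict["problem_forecast"]:                      # step 4: escalate
        return "capture 360-degree artifacts"
    return "sleep until next cycle"

print(agent_cycle())   # -> capture 360-degree artifacts
```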

Conclusion

Hopefully the techniques that yCrash employs, i.e., micro-metrics monitoring to forecast outages and the 360° artifacts it captures to analyze the root cause, will help your organization resolve .NET production problems effectively.
