Monitoring != Root cause analysis

APM gives news, yCrash gives answers

The industry has seen cutting-edge Application Performance Monitoring (APM) tools (e.g., AppDynamics, New Relic, Dynatrace) and infrastructure monitoring tools (e.g., Nagios, Ngmon). These monitoring tools are great at detecting the symptoms of a problem. For example, they can detect that CPU spiked by x%, memory degraded by y%, or response time shot up by z seconds. But they don’t answer the questions: Why did the CPU spike? Why did memory degrade? Why did response time increase?

To answer these questions, you need to capture and analyze various application-level and system-level artifacts, such as garbage collection logs, thread dumps, heap dumps, netstat, iostat, vmstat, top, disk usage, kernel parameters, and several more, to identify the root cause of the problem. yCrash, the root cause analysis tool, does this job for you automatically. It captures and analyzes these artifacts and identifies the root cause of the problem instantly.
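To make one of these artifacts concrete, here is a minimal sketch of what a thread dump summary might look like when captured with the standard `java.lang.management` API. It is an illustration only and not how yCrash collects data: this code runs inside the JVM for simplicity, and the class name `ThreadStateSummary` and method `countByState` are hypothetical names chosen for this example.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.EnumMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of yCrash.
class ThreadStateSummary {

    // Counts threads per Thread.State from a set of ThreadInfo snapshots.
    static Map<Thread.State, Integer> countByState(ThreadInfo[] infos) {
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        for (ThreadInfo info : infos) {
            if (info == null) {
                continue; // defensive: skip entries for threads that no longer exist
            }
            counts.merge(info.getThreadState(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // Snapshot all live threads; skip lock details to keep the capture cheap.
        ThreadInfo[] infos = threads.dumpAllThreads(false, false);
        countByState(infos).forEach(
                (state, count) -> System.out.println(state + ": " + count));
    }
}
```

A sudden rise in the BLOCKED or WAITING counts between two snapshots is the kind of signal that points toward a cause rather than a symptom.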

Non-intrusive

APM agents run within your application’s JVM. They intercept every single call, adding significant overhead to your application. APM vendors claim that their agents add less than 3% overhead, but that is far from reality.

The yCrash agent runs on the device and NOT WITHIN THE JVM. It analyzes only the data that is already generated by the application. Thus yCrash doesn’t add any observable overhead to your application.

Micrometrics

APM reports macrometrics like CPU utilization, memory utilization, response time, and component-level response time.

yCrash reports and analyzes micrometrics like GC throughput, object creation rate, object promotion rate, GC latency, thread states, thread group size, list of open file descriptors, … For more details on the micrometrics, refer to this article. Using these micrometrics, yCrash can predict and forecast outages before they happen.
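As a rough illustration of one such micrometric, GC throughput can be thought of as the percentage of time the application spends doing real work rather than garbage collection. The sketch below approximates it from the JVM’s standard `GarbageCollectorMXBean` counters. It is an assumption-laden example and not yCrash’s implementation: yCrash derives its metrics from artifacts such as GC logs, whereas this code runs inside the JVM, and the class name `GcThroughput` and its methods are hypothetical. The collection time reported by the JVM is itself approximate and, for concurrent collectors, may not equal pause time.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper for illustration; not part of yCrash.
class GcThroughput {

    // Percentage of elapsed time NOT spent in garbage collection.
    static double throughputPercent(long uptimeMillis, long gcMillis) {
        if (uptimeMillis <= 0) {
            return 100.0; // no elapsed time yet, so nothing was lost to GC
        }
        long clamped = Math.max(0, Math.min(gcMillis, uptimeMillis));
        return 100.0 * (uptimeMillis - clamped) / uptimeMillis;
    }

    // Sums the approximate accumulated collection time across all collectors.
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long time = gc.getCollectionTime(); // -1 when the collector does not report it
            if (time > 0) {
                total += time;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        long uptime = ManagementFactory.getRuntimeMXBean().getUptime();
        System.out.printf("GC throughput: %.2f%%%n",
                throughputPercent(uptime, totalGcMillis()));
    }
}
```

For example, 50 ms of GC time over 1,000 ms of uptime gives a throughput of 95%. A macrometric such as CPU utilization can look healthy while this number is steadily falling.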

Taste of pudding is in eating

There is an old English proverb: “Taste of the pudding is in the eating”. Similarly, below are a few scenarios where yCrash has complemented existing monitoring solutions to identify the root cause of the problem.

AppDynamics + yCrash

Recently, a major telecom company had an outage due to a memory leak in its application. Their APM solution generated alerts stating that memory was spiking. The telecom company used yCrash to diagnose the problem. yCrash reported the exact objects that were causing the memory leak. It turned out that the leak was caused by the AppDynamics agent itself, which was running within the application. The screenshot below shows yCrash reporting the AppDynamics agent as the cause of the memory leak.

Fig: Screenshot showing yCrash reporting the AppDynamics agent causing the memory leak.

AWS Cloudwatch + yCrash

The online DevOps application GCeasy experienced an outage on Oct 11, 2021. When customers uploaded their garbage collection logs for analysis, the application returned HTTP 504 errors. The HTTP 504 status code indicates that transactions are timing out. AWS CloudWatch identified and reported that CPU utilization and the DB connection count were spiking. However, yCrash reported the exact line of Java code that was causing the degradation in response time. More details about this root cause analysis can be found here.
