Micro-Metrics Every Performance Engineer Should Validate Before Sign-off

Before a new release goes to production, we performance engineers run stress tests on the application to certify the release. Despite our sincere efforts, performance problems sometimes still surface in the production environment. These problems slip through the cracks in our labs for a few valid reasons: production and performance environments that are not identical, use of synthetic data, lack of long-running endurance tests, absence of real-world chaos, and a disconnect between development and performance engineers. However, if we certify the release based not only on Macro-Metrics but also on Micro-Metrics, we can stop this slippage. In this post, let’s discuss what Micro-Metrics are, why they are important, and how to source them.

Video

Watch this video to

  • Learn the 9 essential micro-metrics that influence JVM application performance under the hood.
  • Discover how GC patterns, object creation rates, and thread transitions can signal instability.
  • Understand why macro-metrics aren’t enough to guarantee production readiness.
  • Identify blind spots in traditional performance testing—and how to eliminate them.
  • Gain actionable strategies to improve sign-off quality and reduce performance regressions post-release.

What is the difference between Macro-Metrics & Micro-Metrics?

When you go to a hospital, doctors check your vital signs: heart rate, body temperature, and blood pressure. These give a quick sense of whether things are generally okay. These are Macro-Metrics.

Micro-Metrics, on the other hand, are like lab tests: blood work, MRI scans, cholesterol levels, liver enzyme counts. These don’t always show obvious symptoms upfront, but they reveal deeper issues that can cause major health problems if left unchecked.

Today in our performance labs we monitor Macro-Metrics, and they might look fine. But if you’re not tracking the underlying Micro-Metrics, you could miss early signs of instability. These hidden symptoms often lead to performance incidents once the release hits production. By examining Micro-Metrics closely before sign-off, you increase your chances of catching the silent degradations that otherwise don’t show up until it’s too late.

What are Macro-Metrics & Micro-Metrics?

The tables below show the primary Macro-Metrics that we monitor today and the Micro-Metrics that we need to monitor in performance labs:

Macro-Metric | Description
Response time | Measures how long it takes for core transactions to complete.
CPU utilization | Percentage of CPU resources used by the application during load.
Memory utilization | Total memory consumed by the application during execution.

Micro-Metric | What Problems It Helps Catch
GC Behavior Pattern | Detects memory leaks, poor GC configuration, or excessive object promotion causing GC pauses.
Object Creation Rate | Identifies allocation surges that can trigger frequent GCs or memory pressure.
GC Throughput | Highlights apps spending too much time in GC instead of work; can lead to CPU spikes or slowdowns.
GC Pause Time | Surfaces stop-the-world GC events affecting responsiveness or causing thread backlogs.
Thread Patterns | Flags CPU spikes, thread starvation, bursty load, and thread buildup from backend slowness.
Thread States | Detects BLOCKED, DEADLOCKED, or WAITING threads due to DB chattiness, config limits, or locking.
Thread Pool Behavior | Identifies thread exhaustion, request rejections, or poor pooling thresholds in backend services.
TCP/IP Connection Count & States | Catches backend connection leaks, TIME_WAIT surges, or slow/unresponsive downstream services.
Error Trends in Application Logs | Detects hidden runtime errors, JDBC leaks, logging misconfigurations, or disk issues.
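To make the GC Throughput row a little more concrete, here is a minimal sketch of how the number could be approximated from inside a JVM using the standard GarbageCollectorMXBean. This is not how yCrash computes it; the class and method names (GcThroughputProbe, throughputPercent) are hypothetical. Also note that getCollectionTime() is an approximate accumulated figure, and for concurrent collectors it may include time that did not pause the application.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper: approximates GC throughput for the current JVM.
final class GcThroughputProbe {

    /** Throughput = share of elapsed time NOT spent in GC, as a percentage. */
    static double throughputPercent(long gcMillis, long uptimeMillis) {
        if (uptimeMillis <= 0) {
            return 100.0; // nothing measured yet
        }
        return 100.0 * (uptimeMillis - gcMillis) / uptimeMillis;
    }

    public static void main(String[] args) {
        long gcMillis = 0;
        long gcCount = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Both getters return -1 when the collector does not report the value.
            gcMillis += Math.max(0, gc.getCollectionTime());
            gcCount += Math.max(0, gc.getCollectionCount());
        }
        long uptimeMillis = ManagementFactory.getRuntimeMXBean().getUptime();
        System.out.printf("collections=%d, gcTime=%d ms, throughput=%.2f%%%n",
                gcCount, gcMillis, throughputPercent(gcMillis, uptimeMillis));
    }
}
```

A JVM that spent 50 ms in GC during 1,000 ms of uptime would report 95% throughput under this definition. Comparing the figure between the previous release and the release under test is usually more telling than the absolute value.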

Why Certify Based On Micro-Metrics?

Micro-Metrics should be validated for two primary reasons:

1. Unearth Hidden Problems

2. Find Solutions to Complex Problems

Let’s review them in this section.

1. Unearth Hidden Problems

Micro-Metrics help unearth several silent degradations in the application. Let me highlight a few important ones:

a. Acute Memory Leak: Garbage Collection heavily influences the application’s performance. Studying Garbage Collection behavior, a key Micro-Metric, helps you forecast memory-related bottlenecks.

Fig: Garbage Collection Behavior of a Healthy Application

The above graph shows the GC behavior of a healthy application. You can see a beautiful saw-tooth GC pattern. When the heap usage reaches ~5.8GB, a ‘Full GC’ event (red triangle) gets triggered. When the ‘Full GC’ event runs, memory utilization drops all the way to the bottom, i.e., ~200MB. Please see the dotted black line in the graph, which connects all the bottom points. This dotted black line runs flat, at 0°. It indicates that the application is in a healthy state and not suffering from any sort of memory problems.

Fig: Application Suffering from Acute Memory Leak in Performance Lab

Above is the garbage collection behavior of an application that is suffering from an acute memory leak. When an application suffers from this pattern, heap usage climbs slowly, eventually resulting in an OutOfMemoryError.

In the above figure, you can notice that the ‘Full GC’ (red triangle) event gets triggered when heap usage reaches ~8GB. You can also observe that the amount of heap that Full GC events can recover declines over time, i.e.:

  • When the first Full GC event ran, heap usage dropped to 3.9GB 
  • When the second Full GC event ran, heap usage dropped only to 4.5GB
  • When the third Full GC event ran, heap usage dropped only to 5GB
  • When the final Full GC event ran, heap usage dropped only to 6.5GB

Please see the dotted black line in the graph, which connects all the bottom points. You can notice that the black line rises at about 15°. This indicates that the application is suffering from an acute memory leak. If this application runs for a prolonged period, it will experience OutOfMemoryError. However, in our performance labs, we don’t run the application for that long, so the leak goes unnoticed.
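The "dotted black line" check can also be done numerically. Below is a rough sketch that fits a least-squares line through the heap usage left after each Full GC and flags a rising floor. The sample values are the ones read off the two figures above; the class name HeapFloorTrend and the 0.1 GB-per-Full-GC threshold are assumptions made purely for illustration, and the x-axis here is the Full GC sequence number rather than wall-clock time.

```java
// Hypothetical helper: estimates how fast the post-Full-GC heap floor is rising.
final class HeapFloorTrend {

    // Assumed cut-off for illustration only; tune it for your own application.
    static final double LEAK_THRESHOLD_GB_PER_FULL_GC = 0.1;

    /** Least-squares slope of the post-Full-GC heap floor, in GB per Full GC event. */
    static double slopePerFullGc(double[] floorGb) {
        int n = floorGb.length;
        if (n < 2) {
            throw new IllegalArgumentException("need at least two Full GC samples");
        }
        double meanX = (n - 1) / 2.0; // x values are 0, 1, ..., n-1
        double meanY = 0;
        for (double y : floorGb) {
            meanY += y;
        }
        meanY /= n;
        double num = 0;
        double den = 0;
        for (int i = 0; i < n; i++) {
            num += (i - meanX) * (floorGb[i] - meanY);
            den += (i - meanX) * (i - meanX);
        }
        return num / den;
    }

    static boolean looksLikeLeak(double[] floorGb) {
        return slopePerFullGc(floorGb) > LEAK_THRESHOLD_GB_PER_FULL_GC;
    }

    public static void main(String[] args) {
        double[] healthy = {0.2, 0.2, 0.2, 0.2}; // ~200MB floor from the healthy graph
        double[] leaking = {3.9, 4.5, 5.0, 6.5}; // floors from the leaking graph
        System.out.printf("healthy slope=%.2f leak=%b%n", slopePerFullGc(healthy), looksLikeLeak(healthy));
        System.out.printf("leaking slope=%.2f leak=%b%n", slopePerFullGc(leaking), looksLikeLeak(leaking));
    }
}
```

For the leaking application the fitted slope works out to about 0.83 GB per Full GC, while the healthy application stays at zero. Even a short lab run with only a handful of Full GC events can expose the trend this way.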

When this application is released into production, you will see the below behavior:

Fig: Application Suffering from OutOfMemoryError in Production

In the above graph, towards the right side, you can notice that Full GC events are running continuously, yet heap usage doesn’t drop. It’s a clear indication that the application is suffering from a memory leak. By the time this happens, customers have already been impacted, and it’s too late to catch the problem.

Thus, observing GC behavior in the performance lab helps you catch OutOfMemoryError early in the game.

b. Connection Leak

Fig: Network Connection Count & State originating from the application

Modern enterprise applications communicate over various protocols (HTTP, HTTPS, JMS, Kafka, REST, SOAP, gRPC, WebSockets, FTP, SFTP, MQTT, AMQP, and proprietary TCP-based protocols) with a wide range of external systems: Databases, Payment Gateways, AI platforms, …

Let’s say that in the new release a developer accidentally introduces too many calls (i.e., chatty calls) to external systems or fails to close network connections properly. This has the potential to overwhelm the external systems and lead to outages, especially under peak production load. Thus, monitoring an important Micro-Metric, the ‘number of TCP/IP connections and their states’ (e.g., ESTABLISHED, TIME_WAIT, CLOSE_WAIT), will help us detect:

  • Connection leaks: Connections that are not properly closed.
  • Chattiness: An excessive number of small, frequent calls to external systems.
  • Idle Connections: Connections left open without active use, consuming resources.

These issues often surface only under production load over a period of time. Tracking the TCP/IP connection Micro-Metrics in performance labs can therefore expose the risks before they escalate into production outages.
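As a small illustration of tracking connection states, the sketch below tallies TCP states from text in the column layout that Linux `netstat -ant` prints (proto, recv-q, send-q, local address, foreign address, state). The class name TcpStateCounter and the sample addresses are made up for the example; capturing the netstat output itself is left to your test harness.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: counts TCP connections per state from netstat-style lines.
final class TcpStateCounter {

    static Map<String, Integer> countStates(List<String> netstatLines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : netstatLines) {
            String[] cols = line.trim().split("\\s+");
            // Assumed layout: proto recv-q send-q local foreign state
            if (cols.length < 6 || !cols[0].startsWith("tcp")) {
                continue; // skip headers, UDP rows, and anything unexpected
            }
            counts.merge(cols[5], 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample rows standing in for captured netstat output.
        List<String> sample = List.of(
                "Proto Recv-Q Send-Q Local Address Foreign Address State",
                "tcp 0 0 10.0.0.5:52344 10.0.0.9:5432 ESTABLISHED",
                "tcp 0 0 10.0.0.5:52346 10.0.0.9:5432 CLOSE_WAIT",
                "tcp 0 0 10.0.0.5:52348 10.0.0.9:5432 CLOSE_WAIT",
                "tcp 0 0 10.0.0.5:52350 10.0.0.7:443 TIME_WAIT");
        countStates(sample).forEach((state, n) -> System.out.println(state + " = " + n));
    }
}
```

Taking such a snapshot at intervals during the test and watching whether CLOSE_WAIT or ESTABLISHED counts keep growing is one simple way to spot a connection leak before sign-off.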

c. Thread Leak: Threads are like the circulatory system of the application. Any performance problem in the application is reflected in the behavior of its threads. Thus, studying the Micro-Metric ‘Thread behavior’ closely gives a good indication of performance problems brewing in the application.

Let’s say that in a new release a developer creates too many threads or forgets to terminate threads after use. Threads will then start to build up, and because each thread’s stack is allocated in the native memory region of the process, native memory consumption grows with them. Thus, examining the thread count and the thread states will give an early indication of the performance problem.
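For a quick in-process look at thread count and thread states, something like the following sketch can be used. It relies only on the standard Thread.getAllStackTraces() call; the class name ThreadStateSnapshot is hypothetical, and a thread dump analyzed offline gives a far richer picture than this.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical helper: counts the live threads of the current JVM by state.
final class ThreadStateSnapshot {

    static Map<Thread.State, Integer> countByState() {
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            counts.merge(t.getState(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<Thread.State, Integer> counts = countByState();
        int total = 0;
        for (int n : counts.values()) {
            total += n;
        }
        System.out.println("live threads = " + total);
        counts.forEach((state, n) -> System.out.println(state + " = " + n));
    }
}
```

If the total keeps climbing across snapshots taken under steady load, or if BLOCKED and WAITING counts dominate, that is a hint worth investigating before the release is certified.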

Here is a case study of an application suffering from thread leak and how thread analysis facilitated them to resolve the problem.

2. Find Solutions to Complex Problems

Studying Micro-Metrics not only helps us forecast production outages, but also helps us find solutions to complex problems. In this section, let me highlight a couple of scenarios where Micro-Metrics help you discover solutions to such problems.

a. Lines of code causing CPU Spike: Today’s APM tools report that the CPU consumption of the overall process has spiked. But it would be more helpful if a tool could report, in a non-intrusive manner, the exact lines of source code that are causing the CPU to spike. The yCrash tool, which reports these Micro-Metrics, points out the exact line of code that is causing the CPU to spike. Refer to this post, if you want to learn how yCrash diagnoses CPU spikes non-intrusively.
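To give a feel for the idea of tying CPU consumption to a line of code, here is a simplified sketch built on the standard ThreadMXBean. It is not yCrash's mechanism, and HotThreadProbe and its members are hypothetical names. It pairs each thread's cumulative CPU time with the frame the thread is currently executing; a real diagnosis would sample twice and compare the difference, since cumulative time alone does not show which thread is hot right now.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: lists the threads that have consumed the most CPU so far.
final class HotThreadProbe {

    /** One observation: a thread, its cumulative CPU time, and its current top frame. */
    static final class Sample {
        final String threadName;
        final long cpuNanos;
        final String topFrame;

        Sample(String threadName, long cpuNanos, String topFrame) {
            this.threadName = threadName;
            this.cpuNanos = cpuNanos;
            this.topFrame = topFrame;
        }
    }

    static List<Sample> topCpuThreads(int limit) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        List<Sample> samples = new ArrayList<>();
        if (!mx.isThreadCpuTimeSupported() || !mx.isThreadCpuTimeEnabled()) {
            return samples; // this JVM does not expose per-thread CPU time
        }
        for (long id : mx.getAllThreadIds()) {
            long cpuNanos = mx.getThreadCpuTime(id);   // -1 if the thread is no longer alive
            ThreadInfo info = mx.getThreadInfo(id, 1); // depth 1: only the top frame
            if (cpuNanos < 0 || info == null) {
                continue;
            }
            StackTraceElement[] stack = info.getStackTrace();
            String top = stack.length > 0 ? stack[0].toString() : "(no Java frame)";
            samples.add(new Sample(info.getThreadName(), cpuNanos, top));
        }
        // Highest CPU consumers first.
        samples.sort((a, b) -> Long.compare(b.cpuNanos, a.cpuNanos));
        return samples.subList(0, Math.min(limit, samples.size()));
    }

    public static void main(String[] args) {
        for (Sample s : topCpuThreads(5)) {
            System.out.printf("%-30s %8d ms  %s%n", s.threadName, s.cpuNanos / 1_000_000, s.topFrame);
        }
    }
}
```

Because this runs inside the JVM being observed, it is somewhat intrusive; it is meant only to show how a thread's CPU time and its stack frame can be correlated.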

b. Where threads are getting BLOCKED: Modern APM tools report that response time has degraded. However, when you use the yCrash tool, it captures the Micro-Metric ‘Thread behavior’, analyzes each thread’s stack trace, and presents a transitive dependency graph showing which threads are blocking which other threads and the exact lines of code causing the problem, as shown in the below figure. If you want to learn about a real case study, check out this blog on Java UUID generation and its performance impact.

Fig: yCrash tool reporting Blocked Threads as Transitive Dependency Graph
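The raw material for such a graph is available from the JVM itself. The sketch below, which is only an approximation of the idea and not yCrash's implementation, uses the standard ThreadMXBean to list "blocked thread -> lock owner" edges along with the frame where the blocked thread is stuck. BlockedThreadEdges is a hypothetical name, and the main method fabricates a blocked thread purely to have something to print.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: reports which thread is blocking which other thread.
final class BlockedThreadEdges {

    static List<String> blockedEdges() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        List<String> edges = new ArrayList<>();
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            if (info == null || info.getThreadState() != Thread.State.BLOCKED) {
                continue;
            }
            if (info.getLockOwnerId() < 0) {
                continue; // owner already released the lock or is unknown
            }
            StackTraceElement[] stack = info.getStackTrace();
            String at = stack.length > 0 ? stack[0].toString() : "(unknown frame)";
            edges.add(info.getThreadName() + " -> " + info.getLockOwnerName() + " at " + at);
        }
        return edges;
    }

    public static void main(String[] args) {
        final Object lock = new Object();
        Thread waiter = new Thread(() -> {
            synchronized (lock) {
                // enters only after main releases the lock
            }
        }, "demo-waiter");
        waiter.setDaemon(true);
        synchronized (lock) {
            waiter.start();
            // Spin until the waiter is stuck on the monitor this thread holds.
            while (waiter.getState() != Thread.State.BLOCKED) {
                Thread.yield();
            }
            blockedEdges().forEach(System.out::println);
        }
    }
}
```

Following these edges from owner to owner is what turns a flat list of BLOCKED threads into a transitive dependency chain that points to the thread, and the line of code, at the root of the pile-up.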

How to Source these Micro-Metrics?

All 9 of these critical Micro-Metrics can be captured with one tool: yCrash. You can follow the instructions given here to capture them. In fact, all the Micro-Metrics screenshots used in this post are excerpts from the yCrash incident report, which contains several metrics. The table below summarizes where to find each Micro-Metric in the yCrash incident report:

Micro-Metric | Source
0. Problems Detected | Incident Report > Summary > All the reported Problems
1. GC Behavior Pattern | Incident Report > Garbage Collection > Heap Usage Graph
2. Object Creation Rate | Incident Report > Garbage Collection > Object Stats
3. GC Throughput | Incident Report > Garbage Collection > Key Performance Indicators
4. GC Pause Time | Incident Report > Garbage Collection > Key Performance Indicators
5. Thread Patterns | Incident Report > Threads > Flame Graph
6. Thread States | Incident Report > Threads > Thread Count Summary
7. Thread Pool Behavior | Incident Report > Threads > Thread Pools
8. TCP/IP Connection Count & States | Incident Report > Network > Host Connections
9. Error Trends in Application Logs | Incident Report > App Logs > Errors

For optimal results, we strongly recommend following the best practices to capture Micro-Metrics.

Best Wishes To Become Top 1% of Engineers

Try having a conversation about the Micro-Metrics that we have discussed in this post with the very senior engineers and architects of your organization. You will start to see a question mark flashing on their faces. It’s because Micro-Metrics are quite deep knowledge, held only by the top 1% of engineers. When you report these metrics and articulate them effectively, you are not only going to catch production problems in the performance labs but also going to look sharp amongst your peers and upper management. This has the potential to accelerate your career growth. Best wishes to become that top 1% of engineers. Good luck!!
