Grafana Series - Unified Display - 12 - RED Method Dashboard

This article was last updated on May 17, 2026.


Overview

Currently, there are three mainstream methods for monitoring metrics:

  • RED: Rate, Errors, Duration — introduced by @tom_wilkie
  • USE: Utilization, Saturation, and Errors — introduced by @brendangregg
  • Four Golden Signals: Latency (similar to Duration), Traffic (how much demand is placed on your system, similar to Rate), Errors, Saturation. Essentially RED + Saturation.

It is recommended to use both the RED and USE Methods together, where:

  • The RED Method focuses on your users and how happy they are
  • The USE Method focuses on your machines and how happy they are
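To make the three RED signals concrete, here is a minimal Python sketch that derives them from a list of request records. The `Request` record, the `red_summary` helper, and the choice to count only 5xx responses as errors are assumptions made for illustration, not part of any library.

```python
from dataclasses import dataclass


@dataclass
class Request:
    timestamp: float   # seconds since epoch
    status_code: int   # HTTP status code
    duration: float    # request duration in seconds


def red_summary(requests, window_seconds):
    """Return (rate, error_rate, mean_duration) for one time window.

    rate and error_rate are per second; mean_duration is in seconds.
    Assumption: only 5xx responses count as errors.
    """
    total = len(requests)
    errors = sum(1 for r in requests if 500 <= r.status_code <= 599)
    rate = total / window_seconds
    error_rate = errors / window_seconds
    mean_duration = sum(r.duration for r in requests) / total if total else 0.0
    return rate, error_rate, mean_duration


sample = [
    Request(0.0, 200, 0.1),
    Request(1.0, 500, 0.3),
    Request(2.0, 200, 0.2),
    Request(3.0, 503, 0.4),
]
rate, error_rate, mean_duration = red_summary(sample, window_seconds=60)
print(rate, error_rate, mean_duration)
```

In a real setup these numbers come from the monitoring backend rather than from application code; the sketch only shows what each signal measures.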

Typical RED Method Monitoring Metrics

If implemented via Prometheus monitoring, typical metric examples are as follows:

Rate:

promql
sum(rate(request_duration_seconds_count{job="…"}[1m]))

Errors:

promql
sum(rate(request_duration_seconds_count{job="…", status_code!~"2.."}[1m]))

Duration:

promql
histogram_quantile(0.99, sum(rate(request_duration_seconds_bucket{job="…"}[1m])) by (le))
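The Duration query estimates a percentile from cumulative histogram buckets. As a rough illustration of that idea, the following Python sketch linearly interpolates inside the bucket that contains the requested rank. It is a simplified model rather than Prometheus's actual implementation, and the bucket boundaries and counts are made up for the example.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile (0 <= q <= 1) from cumulative buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by upper_bound,
    with at least two entries and math.inf as the last upper bound.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The rank falls in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0]
            if count == prev_count:
                # Empty bucket (only reachable when rank is 0): nothing to interpolate.
                return bound
            # Linear interpolation between the bucket's lower and upper bounds.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical buckets: upper bound in seconds, cumulative request count.
example_buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
print(histogram_quantile(0.75, example_buckets))
```

Because the estimate is interpolated, its accuracy depends on how closely the bucket boundaries bracket the latencies you care about.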

For Duration, it is recommended to use the 50th/90th/99th percentiles, as these more accurately reflect what users truly care about. You can also combine them with Average Duration as a reference.
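A small worked example shows why percentiles say more than the average. The sketch below uses a nearest-rank percentile and an invented latency sample in which 2 out of 100 requests are very slow; both the helper and the data are illustrative assumptions.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Invented sample: 98 fast requests (100 ms) and 2 slow ones (6000 ms).
latencies_ms = [100] * 98 + [6000] * 2

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50 = percentile(latencies_ms, 50)
p90 = percentile(latencies_ms, 90)
p99 = percentile(latencies_ms, 99)
print(mean_ms, p50, p90, p99)
```

Here the average is 218 ms, which looks healthy, while the p99 of 6000 ms exposes the slow requests that some users actually experienced.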

A typical dashboard example looks like this:

[Figure: RED Method Prometheus Grafana Dashboard]

Hands-On - RED Method Dashboard Based on ES Access Logs

This was a workaround born of necessity: the developers had not implemented request metrics with the Prometheus client and only wrote access logs, which were shipped to ElasticSearch. We therefore used ES to aggregate the logs (keyword fields and Terms aggregations) to achieve a similar effect. It works, but in practice we found that ES-based monitoring performs significantly worse than the Prometheus-based approach.

The result looks like this:

[Figure: RED Method ES Grafana Dashboard]

Rate (with ES we could only achieve requests per minute):

Taking 2xx as an example, Query:

lucene
request_path.keyword:(-"/actuator/health" -"/metrics" -"info" -"Eureka") AND status_code:[200 TO 299]

As shown above, this excludes common health check and monitoring URLs in microservice scenarios.

The Metric configuration is shown below:

[Figure: ES Rate Query - part]

[Figure: ES Rate Query]
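For readers who want to see what such a panel amounts to on the Elasticsearch side, here is a hedged Python sketch that builds a search body combining the Lucene query with a per-minute date histogram. The time field `@timestamp`, the aggregation name `per_interval`, and the helper itself are assumptions; the request Grafana actually generates may differ, and `fixed_interval` assumes a reasonably recent Elasticsearch version.

```python
import json

# Paths excluded from the rate, taken from the query in this article.
EXCLUDED = '-"/actuator/health" -"/metrics" -"info" -"Eureka"'


def rate_search_body(status_range, interval="1m", time_field="@timestamp"):
    """Build a search body that counts matching log lines per time bucket."""
    low, high = status_range
    lucene = f"request_path.keyword:({EXCLUDED}) AND status_code:[{low} TO {high}]"
    return {
        "size": 0,  # only the aggregation buckets are needed, not the hits
        "query": {"query_string": {"query": lucene}},
        "aggs": {
            "per_interval": {
                "date_histogram": {"field": time_field, "fixed_interval": interval}
            }
        },
    }


print(json.dumps(rate_search_body((200, 299)), indent=2))
```

Each bucket's document count is then the number of requests in that minute, which is what the Count metric plots.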

Errors:

Simply filter on 5xx status codes:

lucene
request_path.keyword:(-"/actuator/health" -"/metrics" -"info" -"Eureka") AND status_code:[500 TO 599]

Duration:

The Query no longer differentiates by status code:

lucene
origin_path.keyword:(-"/actuator/health" -"/metrics" -"info" -"Eureka")

Percentiles configuration:

[Figure: Percentiles ES Query]

  • Metric: Percentiles
  • Field: select latency (in my case, latency records the Duration)
  • Values: select 50,99 to calculate the 50th and 99th percentiles

Average configuration:

[Figure: Average Duration ES Query]

  • Metric: Average
  • Field: select latency
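The Percentiles and Average metrics above map onto Elasticsearch's `percentiles` and `avg` aggregations. The sketch below builds both for the `latency` field; the aggregation names `latency_percentiles` and `latency_average` and the helper function are hypothetical, chosen only for this example.

```python
import json


def duration_aggs(field="latency", percents=(50, 99)):
    """Build percentile and average aggregations over a numeric latency field."""
    return {
        "latency_percentiles": {
            "percentiles": {"field": field, "percents": list(percents)}
        },
        "latency_average": {"avg": {"field": field}},
    }


print(json.dumps(duration_aggs(), indent=2))
```

In a time-series panel these would typically be nested inside a date histogram so that each time bucket gets its own p50, p99, and average.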

The final result looks like this:

[Figure: RED Method ES Grafana Dashboard]

Using the RED Method

To make it easier to use the RED Method for drill-down investigation, a simple association was set up using Grafana Dashboard Links for convenient navigation.

📝 Note: Grafana Dashboard Links will be covered in detail in a future article. 😜😜😜

Let’s do a hands-on analysis based on the dashboard above.

Over the past 7 days, the dashboard shows a peak Error count of 8 and a slowest p99 duration of 6.29s (above the 5s threshold). We want to find the specific logs behind those errors and slow requests.

Analyzing Errors

For Errors, click the 5xx legend on the Rate & Errors Panel, which displays the following:

[Figure: Errors]

Select a time range in the panel to zoom in, as shown below:

[Figure: Errors panel zoom in]

We can see the errors occurred around 2023-05-11 15:58. Clicking the Dashboard Link in the upper right corner navigates to the ElasticSearch Quick Search Dashboard (the same time range is carried over during the jump).
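As a rough idea of how such a link can carry the time range over, the Python sketch below builds an entry for the `links` array of a dashboard JSON model with `keepTime` enabled. The title and the tag `elasticsearch` are placeholders, and this is only an assumed configuration, not the one used for the dashboards in this article.

```python
import json


def dashboard_link(title, tags):
    """Build one dashboard-link entry for a dashboard's `links` array."""
    return {
        "type": "dashboards",   # link to dashboards selected by tag
        "title": title,
        "tags": list(tags),
        "asDropdown": False,
        "keepTime": True,       # carry the current time range to the target
        "includeVars": True,    # carry the current template variable values
        "targetBlank": False,
    }


link = dashboard_link("ElasticSearch Quick Search", ["elasticsearch"])
print(json.dumps({"links": [link]}, indent=2))
```

With `keepTime` set, the zoomed-in range selected on the RED dashboard is still in effect when the target dashboard opens.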

Since we are analyzing error logs, we can select 500 TO 599 in the status_code variable to directly find the relevant logs, as shown below:

[Figure: Error Logs]

You can further pinpoint the issue via Logs to Trace.

Analyzing Slow Requests

The approach is similar to analyzing error requests:

  1. Zoom in on the time range in the Durations panel — the approximate time is around 2023-05-09 14:52
  2. Navigate to the ElasticSearch Quick Search Dashboard via Dashboard Links
  3. Add an ad hoc filter: latency>5000 to find the specific log:

[Figure: Slow Log]
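Both drill-downs come down to narrowing the log search with one extra condition. The following Python sketch composes the equivalent Lucene query strings for the two cases; the `drilldown_query` helper is hypothetical, and the field names `status_code` and `latency` are the ones used earlier in this article.

```python
def drilldown_query(status_range=None, min_latency=None):
    """Compose a Lucene query string for the error and slow-request drill-downs."""
    clauses = []
    if status_range is not None:
        low, high = status_range
        clauses.append(f"status_code:[{low} TO {high}]")
    if min_latency is not None:
        clauses.append(f"latency:>{min_latency}")
    # With no conditions, fall back to matching every document.
    return " AND ".join(clauses) if clauses else "*"


print(drilldown_query(status_range=(500, 599)))  # error logs
print(drilldown_query(min_latency=5000))         # slow requests
```

Combined with the time range carried over by the Dashboard Link, either query usually narrows the result to a handful of log lines.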

🎉🎉🎉

Summary

This article introduced three commonly used monitoring methods:

  • USE Method (machine-oriented)
  • RED Method (user-oriented)
  • Google SRE Four Golden Signals (RED + Saturation)

With a focus on the RED Method, it covered:

  • Metric collection and visualization with Prometheus + Grafana
  • Metric collection and visualization with ElasticSearch + Grafana

It also walked through practical analysis use cases based on the RED Method.

Hope this helps! 😄😄😄
