Configuring SLO Monitoring and Alerting with Prometheus

This article was last updated on: May 17, 2026

Overview

Prometheus is the de facto standard for cloud-native and container platform monitoring. This article looks at how to configure SLO monitoring and alerting with it.

SLO Alerting

For SLO alerting, based on Google SRE best practices, the following alert dimensions are recommended:

  1. Burn Rate Alerts
  2. Error Budget Alerts

Error Budget

Suppose our contract with users specifies 99.9% availability over a 7-day window. This translates to an error budget of about 10 minutes (10.08 minutes, to be precise).
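
The arithmetic behind that figure can be sketched as follows. This is only an illustration: the function name `error_budget_minutes` is made up for this example and is not part of Prometheus.

```python
def error_budget_minutes(slo: float, window_days: float) -> float:
    """Return the allowed unavailability, in minutes, for an SLO over a window."""
    window_minutes = window_days * 24 * 60
    return (1 - slo) * window_minutes


# 99.9% over 7 days: 10080 minutes * 0.001
print(round(error_budget_minutes(0.999, 7), 2))   # 10.08
# 99.9% over 30 days: 43200 minutes * 0.001
print(round(error_budget_minutes(0.999, 30), 2))  # 43.2
```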

A reference implementation for error budget alerting:

  1. Calculate the error budget consumed over the past 7 days (or longer such as 30 days, or shorter such as 3 days)
  2. Alert levels:
    1. CRITICAL: error budget consumed >= 90% (or 100%) (i.e., about 9.07 minutes of unavailability in the past 7 days; availability has dropped to 99.91%, approaching the 99.9% danger threshold)
    2. WARNING: error budget consumed >= 75%
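
The two steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the helper names `budget_consumed` and `alert_level` are hypothetical, and the thresholds are simply the 90% and 75% levels from the list.

```python
def budget_consumed(bad_minutes: float, slo: float, window_days: float) -> float:
    """Fraction of the error budget used (1.0 means fully exhausted)."""
    budget = (1 - slo) * window_days * 24 * 60
    return bad_minutes / budget


def alert_level(consumed: float) -> str:
    """Map a consumed fraction to the alert levels listed above."""
    if consumed >= 0.90:
        return "CRITICAL"
    if consumed >= 0.75:
        return "WARNING"
    return "OK"


# 8 bad minutes against a 10.08-minute budget is roughly 79% consumed.
print(alert_level(budget_consumed(8.0, 0.999, 7)))  # WARNING
# 9.5 bad minutes is roughly 94% consumed.
print(alert_level(budget_consumed(9.5, 0.999, 7)))  # CRITICAL
```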

│ 📝Notes:

│ Key Words:

│ - SLO
│ - Time window
│ - Threshold

Burn Rate

Suppose our contract with users specifies 99.9% availability over a 30-day window. This translates to a 43-minute error budget. If we consume those 43 minutes through small incremental failures, our users may still be happy and productive. But what if we have a single 43-minute outage during critical business hours? It’s safe to say our users would be very unhappy with that experience!

To address this, Google SRE introduced the concept of Burn Rate. The definition is simple: if we consume exactly 43 minutes over 30 days in our example, that’s a burn rate of 1. If we consume it at twice the speed — for example, exhausting the budget in 15 days — the burn rate is 2, and so on. As you can see, this allows us to track long-term compliance while alerting on severe short-term issues.
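
Read as a formula, the burn rate is the share of the budget consumed divided by the share of the window that has elapsed. A minimal sketch, where the function name `burn_rate` and its parameters are chosen for this example only:

```python
def burn_rate(budget_fraction_consumed: float, elapsed_days: float,
              window_days: float = 30) -> float:
    """Budget consumption speed relative to the pace that would use
    exactly 100% of the budget over the whole window."""
    return budget_fraction_consumed / (elapsed_days / window_days)


# The whole budget used over the full 30 days: burn rate 1.
print(burn_rate(1.0, 30))  # 1.0
# The whole budget used in 15 days: burn rate 2.
print(burn_rate(1.0, 15))  # 2.0
```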

The diagram below illustrates the concept of multiple burn rates. The X-axis represents time, and the Y-axis represents the remaining error budget.

Figure: SLO Burn Rate

│ 📝Notes:

│ Essentially, an alert for error budget >= 100% is just a special case of burn rate = 1.

A reference implementation for burn rate alerting:

  1. Calculate the burn rate over the past 1 hour (or shorter windows like 5m, or longer windows like 3h–6h…)
  2. Alert levels:
    1. CRITICAL: burn rate >= 14.4 (at this rate, the 30-day availability error budget will be exhausted within 2 days)
    2. WARNING: burn rate >= 7.2 (at this rate, the 30-day availability error budget will be exhausted within 4 days)
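
The "exhausted within N days" figures follow directly from the burn rate. A small sketch of that arithmetic, assuming a 30-day window; the function names `days_to_exhaustion` and `budget_used_in` are illustrative only:

```python
def days_to_exhaustion(rate: float, window_days: float = 30) -> float:
    """Days until the full error budget is gone at a constant burn rate."""
    return window_days / rate


def budget_used_in(hours: float, rate: float, window_days: float = 30) -> float:
    """Fraction of the budget consumed after `hours` at a constant burn rate."""
    return rate * hours / (window_days * 24)


print(round(days_to_exhaustion(14.4), 2))  # 2.08
print(round(days_to_exhaustion(7.2), 2))   # 4.17
# One hour at burn rate 14.4 uses 2% of a 30-day budget.
print(round(budget_used_in(1, 14.4), 4))   # 0.02
```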

Configuring SLO Monitoring and Alerting with Prometheus in Practice

Here we use 2 typical SLOs as examples:

  1. HTTP request success rate of at least 99.9% (i.e., an error rate of at most 0.1%, or about 43min 12s of unavailability in a 30-day window)
  2. 99% of HTTP requests complete within 100ms

HTTP Request Error Rate

Basic information:

  1. Metric: http_requests_total
  2. Label: {job="busi"}
  3. Error definition: HTTP status code 5xx, i.e., code=~"5.."

The complete Prometheus Rule is as follows:

groups:
  - name: SLOs-http_requests_total
    rules:
      # Error rate of HTTP requests over the past 5m
      - expr: |
          sum(rate(http_requests_total{job="busi",code=~"5.."}[5m]))
          /
          sum(rate(http_requests_total{job="busi"}[5m]))
        labels:
          job: busi
        record: http_requests_total:burnrate5m
      # Over the past 30m
      - expr: |
          sum(rate(http_requests_total{job="busi",code=~"5.."}[30m]))
          /
          sum(rate(http_requests_total{job="busi"}[30m]))
        labels:
          job: busi
        record: http_requests_total:burnrate30m
      # Over the past 1h
      - expr: |
          sum(rate(http_requests_total{job="busi",code=~"5.."}[1h]))
          /
          sum(rate(http_requests_total{job="busi"}[1h]))
        labels:
          job: busi
        record: http_requests_total:burnrate1h
      # Over the past 6h
      - expr: |
          sum(rate(http_requests_total{job="busi",code=~"5.."}[6h]))
          /
          sum(rate(http_requests_total{job="busi"}[6h]))
        labels:
          job: busi
        record: http_requests_total:burnrate6h
      # Over the past 1d
      - expr: |
          sum(rate(http_requests_total{job="busi",code=~"5.."}[1d]))
          /
          sum(rate(http_requests_total{job="busi"}[1d]))
        labels:
          job: busi
        record: http_requests_total:burnrate1d
      # Over the past 3d
      - expr: |
          sum(rate(http_requests_total{job="busi",code=~"5.."}[3d]))
          /
          sum(rate(http_requests_total{job="busi"}[3d]))
        labels:
          job: busi
        record: http_requests_total:burnrate3d
      # 🐾 Rapid short-term burn
      # Burn rate over the past 5m and 1h both exceed 14.4
      - alert: ErrorBudgetBurn
        annotations:
          message: 'High error budget burn for job=busi (current value: {{ $value }})'
        expr: |
          sum(http_requests_total:burnrate5m{job="busi"}) > (14.40 * (1-0.99900))
          and
          sum(http_requests_total:burnrate1h{job="busi"}) > (14.40 * (1-0.99900))
        for: 2m
        labels:
          job: busi
          severity: critical
      # 🐾 Excessive mid-term burn
      # Burn rate over the past 30m and 6h both exceed 7.2
      - alert: ErrorBudgetBurn
        annotations:
          message: 'High error budget burn for job=busi (current value: {{ $value }})'
        expr: |
          sum(http_requests_total:burnrate30m{job="busi"}) > (7.20 * (1-0.99900))
          and
          sum(http_requests_total:burnrate6h{job="busi"}) > (7.20 * (1-0.99900))
        for: 15m
        labels:
          job: busi
          severity: warning
      # 🐾 Long-term error budget exceeded
      # Error budget over the past 6h and 3d has been exhausted
      - alert: ErrorBudgetAlert
        annotations:
          message: 'High error budget burn for job=busi (current value: {{ $value }})'
        expr: |
          sum(http_requests_total:burnrate6h{job="busi"}) > (1.00 * (1-0.99900))
          and
          sum(http_requests_total:burnrate3d{job="busi"}) > (1.00 * (1-0.99900))
        for: 3h
        labels:
          job: busi
          severity: warning
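
To make the thresholds in the alert expressions easier to read: a term such as 14.40 * (1-0.99900) converts a burn rate into the error ratio that corresponds to it, and an alert condition holds only when both the short and the long window exceed that ratio. The sketch below mirrors that logic in plain Python. The function names are hypothetical, and it does not model the `for:` duration that Prometheus additionally enforces.

```python
def burn_threshold(rate: float, slo: float) -> float:
    """Error ratio that corresponds to a burn rate, as in 14.40 * (1-0.99900)."""
    return rate * (1 - slo)


def multiwindow_fires(short_ratio: float, long_ratio: float,
                      rate: float, slo: float) -> bool:
    """True when both windows exceed the threshold (the `and` in the rule)."""
    threshold = burn_threshold(rate, slo)
    return short_ratio > threshold and long_ratio > threshold


print(round(burn_threshold(14.4, 0.999), 4))        # 0.0144
# 5% errors over 5m and 2% over 1h: both above 1.44%, so the condition holds.
print(multiwindow_fires(0.05, 0.02, 14.4, 0.999))   # True
# A short spike that the 1h window does not confirm: no alert.
print(multiwindow_fires(0.05, 0.001, 14.4, 0.999))  # False
```

Requiring both windows keeps a brief spike from paging anyone, while the short window lets the alert clear quickly once the error ratio recovers.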

HTTP Request Latency

Basic information:

  1. Metric: http_request_duration_seconds (a histogram)
  2. Label: {job="busi"}
  3. 99% of HTTP request response times should be less than or equal to 100ms
  4. Only successful (non-5xx) requests count toward the latency target (the error rate itself is already covered above)

The complete Prometheus Rule is as follows:

groups:
  - name: SLOs-http_request_duration_seconds
    rules:
      # Percentage of HTTP requests with response time exceeding 100ms (0.1s) over the past 5m
      - expr: |
          1 - (
            sum(rate(http_request_duration_seconds_bucket{job="busi",le="0.1",code!~"5.."}[5m]))
            /
            sum(rate(http_request_duration_seconds_count{job="busi"}[5m]))
          )
        labels:
          job: busi
          latency: "0.1"
        record: latencytarget:http_request_duration_seconds:rate5m
      # Over the past 30m
      - expr: |
          1 - (
            sum(rate(http_request_duration_seconds_bucket{job="busi",le="0.1",code!~"5.."}[30m]))
            /
            sum(rate(http_request_duration_seconds_count{job="busi"}[30m]))
          )
        labels:
          job: busi
          latency: "0.1"
        record: latencytarget:http_request_duration_seconds:rate30m
      # Over the past 1h
      - expr: |
          1 - (
            sum(rate(http_request_duration_seconds_bucket{job="busi",le="0.1",code!~"5.."}[1h]))
            /
            sum(rate(http_request_duration_seconds_count{job="busi"}[1h]))
          )
        labels:
          job: busi
          latency: "0.1"
        record: latencytarget:http_request_duration_seconds:rate1h
      # Over the past 2h
      - expr: |
          1 - (
            sum(rate(http_request_duration_seconds_bucket{job="busi",le="0.1",code!~"5.."}[2h]))
            /
            sum(rate(http_request_duration_seconds_count{job="busi"}[2h]))
          )
        labels:
          job: busi
          latency: "0.1"
        record: latencytarget:http_request_duration_seconds:rate2h
      # Over the past 6h
      - expr: |
          1 - (
            sum(rate(http_request_duration_seconds_bucket{job="busi",le="0.1",code!~"5.."}[6h]))
            /
            sum(rate(http_request_duration_seconds_count{job="busi"}[6h]))
          )
        labels:
          job: busi
          latency: "0.1"
        record: latencytarget:http_request_duration_seconds:rate6h
      # Over the past 1d
      - expr: |
          1 - (
            sum(rate(http_request_duration_seconds_bucket{job="busi",le="0.1",code!~"5.."}[1d]))
            /
            sum(rate(http_request_duration_seconds_count{job="busi"}[1d]))
          )
        labels:
          job: busi
          latency: "0.1"
        record: latencytarget:http_request_duration_seconds:rate1d
      # Over the past 3d
      - expr: |
          1 - (
            sum(rate(http_request_duration_seconds_bucket{job="busi",le="0.1",code!~"5.."}[3d]))
            /
            sum(rate(http_request_duration_seconds_count{job="busi"}[3d]))
          )
        labels:
          job: busi
          latency: "0.1"
        record: latencytarget:http_request_duration_seconds:rate3d
      # 🐾 HTTP response time SLO short/mid-term rapid burn
      # - Past 5m and 1h burn rate exceeds 14.4
      # - Or: past 30m and 6h burn rate exceeds 7.2
      - alert: LatencyBudgetBurn
        annotations:
          message: 'High requests latency budget burn for job=busi,latency=0.1 (current value: {{ $value }})'
        expr: |
          (
            latencytarget:http_request_duration_seconds:rate1h{job="busi",latency="0.1"} > (14.4*(1-0.99))
            and
            latencytarget:http_request_duration_seconds:rate5m{job="busi",latency="0.1"} > (14.4*(1-0.99))
          )
          or
          (
            latencytarget:http_request_duration_seconds:rate6h{job="busi",latency="0.1"} > (7.2*(1-0.99))
            and
            latencytarget:http_request_duration_seconds:rate30m{job="busi",latency="0.1"} > (7.2*(1-0.99))
          )
        labels:
          job: busi
          latency: "0.1"
          severity: critical
      - alert: LatencyBudgetBurn
        annotations:
          message: 'High requests latency budget burn for job=busi,latency=0.1 (current value: {{ $value }})'
        expr: |
          (
            latencytarget:http_request_duration_seconds:rate1d{job="busi",latency="0.1"} > (3*(1-0.99))
            and
            latencytarget:http_request_duration_seconds:rate2h{job="busi",latency="0.1"} > (3*(1-0.99))
          )
          or
          (
            latencytarget:http_request_duration_seconds:rate3d{job="busi",latency="0.1"} > ((1-0.99))
            and
            latencytarget:http_request_duration_seconds:rate6h{job="busi",latency="0.1"} > ((1-0.99))
          )
        labels:
          job: busi
          latency: "0.1"
          severity: warning
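
The latency rule is easier to digest with numbers plugged in. Each recording rule computes 1 minus the ratio of requests that finished within 100ms (the le="0.1" bucket, excluding 5xx) to all requests, and the alert compares that value with a burn rate multiplied by the 1% latency budget. A rough sketch with made-up request rates; `slow_ratio` is a hypothetical helper, not a Prometheus function:

```python
def slow_ratio(fast_ok_rate: float, total_rate: float) -> float:
    """1 - (requests finishing within 100ms, excluding 5xx) / (all requests)."""
    return 1 - fast_ok_rate / total_rate


# Assumed sample: 80 of 100 requests per second are fast and successful.
ratio = slow_ratio(fast_ok_rate=80.0, total_rate=100.0)
# Same form as 14.4*(1-0.99) in the critical alert expression.
critical_threshold = 14.4 * (1 - 0.99)

print(round(ratio, 3))               # 0.2
print(round(critical_threshold, 3))  # 0.144
print(ratio > critical_threshold)    # True
```

Because the denominator in the rule counts all requests while the numerator excludes 5xx responses, a failed request lowers the ratio of good requests in the same way a slow one does.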

🎉🎉🎉

Summary

Prometheus is the de facto standard for cloud-native and container platform monitoring, and this article covered how to configure SLO monitoring and alerting with it.

We demonstrated 2 typical SLOs — HTTP response time and error rate. The error rate SLO is fairly straightforward, while the response time SLO can be a bit tricky and may take some time to digest.

😼😼😼

Configuring SLO Monitoring and Alerting with Prometheus
https://e-whisper.com/posts/51959/
Author: east4ming
Posted on: October 14, 2022
Licensed under