Configuring SLO Monitoring and Alerting with Prometheus
This article was last updated on: May 17, 2026
Overview
Prometheus is the de facto standard for cloud-native and container platform monitoring. Let’s look at how to configure SLO monitoring and alerting with it.
For SLO alerting, based on Google SRE best practices, the following alert dimensions are recommended:
Burn Rate Alerts
Error Budget Alerts
Error Budget
Suppose our contract with users specifies 99.9% availability over a 7-day window. This translates to an error budget of roughly 10 minutes (0.1% of 10,080 minutes, or 10.08 minutes).
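The arithmetic behind that figure, written out as a worked equation. The symbols B (error budget), S (SLO target) and W (window length) are introduced here for illustration only:

```latex
\[
B = (1 - S)\,W
\]
\[
B_{7\,\mathrm{d}} = (1 - 0.999) \times 7 \times 24 \times 60\ \mathrm{min}
                  = 0.001 \times 10080\ \mathrm{min}
                  = 10.08\ \mathrm{min}
\]
```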
A reference implementation for error budget alerting:
Calculate the error budget consumed over the past 7 days (or over a longer window such as 30 days, or a shorter one such as 3 days)
Alert levels:
CRITICAL: error budget consumed >= 90% (or 100%) (i.e., about 9.07 minutes of unavailability in the past 7 days; availability has dropped to 99.91%, approaching the 99.9% danger threshold)
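The calculation above can be sketched as a Prometheus recording rule plus an alert. This is a minimal sketch under stated assumptions, not a definitive implementation: it measures budget consumption by request ratio (failed requests / all requests) rather than by minutes of downtime; it borrows the http_requests_total metric, the job="busi" label and the 5xx error definition from the practical examples later in this article; and the group, record and alert names are hypothetical.

```yaml
groups:
  - name: SLOs-error-budget                      # hypothetical group name
    rules:
      # Fraction of the 7-day error budget already consumed:
      # (observed error ratio over 7d) / (allowed error ratio, 1 - 0.999).
      # A value of 1.0 means the whole budget has been spent.
      - expr: |
          (
            sum(rate(http_requests_total{job="busi",code=~"5.."}[7d]))
            /
            sum(rate(http_requests_total{job="busi"}[7d]))
          )
          /
          (1 - 0.999)
        labels:
          job: busi
        record: errorbudget:http_requests_total:consumed7d   # hypothetical rule name
      # Fire once at least 90% of the budget is gone.
      - alert: ErrorBudgetConsumed                           # hypothetical alert name
        annotations:
          message: 'Error budget consumed for job=busi over the past 7d (current value: {{ $value }})'
        expr: |
          errorbudget:http_requests_total:consumed7d{job="busi"} >= 0.9
        labels:
          job: busi
          severity: critical
```

A 7d range selector only works if Prometheus retains at least 7 days of data. If no 5xx series exists yet, the numerator is empty, nothing is recorded, and the alert simply does not fire.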
Burn Rate
Suppose our contract with users specifies 99.9% availability over a 30-day window. This translates to a 43-minute error budget. If we consume those 43 minutes through small incremental failures, our users may still be happy and productive. But what if we have a single 43-minute outage during critical business hours? It’s safe to say our users would be very unhappy with that experience!
To address this, Google SRE introduced the concept of Burn Rate. The definition is simple: if we consume exactly 43 minutes over 30 days in our example, that’s a burn rate of 1. If we consume it at twice the speed — for example, exhausting the budget in 15 days — the burn rate is 2, and so on. As you can see, this allows us to track long-term compliance while alerting on severe short-term issues.
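Stated as a formula, with symbols introduced here for illustration: E is the error ratio observed over the measurement window, S the SLO target, W the SLO window, and T the time until the budget is exhausted if the burn rate stays constant.

```latex
\[
\text{burn rate} = \frac{E}{1 - S},
\qquad
T_{\text{exhaust}} = \frac{W}{\text{burn rate}}
\]

% Example from the text: S = 0.999, W = 30 days
% burn rate 1:  E = 0.001,  T = 30 / 1 = 30 days
% burn rate 2:  E = 0.002,  T = 30 / 2 = 15 days
```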
The diagram below illustrates the concept of multiple burn rates. The X-axis represents time, and the Y-axis represents the remaining error budget.
│ 📝 Note:
│
│ Essentially, an alert for error budget consumed >= 100% is just a special case of a burn rate alert: burn rate >= 1 measured over the entire SLO window.
A reference implementation for burn rate alerting:
Calculate the burn rate over the past 1 hour (or shorter windows like 5m, or longer windows like 3h–6h…)
Alert levels:
CRITICAL: burn rate >= 14.4 (at this rate, the 30-day availability error budget will be exhausted in about 2 days)
WARNING: burn rate >= 7.2 (at this rate, the 30-day availability error budget will be exhausted in about 4 days)
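A quick check of where those thresholds lead, assuming a constant burn rate and a 30-day (720-hour) SLO window. W is the SLO window and w the alert window; both symbols are introduced here for illustration:

```latex
\[
T_{\text{exhaust}} = \frac{30\ \mathrm{d}}{14.4} \approx 2.08\ \mathrm{d} = 50\ \mathrm{h},
\qquad
T_{\text{exhaust}} = \frac{30\ \mathrm{d}}{7.2} \approx 4.17\ \mathrm{d} = 100\ \mathrm{h}
\]
\[
\text{budget consumed during } w = \text{burn rate} \times \frac{w}{W}:
\qquad
14.4 \times \frac{1\ \mathrm{h}}{720\ \mathrm{h}} = 2\%,
\qquad
7.2 \times \frac{1\ \mathrm{h}}{720\ \mathrm{h}} = 1\%
\]
```

In other words, one hour at a burn rate of 14.4 consumes 2% of the 30-day budget, and one hour at 7.2 consumes 1%.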
Configuring SLO Monitoring and Alerting with Prometheus in Practice
Here we use 2 typical SLOs as examples:
HTTP request success rate of at least 99.9%, i.e., an error rate of at most 0.1% (about 43 min 12 s of unavailability in a 30-day window)
99% of HTTP requests served within 100 ms
HTTP Request Error Rate
Basic information:
Metric: http_requests_total
Label: {job=busi}
Error definition: HTTP status code 5xx, i.e., code=~"5.."
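With that information, the error-rate SLO can be sketched as a set of recording rules plus a multi-window burn rate alert. This is a minimal sketch under stated assumptions: the group name, the errorratio:... record names and the alert name ErrorBudgetBurn are hypothetical, only four windows are shown, and the thresholds reuse the burn rates 14.4 and 7.2 from the previous section with an SLO target of 99.9%.

```yaml
groups:
  - name: SLOs-http_requests_total               # hypothetical group name
    rules:
      # Ratio of HTTP requests answered with 5xx over the past 5m
      - expr: |
          sum(rate(http_requests_total{job="busi",code=~"5.."}[5m]))
          /
          sum(rate(http_requests_total{job="busi"}[5m]))
        labels:
          job: busi
        record: errorratio:http_requests_total:rate5m    # hypothetical rule name
      # Over the past 30m
      - expr: |
          sum(rate(http_requests_total{job="busi",code=~"5.."}[30m]))
          /
          sum(rate(http_requests_total{job="busi"}[30m]))
        labels:
          job: busi
        record: errorratio:http_requests_total:rate30m
      # Over the past 1h
      - expr: |
          sum(rate(http_requests_total{job="busi",code=~"5.."}[1h]))
          /
          sum(rate(http_requests_total{job="busi"}[1h]))
        labels:
          job: busi
        record: errorratio:http_requests_total:rate1h
      # Over the past 6h
      - expr: |
          sum(rate(http_requests_total{job="busi",code=~"5.."}[6h]))
          /
          sum(rate(http_requests_total{job="busi"}[6h]))
        labels:
          job: busi
        record: errorratio:http_requests_total:rate6h
      # Rapid burn: past 1h and 5m above 14.4x, or past 6h and 30m above 7.2x
      - alert: ErrorBudgetBurn                           # hypothetical alert name
        annotations:
          message: 'High error budget burn for job=busi (current value: {{ $value }})'
        expr: |
          (
            errorratio:http_requests_total:rate1h{job="busi"} > (14.4*(1-0.999))
            and
            errorratio:http_requests_total:rate5m{job="busi"} > (14.4*(1-0.999))
          )
          or
          (
            errorratio:http_requests_total:rate6h{job="busi"} > (7.2*(1-0.999))
            and
            errorratio:http_requests_total:rate30m{job="busi"} > (7.2*(1-0.999))
          )
        labels:
          job: busi
          severity: critical
```

When no 5xx series exists, the numerator is empty, so no sample is recorded for that window and the alert does not fire.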
HTTP Request Latency
groups:
  - name: SLOs-http_request_duration_seconds
    rules:
      # Ratio of HTTP requests that missed the 100ms (0.1s) latency target over the past 5m
      # (requests answered with 5xx also count as misses, because the numerator excludes them)
      - expr: |
          1 - (
            sum(rate(http_request_duration_seconds_bucket{job="busi",le="0.1",code!~"5.."}[5m]))
            /
            sum(rate(http_request_duration_seconds_count{job="busi"}[5m]))
          )
        labels:
          job: busi
          latency: "0.1"
        record: latencytarget:http_request_duration_seconds:rate5m
      # Over the past 30m
      - expr: |
          1 - (
            sum(rate(http_request_duration_seconds_bucket{job="busi",le="0.1",code!~"5.."}[30m]))
            /
            sum(rate(http_request_duration_seconds_count{job="busi"}[30m]))
          )
        labels:
          job: busi
          latency: "0.1"
        record: latencytarget:http_request_duration_seconds:rate30m
      # Over the past 1h
      - expr: |
          1 - (
            sum(rate(http_request_duration_seconds_bucket{job="busi",le="0.1",code!~"5.."}[1h]))
            /
            sum(rate(http_request_duration_seconds_count{job="busi"}[1h]))
          )
        labels:
          job: busi
          latency: "0.1"
        record: latencytarget:http_request_duration_seconds:rate1h
      # Over the past 2h
      - expr: |
          1 - (
            sum(rate(http_request_duration_seconds_bucket{job="busi",le="0.1",code!~"5.."}[2h]))
            /
            sum(rate(http_request_duration_seconds_count{job="busi"}[2h]))
          )
        labels:
          job: busi
          latency: "0.1"
        record: latencytarget:http_request_duration_seconds:rate2h
      # Over the past 6h
      - expr: |
          1 - (
            sum(rate(http_request_duration_seconds_bucket{job="busi",le="0.1",code!~"5.."}[6h]))
            /
            sum(rate(http_request_duration_seconds_count{job="busi"}[6h]))
          )
        labels:
          job: busi
          latency: "0.1"
        record: latencytarget:http_request_duration_seconds:rate6h
      # Over the past 1d
      - expr: |
          1 - (
            sum(rate(http_request_duration_seconds_bucket{job="busi",le="0.1",code!~"5.."}[1d]))
            /
            sum(rate(http_request_duration_seconds_count{job="busi"}[1d]))
          )
        labels:
          job: busi
          latency: "0.1"
        record: latencytarget:http_request_duration_seconds:rate1d
      # Over the past 3d
      - expr: |
          1 - (
            sum(rate(http_request_duration_seconds_bucket{job="busi",le="0.1",code!~"5.."}[3d]))
            /
            sum(rate(http_request_duration_seconds_count{job="busi"}[3d]))
          )
        labels:
          job: busi
          latency: "0.1"
        record: latencytarget:http_request_duration_seconds:rate3d
      # 🐾 HTTP response time SLO short/mid-term rapid burn
      # - Past 5m and 1h burn rate exceeds 14.4
      # - Or: past 30m and 6h burn rate exceeds 7.2
      - alert: LatencyBudgetBurn
        annotations:
          message: 'High requests latency budget burn for job=busi,latency=0.1 (current value: {{ $value }})'
        expr: |
          (
            latencytarget:http_request_duration_seconds:rate1h{job="busi",latency="0.1"} > (14.4*(1-0.99))
            and
            latencytarget:http_request_duration_seconds:rate5m{job="busi",latency="0.1"} > (14.4*(1-0.99))
          )
          or
          (
            latencytarget:http_request_duration_seconds:rate6h{job="busi",latency="0.1"} > (7.2*(1-0.99))
            and
            latencytarget:http_request_duration_seconds:rate30m{job="busi",latency="0.1"} > (7.2*(1-0.99))
          )
        labels:
          job: busi
          latency: "0.1"
          severity: critical
      # HTTP response time SLO mid/long-term slow burn
      # - Past 2h and 1d burn rate exceeds 3
      # - Or: past 6h and 3d burn rate exceeds 1
      - alert: LatencyBudgetBurn
        annotations:
          message: 'High requests latency budget burn for job=busi,latency=0.1 (current value: {{ $value }})'
        expr: |
          (
            latencytarget:http_request_duration_seconds:rate1d{job="busi",latency="0.1"} > (3*(1-0.99))
            and
            latencytarget:http_request_duration_seconds:rate2h{job="busi",latency="0.1"} > (3*(1-0.99))
          )
          or
          (
            latencytarget:http_request_duration_seconds:rate3d{job="busi",latency="0.1"} > ((1-0.99))
            and
            latencytarget:http_request_duration_seconds:rate6h{job="busi",latency="0.1"} > ((1-0.99))
          )
        labels:
          job: busi
          latency: "0.1"
          severity: warning
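The thresholds in the alert expressions are burn rates converted into ratios of requests that miss the latency target, which is why each one is multiplied by (1-0.99). Worked out, with S denoting the SLO target (a symbol introduced here for illustration):

```latex
\[
\text{threshold} = \text{burn rate} \times (1 - S), \qquad S = 0.99
\]
\[
\begin{aligned}
14.4 \times 0.01 &= 0.144 \\
 7.2 \times 0.01 &= 0.072 \\
 3   \times 0.01 &= 0.03  \\
 1   \times 0.01 &= 0.01
\end{aligned}
\]
```

So the critical alert fires when more than 14.4% of requests missed the 100 ms target over both the past 1h and the past 5m, or more than 7.2% over both the past 6h and the past 30m. Requiring the short window as well lets the alert clear quickly once the problem stops.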
🎉🎉🎉
Summary
Prometheus is the de facto standard for cloud-native and container platform monitoring, and this article covered how to configure SLO monitoring and alerting with it.
We demonstrated 2 typical SLOs: HTTP response time and error rate. The error rate SLO is fairly straightforward, while the response time SLO can be a bit tricky and may take some time to digest.