Configuring SLO Monitoring and Alerting with Prometheus

This article was last updated on: June 29, 2026

Overview

Prometheus is the de facto standard for cloud-native and container platform monitoring. Let's look at how to configure SLO monitoring and alerting with it.

SLO Alerting

For SLO alerting, based on Google SRE best practices, the following alert dimensions are recommended:

Burn Rate Alerts
Error Budget Alerts

Error Budget

Suppose our contract with users specifies 99.9% availability over a 7-day window. This translates to an error budget of about 10 minutes (0.1% of 10,080 minutes = 10.08 minutes).
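
The arithmetic behind that figure can be sketched in a few lines of Python. The function name `error_budget_minutes` is made up for illustration; it simply multiplies the window length by the fraction of time the SLO allows to be bad.

```python
def error_budget_minutes(slo: float, window_days: float) -> float:
    """Allowed unavailability, in minutes, for an availability SLO over a window.

    `slo` is a fraction (0.999 for 99.9%), not a percentage.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction between 0 and 1, e.g. 0.999")
    window_minutes = window_days * 24 * 60
    return window_minutes * (1.0 - slo)


if __name__ == "__main__":
    # 7-day and 30-day windows at 99.9% availability
    print(round(error_budget_minutes(0.999, 7), 2))
    print(round(error_budget_minutes(0.999, 30), 2))
```

The same helper applies to any other window or target, for example a 30-day window at 99.9%.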

A reference implementation for error budget alerting:

Calculate the error budget consumed over the past 7 days (or a longer window such as 30 days, or a shorter one such as 3 days)
Alert levels:
1. CRITICAL: error budget consumed >= 90% (or 100%) (i.e., about 9.07 minutes of unavailability in the past 7 days; availability has dropped to 99.91%, approaching the 99.9% danger threshold)
2. WARNING: error budget consumed >= 75%
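
As a rough illustration of the two levels above, the following Python sketch turns minutes of unavailability into a consumed-budget fraction and maps it to a level. The names `budget_consumed` and `budget_alert_level` are hypothetical, and the 75% / 90% defaults simply mirror the list above.

```python
from typing import Optional


def budget_consumed(bad_minutes: float, slo: float, window_days: float) -> float:
    """Fraction of the error budget used up (1.0 means fully spent)."""
    budget_minutes = window_days * 24 * 60 * (1.0 - slo)
    return bad_minutes / budget_minutes


def budget_alert_level(
    consumed: float, warning: float = 0.75, critical: float = 0.90
) -> Optional[str]:
    """Map a consumed-budget fraction to an alert level, or None if healthy."""
    if consumed >= critical:
        return "CRITICAL"
    if consumed >= warning:
        return "WARNING"
    return None


if __name__ == "__main__":
    # Assumed sample values: minutes of unavailability in the past 7 days
    for bad in (5.0, 8.0, 9.1):
        used = budget_consumed(bad, slo=0.999, window_days=7)
        print(f"{bad} min -> {used:.1%} consumed -> {budget_alert_level(used)}")
```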

│ 📝Notes:
│
│ Key Words:
│
│ - SLO
│ - Time window
│ - Threshold

Burn Rate

Suppose our contract with users specifies 99.9% availability over a 30-day window. This translates to a 43-minute error budget. If we consume those 43 minutes through small incremental failures, our users may still be happy and productive. But what if we have a single 43-minute outage during critical business hours? It’s safe to say our users would be very unhappy with that experience!

To address this, Google SRE introduced the concept of Burn Rate. The definition is simple: if we consume exactly 43 minutes over 30 days in our example, that’s a burn rate of 1. If we consume it at twice the speed — for example, exhausting the budget in 15 days — the burn rate is 2, and so on. As you can see, this allows us to track long-term compliance while alerting on severe short-term issues.
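
The definition above can be written down directly: burn rate is the observed error ratio divided by the error ratio the SLO allows. The sketch below is illustrative only; the function names are assumptions, not part of any library.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than allowed the error budget is being spent."""
    allowed_error_ratio = 1.0 - slo
    return error_ratio / allowed_error_ratio


def days_to_exhaustion(rate: float, window_days: float = 30.0) -> float:
    """Days until the whole budget is gone if the burn rate stays constant."""
    if rate <= 0:
        return float("inf")
    return window_days / rate


if __name__ == "__main__":
    # Assumed sample error ratios: 0.1% and 0.2% of requests failing
    for ratio in (0.001, 0.002):
        rate = burn_rate(ratio, slo=0.999)
        print(f"error ratio {ratio}: burn rate {rate:.2f}, "
              f"budget lasts {days_to_exhaustion(rate):.1f} days")
```

With a 99.9% SLO, a steady 0.1% error ratio corresponds to a burn rate of 1, and a burn rate of 2 exhausts the 30-day budget in 15 days, matching the example in the text.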

The diagram below illustrates the concept of multiple burn rates. The X-axis represents time, and the Y-axis represents the remaining error budget.

Figure: SLO Burn Rate

│ 📝Notes:
│
│ Essentially, an alert for error budget consumed >= 100% is a special case of a burn rate alert: burn rate >= 1 measured over the whole SLO window.

A reference implementation for burn rate alerting:

Calculate the burn rate over the past 1 hour (or shorter windows like 5m, or longer windows like 3h–6h…)
Alert levels:
1. CRITICAL: burn rate >= 14.4 (at this rate, the 30-day availability error budget will be exhausted in about 2 days, i.e., 50 hours)
2. WARNING: burn rate >= 7.2 (at this rate, the 30-day availability error budget will be exhausted in about 4 days, i.e., 100 hours)
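
The thresholds above are not arbitrary. One common way to derive them, sketched here under the assumption of a 30-day SLO window, is to decide what share of the budget may be spent within the alert window and convert that into a burn rate. The helper names are hypothetical.

```python
def burn_rate_threshold(
    budget_fraction: float, alert_window_hours: float, slo_window_days: float = 30.0
) -> float:
    """Burn rate that spends `budget_fraction` of the budget in the alert window."""
    slo_window_hours = slo_window_days * 24
    return budget_fraction * slo_window_hours / alert_window_hours


def hours_to_exhaustion(rate: float, slo_window_days: float = 30.0) -> float:
    """Hours until the full budget is spent at a constant burn rate."""
    return slo_window_days * 24 / rate


if __name__ == "__main__":
    # Spending 2% of a 30-day budget within 1 hour corresponds to 14.4
    print(round(burn_rate_threshold(0.02, alert_window_hours=1), 2))
    # Spending 1% within 1 hour corresponds to 7.2
    print(round(burn_rate_threshold(0.01, alert_window_hours=1), 2))
    print(round(hours_to_exhaustion(14.4), 1), round(hours_to_exhaustion(7.2), 1))
```

Working backwards this way makes it easier to pick thresholds for other windows instead of copying 14.4 and 7.2 blindly.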

Configuring SLO Monitoring and Alerting with Prometheus in Practice

Here we use 2 typical SLOs as examples:

HTTP request success rate of at least 99.9%, i.e., an error rate of at most 0.1% (43min 12s of unavailability in a 30-day window)
99% of HTTP requests served within 100ms

HTTP Request Error Rate

Basic information:

Metric: http_requests_total
Label: {job="busi"}
Error definition: HTTP status code 5xx, i.e., code=~"5.."

The complete Prometheus Rule is as follows:

groups:
- name: SLOs-http_requests_total
  rules:
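  # Error ratio (5xx requests / all requests) over the past 5m.
  # Despite the "burnrate" name, the recorded value is a raw error ratio;
  # the alerts below compare it against burn rate * (1 - SLO).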
  - expr: |
      sum(rate(http_requests_total{job="busi",code=~"5.."}[5m]))
      /
      sum(rate(http_requests_total{job="busi"}[5m]))
    labels:
      job: busi
    record: http_requests_total:burnrate5m
  # Over the past 30m
  - expr: |
      sum(rate(http_requests_total{job="busi",code=~"5.."}[30m]))
      /
      sum(rate(http_requests_total{job="busi"}[30m]))
    labels:
      job: busi
    record: http_requests_total:burnrate30m
  # Over the past 1h
  - expr: |
      sum(rate(http_requests_total{job="busi",code=~"5.."}[1h]))
      /
      sum(rate(http_requests_total{job="busi"}[1h]))
    labels:
      job: busi
    record: http_requests_total:burnrate1h
  # Over the past 6h
  - expr: |
      sum(rate(http_requests_total{job="busi",code=~"5.."}[6h]))
      /
      sum(rate(http_requests_total{job="busi"}[6h]))
    labels:
      job: busi
    record: http_requests_total:burnrate6h
  # Over the past 1d
  - expr: |
      sum(rate(http_requests_total{job="busi",code=~"5.."}[1d]))
      /
      sum(rate(http_requests_total{job="busi"}[1d]))
    labels:
      job: busi
    record: http_requests_total:burnrate1d
  # Over the past 3d
  - expr: |
      sum(rate(http_requests_total{job="busi",code=~"5.."}[3d]))
      /
      sum(rate(http_requests_total{job="busi"}[3d]))
    labels:
      job: busi
    record: http_requests_total:burnrate3d
  # 🐾 Rapid short-term burn
  # Burn rate over the past 5m and 1h both exceed 14.4
  - alert: ErrorBudgetBurn
    annotations:
      message: 'High error budget burn for job=busi (current value: {{ $value }})'
    expr: |
      sum(http_requests_total:burnrate5m{job="busi"}) > (14.40 * (1-0.99900))
      and
      sum(http_requests_total:burnrate1h{job="busi"}) > (14.40 * (1-0.99900))
    for: 2m
    labels:
      job: busi
      severity: critical
  # 🐾 Excessive mid-term burn
  # Burn rate over the past 30m and 6h both exceed 7.2
  - alert: ErrorBudgetBurn
    annotations:
      message: 'High error budget burn for job=busi (current value: {{ $value }})'
    expr: |
      sum(http_requests_total:burnrate30m{job="busi"}) > (7.20 * (1-0.99900))
      and
      sum(http_requests_total:burnrate6h{job="busi"}) > (7.20 * (1-0.99900))
    for: 15m
    labels:
      job: busi
      severity: warning
  # 🐾 Long-term error budget exceeded
  # Burn rate over the past 6h and 3d both exceed 1 (the budget is being spent faster than the SLO allows)
  - alert: ErrorBudgetAlert
    annotations:
      message: 'High error budget burn for job=busi (current value: {{ $value }})'
    expr: |
      sum(http_requests_total:burnrate6h{job="busi"}) > (1.00 * (1-0.99900))
      and
      sum(http_requests_total:burnrate3d{job="busi"}) > (1.00 * (1-0.99900))
    for: 3h
    labels:
      job: busi
      severity: warning
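
To make the multi-window logic of these alerts easier to follow, here is a small Python sketch that mimics the three `expr` conditions on plain numbers. It is not how Prometheus evaluates rules (the `for:` durations are ignored), and the `RULES` table and `firing_alerts` function are illustrative names only.

```python
SLO = 0.999

# (alert name, severity, short window, long window, burn rate factor)
RULES = [
    ("ErrorBudgetBurn", "critical", "5m", "1h", 14.4),
    ("ErrorBudgetBurn", "warning", "30m", "6h", 7.2),
    ("ErrorBudgetAlert", "warning", "6h", "3d", 1.0),
]


def firing_alerts(error_ratios: dict) -> list:
    """Return (name, severity) pairs whose short AND long windows exceed the threshold."""
    firing = []
    for name, severity, short, long_, factor in RULES:
        threshold = factor * (1.0 - SLO)
        if error_ratios[short] > threshold and error_ratios[long_] > threshold:
            firing.append((name, severity))
    return firing


if __name__ == "__main__":
    # Assumed sample error ratios per window, as the recording rules would produce
    sustained = {"5m": 0.02, "30m": 0.02, "1h": 0.016, "6h": 0.005, "3d": 0.0005}
    brief_spike = {"5m": 0.05, "30m": 0.004, "1h": 0.001, "6h": 0.0002, "3d": 0.0001}
    print(firing_alerts(sustained))
    print(firing_alerts(brief_spike))
```

Requiring both windows to exceed the threshold is the point of the design: the long window shows that a meaningful share of the budget is gone, while the short window confirms the problem is still happening, so a brief spike alone does not page anyone.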

HTTP Request Latency

Basic information:

Metric: http_request_duration_seconds
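Label: {job="busi"}
SLO: 99% of HTTP requests should complete within 100ms
Only successful (non-5xx) requests can count as fast: the rules below exclude 5xx responses from the numerator but keep them in the denominator, so a 5xx response also counts against the latency target. If errors should be left out entirely (the error rate SLO above already covers them), add code!~"5.." to the _count selector as well.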

The complete Prometheus Rule is as follows:

groups:
- name: SLOs-http_request_duration_seconds
  rules:
  # Fraction (0 to 1) of HTTP requests that were not served successfully within 100ms (0.1s) over the past 5m
  - expr: |
      1 - (
        sum(rate(http_request_duration_seconds_bucket{job="busi",le="0.1",code!~"5.."}[5m]))
        /
        sum(rate(http_request_duration_seconds_count{job="busi"}[5m]))
      )
    labels:
      job: busi
      latency: "0.1"
    record: latencytarget:http_request_duration_seconds:rate5m
  # Over the past 30m
  - expr: |
      1 - (
        sum(rate(http_request_duration_seconds_bucket{job="busi",le="0.1",code!~"5.."}[30m]))
        /
        sum(rate(http_request_duration_seconds_count{job="busi"}[30m]))
      )
    labels:
      job: busi
      latency: "0.1"
    record: latencytarget:http_request_duration_seconds:rate30m
  # Over the past 1h
  - expr: |
      1 - (
        sum(rate(http_request_duration_seconds_bucket{job="busi",le="0.1",code!~"5.."}[1h]))
        /
        sum(rate(http_request_duration_seconds_count{job="busi"}[1h]))
      )
    labels:
      job: busi
      latency: "0.1"
    record: latencytarget:http_request_duration_seconds:rate1h
  # Over the past 2h
  - expr: |
      1 - (
        sum(rate(http_request_duration_seconds_bucket{job="busi",le="0.1",code!~"5.."}[2h]))
        /
        sum(rate(http_request_duration_seconds_count{job="busi"}[2h]))
      )
    labels:
      job: busi
      latency: "0.1"
    record: latencytarget:http_request_duration_seconds:rate2h
  # Over the past 6h
  - expr: |
      1 - (
        sum(rate(http_request_duration_seconds_bucket{job="busi",le="0.1",code!~"5.."}[6h]))
        /
        sum(rate(http_request_duration_seconds_count{job="busi"}[6h]))
      )
    labels:
      job: busi
      latency: "0.1"
    record: latencytarget:http_request_duration_seconds:rate6h
  # Over the past 1d
  - expr: |
      1 - (
        sum(rate(http_request_duration_seconds_bucket{job="busi",le="0.1",code!~"5.."}[1d]))
        /
        sum(rate(http_request_duration_seconds_count{job="busi"}[1d]))
      )
    labels:
      job: busi
      latency: "0.1"
    record: latencytarget:http_request_duration_seconds:rate1d
  # Over the past 3d
  - expr: |
      1 - (
        sum(rate(http_request_duration_seconds_bucket{job="busi",le="0.1",code!~"5.."}[3d]))
        /
        sum(rate(http_request_duration_seconds_count{job="busi"}[3d]))
      )
    labels:
      job: busi
      latency: "0.1"
    record: latencytarget:http_request_duration_seconds:rate3d
  # 🐾 HTTP response time SLO short/mid-term rapid burn
  # - Past 5m and 1h burn rate exceeds 14.4
  # - Or: past 30m and 6h burn rate exceeds 7.2
  - alert: LatencyBudgetBurn
    annotations:
      message: 'High requests latency budget burn for job=busi,latency=0.1 (current value: {{ $value }})'
    expr: |
      (
        latencytarget:http_request_duration_seconds:rate1h{job="busi",latency="0.1"} > (14.4*(1-0.99))
        and
        latencytarget:http_request_duration_seconds:rate5m{job="busi",latency="0.1"} > (14.4*(1-0.99))
      )
      or
      (
        latencytarget:http_request_duration_seconds:rate6h{job="busi",latency="0.1"} > (7.2*(1-0.99))
        and
        latencytarget:http_request_duration_seconds:rate30m{job="busi",latency="0.1"} > (7.2*(1-0.99))
      )
    labels:
      job: busi
      latency: "0.1"
      severity: critical
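  # 🐾 HTTP response time SLO long-term slow burn
  # - Past 2h and 1d burn rate exceeds 3
  # - Or: past 6h and 3d burn rate exceeds 1
  - alert: LatencyBudgetBurn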
    annotations:
      message: 'High requests latency budget burn for job=busi,latency=0.1 (current value: {{ $value }})'
    expr: |
      (
        latencytarget:http_request_duration_seconds:rate1d{job="busi",latency="0.1"} > (3*(1-0.99))
        and
        latencytarget:http_request_duration_seconds:rate2h{job="busi",latency="0.1"} > (3*(1-0.99))
      )
      or
      (
        latencytarget:http_request_duration_seconds:rate3d{job="busi",latency="0.1"} > ((1-0.99))
        and
        latencytarget:http_request_duration_seconds:rate6h{job="busi",latency="0.1"} > ((1-0.99))
      )
    labels:
      job: busi
      latency: "0.1"
      severity: warning
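
The latency rules follow the same pattern as the error rate rules once the histogram is reduced to a single "bad fraction". The Python sketch below shows that reduction on plain counts; the function names and sample numbers are assumptions for illustration, with `fast_ok` standing in for the `le="0.1"` bucket of non-5xx requests and `total` for the `_count` series.

```python
LATENCY_SLO = 0.99  # 99% of requests within 100ms


def slow_fraction(fast_ok: float, total: float) -> float:
    """Fraction of requests that were NOT served successfully within the target."""
    if total <= 0:
        return 0.0  # no traffic: treat as no budget burned
    return 1.0 - fast_ok / total


def latency_burn_rate(fast_ok: float, total: float, slo: float = LATENCY_SLO) -> float:
    """Burn rate of the latency error budget for one window."""
    return slow_fraction(fast_ok, total) / (1.0 - slo)


if __name__ == "__main__":
    # Assumed sample: 1000 requests in the window, 970 of them fast and successful
    print(round(slow_fraction(970, 1000), 4))
    print(round(latency_burn_rate(970, 1000), 2))
```

In this sample 3% of requests miss the target against an allowance of 1%, so the burn rate is 3, which is exactly the threshold used for the 2h and 1d windows in the warning alert.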

🎉🎉🎉

Summary

Prometheus is the de facto standard for cloud-native and container platform monitoring, and this article covered how to configure SLO monitoring and alerting with it.

We demonstrated two typical SLOs: HTTP error rate and HTTP response time. The error rate SLO is fairly straightforward, while the response time SLO is trickier, because the histogram buckets must first be turned into a fraction of slow requests, and may take some time to digest.

😼😼😼

https://e-whisper.com/posts/51959/

Author

east4ming

Posted on

October 14, 2022