Grafana Series - Loki - Implementing Alerting Based on Logs

This article was last updated on: May 17, 2026 am

Introduction

In practice, beyond metrics-based alerting, there is often a need for log-based alerting as a supplement. A typical example is error rate alerting based on NGINX logs. This article introduces how to implement log-based alerting using Loki.

We will walk through two real-world scenarios:

Error rate alerting based on NGINX logs
Heartbeat anomaly alerting based on Nomad logs (for an introduction to Nomad, see this article: Large-Scale IoT Edge Container Cluster Management Architectures - 2 - HashiCorp Solution Nomad)

Use Cases for Log-Based Alerting

Log-based alerting is widely used in the following scenarios:

Black-Box Monitoring

This applies to components we did not develop ourselves, such as cloud-provider or third-party load balancers and the many other open-source and closed-source components that support our applications but do not expose the metrics we want, or any metrics at all. Loki’s alerting and recording rules can derive metrics and alerts about system state from the logs of these components, bringing them into our observability stack. This is a powerful way to introduce advanced observability into legacy architectures.

Event Alerting

Sometimes you simply want to know whether something has happened. Alerting based on logs works well for this — for example, detecting leaked authentication credentials:

- name: credentials_leak
  rules: 
    - alert: http-credentials-leaked
      annotations: 
        message: "{{ $labels.job }} is leaking http basic auth credentials."
      expr: 'sum by (cluster, job, pod) (count_over_time({namespace="prod"} |~ "http(s?)://(\\w+):(\\w+)@" [5m]) > 0)'
      for: 10m
      labels: 
        severity: critical

The Nomad scenario falls into this category.
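To get a feel for what the line filter in the rule above matches, here is a minimal Python sketch. It uses Python's re module rather than Loki's RE2 engine (for this simple expression the two should behave the same), and the sample log lines are hypothetical, invented only for illustration.

```python
import re

# Same expression as in the rule; the YAML version doubles the backslashes.
CREDENTIALS_RE = re.compile(r"http(s?)://(\w+):(\w+)@")


def leaks_credentials(line: str) -> bool:
    """Return True if the line contains a URL with inline basic-auth credentials."""
    return CREDENTIALS_RE.search(line) is not None


# Hypothetical log lines (not taken from a real system).
sample_lines = [
    'level=info msg="fetching https://admin:s3cret@example.internal/api"',
    'level=info msg="fetching https://example.internal/api"',
]

matches = [line for line in sample_lines if leaks_credentials(line)]
print(f"{len(matches)} of {len(sample_lines)} lines would count towards the alert")
```

Only the first line contains a user:password pair before the @, so only that line would be counted by count_over_time.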

Technical Background

Loki Alerting

Grafana Loki includes a component called the ruler. The ruler is responsible for continuously evaluating a set of configurable queries and taking actions based on the results. It supports two types of rules: alerting rules and recording rules.

Loki Alerting Rules

Loki’s alerting rule format is almost identical to that of Prometheus. Here is a complete example:

groups:
  - name: should_fire
    rules:
      - alert: HighPercentageError
        expr: |
          sum(rate({app="foo", env="production"} |= "error" [5m])) by (job)
            /
          sum(rate({app="foo", env="production"}[5m])) by (job)
            > 0.05
        for: 10m
        labels:
            severity: page
        annotations:
            summary: High request latency
  - name: credentials_leak
    rules: 
      - alert: http-credentials-leaked
        annotations: 
          message: "{{ $labels.job }} is leaking http basic auth credentials."
        expr: 'sum by (cluster, job, pod) (count_over_time({namespace="prod"} |~ "http(s?)://(\\w+):(\\w+)@" [5m]) > 0)'
        for: 10m
        labels: 
          severity: critical
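The for: field in these rules is easy to misread, so here is a small Python sketch of how I understand it: the expression has to stay true for the whole duration before the alert moves from pending to firing. The function name alert_state and the one-minute evaluation interval are assumptions made for this illustration; this is not the ruler's actual implementation.

```python
from datetime import datetime, timedelta


def alert_state(evaluations, for_duration):
    """Return 'inactive', 'pending' or 'firing' after the last evaluation.

    evaluations: list of (timestamp, condition_is_true) tuples in time order.
    """
    active_since = None
    state = "inactive"
    for timestamp, is_true in evaluations:
        if not is_true:
            # Any false evaluation resets the pending timer.
            active_since = None
            state = "inactive"
            continue
        if active_since is None:
            active_since = timestamp
        held_for = timestamp - active_since
        state = "firing" if held_for >= for_duration else "pending"
    return state


# Assumed evaluation interval: one minute.
start = datetime(2023, 12, 10, 12, 0)
eleven_true = [(start + timedelta(minutes=i), True) for i in range(11)]

print(alert_state(eleven_true[:5], timedelta(minutes=10)))
print(alert_state(eleven_true, timedelta(minutes=10)))
```

With for: 10m, a condition that has only been true for four minutes is still pending; it fires once it has held for the full ten minutes.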

Loki LogQL Queries

Loki’s Log Query Language (LogQL) is a query language used to retrieve logs from Loki. LogQL is very similar to PromQL, the Prometheus query language, but has some important differences.

LogQL Quick Start

All LogQL queries contain a log stream selector, such as {container="query-frontend",namespace="loki-dev"}.

Optionally, a log pipeline can be appended after the log stream selector. A log pipeline is a set of stage expressions that are chained together and applied to the selected log stream. Each expression can filter, parse, or modify log lines and their respective labels.

The following example shows a complete log query in action:

{container="query-frontend",namespace="loki-dev"}
  |= "metrics.go"
  | logfmt
  | duration > 10s
  and throughput_mb < 500

This query consists of:

A log stream selector {container="query-frontend",namespace="loki-dev"} targeting the query-frontend container in the loki-dev namespace.
A log pipeline |= "metrics.go" | logfmt | duration > 10s and throughput_mb < 500 that keeps only log lines containing metrics.go, then parses each log line to extract additional labels and filters on them.
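The stages of this pipeline can be mimicked in a few lines of Python, which may help if LogQL is new to you. This is only a sketch: the helper names, the tiny duration parser and the sample lines are all assumptions made for illustration, and Loki's own logfmt parser handles more cases than this.

```python
import shlex


def parse_logfmt(line):
    """Parse key=value pairs; shlex takes care of quoted values."""
    labels = {}
    for token in shlex.split(line):
        if "=" in token:
            key, value = token.split("=", 1)
            labels[key] = value
    return labels


def seconds(value):
    """Convert a small subset of Go-style durations (ms, s, m) to seconds."""
    for suffix, factor in (("ms", 0.001), ("s", 1.0), ("m", 60.0)):
        if value.endswith(suffix):
            return float(value[: -len(suffix)]) * factor
    raise ValueError(f"unsupported duration: {value}")


def pipeline(lines):
    for line in lines:
        if "metrics.go" not in line:            # |= "metrics.go"
            continue
        labels = parse_logfmt(line)             # | logfmt
        try:
            slow = seconds(labels["duration"]) > 10            # | duration > 10s
            small = float(labels["throughput_mb"]) < 500       # and throughput_mb < 500
        except (KeyError, ValueError):
            continue                            # lines without usable labels are dropped
        if slow and small:
            yield line, labels


# Hypothetical log lines for illustration.
sample_lines = [
    "level=info caller=metrics.go:81 duration=12s throughput_mb=120",
    "level=info caller=metrics.go:81 duration=2s throughput_mb=120",
    "level=info caller=metrics.go:81 duration=15s throughput_mb=900",
    "level=info caller=engine.go:10 duration=30s throughput_mb=10",
]

kept = list(pipeline(sample_lines))
print(len(kept))
```

Only the first sample line passes every stage: the second is too fast, the third has too much throughput, and the fourth does not contain metrics.go.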

Parser Expressions

For alerting purposes, we often need to parse unstructured logs before alerting. Parsing yields more precise field information (called labels), which is why we need parser expressions.

Parser expressions parse and extract labels from log content. These extracted labels can be used for filtering with label filter expressions or for metrics aggregation.

If an extracted label key name already exists in the original log stream (typically something like level), the extracted label key will be suffixed with the _extracted keyword to distinguish the two labels. You can also use label format expressions to forcefully override the original label. However, if an extracted key appears twice, only the first label value is retained.
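The naming rules in the previous paragraph can be sketched as follows. The function merge_extracted is hypothetical and only mirrors the behaviour described in the text; it is not how Loki implements it.

```python
def merge_extracted(stream_labels, extracted_pairs):
    """Combine stream labels with parser-extracted (key, value) pairs."""
    merged = dict(stream_labels)
    for key, value in extracted_pairs:
        if key in stream_labels:
            # Collision with an original stream label: add the suffix.
            key = f"{key}_extracted"
        if key not in merged:
            # For a key extracted more than once, only the first value is kept.
            merged[key] = value
    return merged


stream = {"job": "nomad", "level": "info"}
extracted = [
    ("level", "WARN"),
    ("component", "nomad.heartbeat"),
    ("component", "nomad.rpc"),
]

print(merge_extracted(stream, extracted))
```

In this example the stream keeps its original level label, the parsed value lands in level_extracted, and the second component value is ignored.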

Loki supports JSON, logfmt, pattern, regexp, and unpack parsers.

Today we will focus on the logfmt, pattern, and regexp parsers.

logfmt Parser

The logfmt parser can operate in two modes:

Without Parameters

The logfmt parser can be added with | logfmt and will extract all keys and values from logfmt-formatted log lines.

For example, the following log line:

`at=info method=GET path=/ host=grafana.net fwd="124.133.124.161" service=8ms status=200`

Will extract the following labels:

"at" => "info"
"method" => "GET"
"path" => "/"
"host" => "grafana.net"
"fwd" => "124.133.124.161"
"service" => "8ms"
"status" => "200"

With Parameters

Similar to the JSON parser, using | logfmt label="expression", another="expression" in the pipeline will extract only the fields specified by the labels.

For example, | logfmt host, fwd_ip="fwd" will extract the labels host and fwd from the following log line:

`at=info method=GET path=/ host=grafana.net fwd="124.133.124.161" service=8ms status=200`

And rename fwd to fwd_ip:

"host" => "grafana.net"
"fwd_ip" => "124.133.124.161"
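Both logfmt modes can be imitated in Python to check your expectations against a sample line. The function below is a hypothetical helper written for this article, not part of any Loki client; keyword arguments play the role of the label="expression" parameters.

```python
import shlex

LINE = 'at=info method=GET path=/ host=grafana.net fwd="124.133.124.161" service=8ms status=200'


def logfmt(line, **wanted):
    """Without keyword arguments, extract every key=value pair.

    With keyword arguments, extract only the requested keys; each argument
    maps the new label name to the source key, e.g. fwd_ip="fwd".
    """
    pairs = dict(token.split("=", 1) for token in shlex.split(line) if "=" in token)
    if not wanted:
        return pairs
    return {label: pairs[source] for label, source in wanted.items() if source in pairs}


all_labels = logfmt(LINE)                              # like: | logfmt
selected = logfmt(LINE, host="host", fwd_ip="fwd")     # like: | logfmt host, fwd_ip="fwd"

print(sorted(all_labels))
print(selected)
```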

Pattern Parser

The pattern parser allows explicit extraction of fields from log lines by defining a pattern expression (| pattern "<pattern-expression>"). The expression matches the structure of the log line.

A typical example is NGINX logs:

`0.191.12.2 - - [10/Jun/2021:09:14:29 +0000] "GET /api/plugins/versioncheck HTTP/1.1" 200 2 "-" "Go-http-client/2.0" "13.76.247.102, 34.120.177.193" "TLSv1.2" "US" ""`

This log line can be parsed with the expression:

`<ip> - - <_> "<method> <uri> <_>" <status> <size> <_> "<agent>" <_>`

Extracting these fields:

"ip" => "0.191.12.2"
"method" => "GET"
"uri" => "/api/plugins/versioncheck"
"status" => "200"
"size" => "2"
"agent" => "Go-http-client/2.0"

A pattern expression consists of captures and literals.

Captures are field names delimited by < and > characters. <example> defines the field name example. Unnamed captures appear as <_>. Unnamed captures skip the matched content.
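One way to build intuition for captures and literals is to translate a pattern expression into an ordinary regular expression. The Python sketch below does that under simplifying assumptions (non-greedy captures, literals matched exactly); the function pattern_to_regex is hypothetical, and Loki's real matcher works differently internally.

```python
import re

CAPTURE = re.compile(r"(<[A-Za-z_]\w*>)")


def pattern_to_regex(expression: str) -> re.Pattern:
    """Translate a pattern expression into an anchored regular expression."""
    parts = []
    for piece in CAPTURE.split(expression):
        if CAPTURE.fullmatch(piece):
            name = piece[1:-1]
            # <_> skips content; named captures become named groups.
            parts.append(".*?" if name == "_" else f"(?P<{name}>.*?)")
        else:
            # Everything between captures is a literal and must match exactly.
            parts.append(re.escape(piece))
    return re.compile("^" + "".join(parts) + "$")


EXPRESSION = '<ip> - - <_> "<method> <uri> <_>" <status> <size> <_> "<agent>" <_>'
LINE = (
    '0.191.12.2 - - [10/Jun/2021:09:14:29 +0000] "GET /api/plugins/versioncheck HTTP/1.1" '
    '200 2 "-" "Go-http-client/2.0" "13.76.247.102, 34.120.177.193" "TLSv1.2" "US" ""'
)

match = pattern_to_regex(EXPRESSION).match(LINE)
labels = match.groupdict() if match else {}
print(labels)
```

Because literals must match exactly, a pattern written with one space will not line up with input that has two, which is worth keeping in mind for logs with padded columns.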

Regular Expression Parser

Unlike logfmt and JSON, which implicitly extract all values without requiring parameters, the regexp parser requires exactly one parameter, | regexp "<re>", which is a regular expression using Golang RE2 syntax.

The regular expression must contain at least one named sub-match (e.g., (?P<name>re)), and each sub-match will extract a different label.
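Python's re module happens to use the same (?P<name>...) syntax for named groups, so it is convenient for trying out an expression before putting it into a query. The sketch below uses a hypothetical log line and expression; note that Python's engine is not RE2, so more exotic syntax may behave differently.

```python
import re

# Hypothetical expression: two named groups and one unnamed group.
EXPRESSION = re.compile(r"(?P<method>\w+) (\S+) \((?P<status>\d+)\)")

# Hypothetical log line.
line = "GET /healthz (204)"

match = EXPRESSION.match(line)
labels = match.groupdict() if match else {}
print(labels)
```

Only the named groups show up as labels; the unnamed group in the middle still has to match, but its content is not extracted.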

For example, the parser | regexp `(?P<method>\w+) (?P<path>[\w|/]+) \((?P<status>\d+?)\) (?P<duration>.*)` will extract from the following line:

`POST /api/prom/api/v1/query_range (200) 1.5s`

Into these labels:

"method" => "POST"
"path" => "/api/prom/api/v1/query_range"
"status" => "200"
"duration" => "1.5s"

Hands-On Practice

│ 📝 Note:
│
│ The following two examples are only meant to demonstrate real-world use cases for Loki. In a production environment, if you can already obtain metrics like:
│
│ * NGINX error rate
│ * Nomad Client active count / Nomad Client total count
│
│ via Prometheus, you can alert directly using Prometheus. There is no need to take the detour through logs.

Error Rate Alerting Based on NGINX Logs

We will use the | pattern parser to extract the status label from NGINX logs and use the rate() function to calculate the per-second error rate.

Assume the NGINX log looks like this:

`0.191.12.2 - - [10/Jun/2021:09:14:29 +0000] "GET /api/plugins/versioncheck HTTP/1.1" 200 2 "-" "Go-http-client/2.0" "13.76.247.102, 34.120.177.193" "TLSv1.2" "US" ""`

This log line can be parsed with the expression:

`<ip> - - <_> "<method> <uri> <_>" <status> <size> <_> "<agent>" <_>`

Extracting these fields:

"ip" => "0.191.12.2"
"method" => "GET"
"uri" => "/api/plugins/versioncheck"
"status" => "200"
"size" => "2"
"agent" => "Go-http-client/2.0"

Then filter on the status label, treating any status of 500 or above as an error. The final alerting expression is:

sum(rate({job="nginx"} | pattern `<ip> - - <_> "<method> <uri> <_>" <status> <size> <_> "<agent>" <_>` | status >= 500 [5m])) by (instance)
  /
sum(rate({job="nginx"} [5m])) by (instance)
  > 0.05

Detailed explanation:

The full LogQL means: NGINX per-instance error rate > 5%
{job="nginx"} - The log stream selector, assuming the NGINX job is named nginx, indicating we’re querying NGINX logs.
| pattern `<ip> - - <_> "<method> <uri> <_>" <status> <size> <_> "<agent>" <_>` - Uses the pattern parser as explained in detail above. The expression is wrapped in backticks because it contains double quotes.
| status >= 500 - After parsing, the status label is available. This label filter keeps only error logs where status is 500 or above.
rate(… [5m]) - Calculates the per-second count of 500+ errors over a 5-minute window.
sum () by (instance) - Aggregates by instance, calculating the per-second 500+ error count for each instance.
/ sum(rate({job="nginx"} [5m])) by (instance) > 0.05 - Divides the per-second 500+ error count per instance by the total per-second request count per instance to determine whether the error rate exceeds 5%.

Then create an alerting rule using this expression:

alert: NGINXRequestsErrorRate
expr: >-
  sum(rate({job="nginx"} | pattern `<ip> - - <_> "<method> <uri> <_>" <status> <size> <_> "<agent>" <_>` | status >= 500 [5m])) by (instance)
    /
  sum(rate({job="nginx"} [5m])) by (instance)
    > 0.05
for: 1m
annotations:
  summary: NGINX instance {{ $labels.instance }} error rate exceeds 5%.
  description: ''
  runbook_url: ''
labels:
  severity: 'warning'
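To sanity-check the arithmetic behind this rule, here is a small Python sketch that computes the same per-instance ratio from already-parsed log entries. Because the numerator and denominator use the same 5-minute window, the ratio of the two rates equals the ratio of the two counts. The function name, the error_from parameter (set to 500 here as an assumption of this sketch) and the sample data are made up for illustration.

```python
from collections import Counter


def error_ratio_by_instance(entries, error_from=500):
    """entries: iterable of (instance, status) pairs seen in the window."""
    total = Counter()
    errors = Counter()
    for instance, status in entries:
        total[instance] += 1
        if int(status) >= error_from:
            errors[instance] += 1
    return {instance: errors[instance] / total[instance] for instance in total}


# Hypothetical window: web-1 serves 10 errors out of 100 requests, web-2 serves 2 out of 100.
entries = (
    [("web-1", "200")] * 90 + [("web-1", "502")] * 10
    + [("web-2", "200")] * 98 + [("web-2", "500")] * 2
)

ratios = error_ratio_by_instance(entries)
alerting = sorted(instance for instance, ratio in ratios.items() if ratio > 0.05)
print(ratios)
print(alerting)
```

In this made-up window only web-1 crosses the 5% threshold, so only that instance would produce an alert.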

Done! 🎉🎉🎉

Heartbeat Anomaly Alerting Based on Nomad Logs

A typical Nomad log format looks like this:

2023-12-08T21:39:09.718+0800 [WARN]  nomad.heartbeat: node TTL expired: node_id=daf861cc-641d-f0a6-62ee-d954f6edd3a4
2023-12-07T21:39:04.905+0800 [ERROR] nomad.rpc: multiplex_v2 conn accept failed: error="keepalive timeout"

Here I first tried using the pattern parser with the following expression:

{unit="nomad.service", transport="stdout"}
  | pattern `<time> [<level>]  <component>: <message>`

The result was incorrect. After parsing:

...
"level" => "WARN"
...
"level" => ERROR] nomad.rpc: multiplex_v2 conn accept failed: error="keepalive timeout"

The level parsing is clearly wrong. The reason is that the character after level is not a space but a tab character. This results in 2 spaces after [WARN] and 1 space after [ERROR]. The pattern parser doesn’t handle this situation well, and after checking the official documentation, I couldn’t find a solution for this case in the short term.

So ultimately I had to use the regexp parser instead.

The final parsing expression is:

{unit="nomad.service", transport="stdout"}
  | regexp `(?P<time>\S+)\s+\[(?P<level>\w+)\]\s+(?P<component>\S+): (?P<message>.+)`

Detailed explanation:

(?P<time>\S+) - Parses the timestamp. In Nomad’s format, this is the first sequence of non-whitespace characters, e.g., 2023-12-08T21:39:09.718+0800
\s+ - Matches whitespace between the timestamp and log level
\[(?P<level>\w+)\] - Matches the log level, e.g., [WARN] [ERROR]. Here [ and ] are special characters, so they need to be escaped with \
\s+ - Matches whitespace between the log level and component. Whether it’s one or two spaces or a tab, it all matches
(?P<component>\S+): - Matches the component. \S+ matches one or more non-whitespace characters, i.e., the component name. This segment matches things like nomad.heartbeat: and nomad.rpc:, with component capturing nomad.heartbeat and nomad.rpc
A single literal space follows the colon. Using \s would also work
(?P<message>.+) - Matches the remaining log content. .+ matches one or more characters, i.e., the log message, e.g., node TTL expired: node_id=daf861cc-641d-f0a6-62ee-d954f6edd3a4. The group must be named for the message label to be extracted
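The expression can be tried out in Python before it goes into a query, since Python's re module accepts the same named-group syntax. This is a sketch, assuming the last group is named message so that the message text is captured; the two log lines are the ones shown earlier. In Loki the level label would appear as level_extracted when the stream already carries a level label.

```python
import re

NOMAD_RE = re.compile(
    r"(?P<time>\S+)\s+\[(?P<level>\w+)\]\s+(?P<component>\S+): (?P<message>.+)"
)

lines = [
    "2023-12-08T21:39:09.718+0800 [WARN]  nomad.heartbeat: node TTL expired: "
    "node_id=daf861cc-641d-f0a6-62ee-d954f6edd3a4",
    '2023-12-07T21:39:04.905+0800 [ERROR] nomad.rpc: multiplex_v2 conn accept failed: '
    'error="keepalive timeout"',
]

parsed = [NOMAD_RE.match(line).groupdict() for line in lines]
for labels in parsed:
    print(labels["level"], labels["component"])
```

Thanks to \s+, both the two-space [WARN] line and the one-space [ERROR] line parse with the same expression.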

After parsing:

"time" => 2023-12-08T21:39:09.718+0800
"level_extracted" => WARN
"component" => nomad.heartbeat
"message" => node TTL expired: node_id=daf861cc-641d-f0a6-62ee-d954f6edd3a4

"time" => 2023-12-07T21:39:04.905+0800
"level_extracted" => ERROR
"component" => nomad.rpc
"message" => multiplex_v2 conn accept failed: error="keepalive timeout"

After parsing, we can set up alerting based on these labels. The alerting condition is: component = nomad.heartbeat, level_extracted =~ WARN|ERROR

The specific LogQL is:

count by(job)
(rate(
  {unit="nomad.service", transport="stdout"}
  | regexp `(?P<time>\S+)\s+\[(?P<level>\w+)\]\s+(?P<component>\S+): (?P<message>.+)`
  | component = "nomad.heartbeat"
  | level_extracted =~ "WARN|ERROR" [5m]))
> 3

Detailed explanation:

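The Nomad log stream is: {unit="nomad.service", transport="stdout"}

{unit="nomad.service", transport="stdout"}
  | regexp `(?P<time>\S+)\s+\[(?P<level>\w+)\]\s+(?P<component>\S+): (?P<message>.+)`
  | component = "nomad.heartbeat"
  | level_extracted =~ "WARN|ERROR"

Filters log entries where component is nomad.heartbeat and level_extracted is WARN or ERROR.
rate(… [5m]) produces one series per distinct label set. Because the extracted time label differs on every line, each matching log line ends up in its own series.
count by(job) (…) > 3 therefore alerts when more than 3 such series, in practice more than 3 heartbeat warning or error lines, appear for a job within the 5-minute window.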
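It is worth being precise about what count by(job) over rate() measures. As far as I understand LogQL, a range aggregation over parsed logs yields one series per distinct label set, and count then counts series rather than adding up their values. The Python sketch below imitates that grouping; the function name and sample label sets are hypothetical.

```python
from collections import defaultdict


def count_series_by_job(parsed_lines):
    """parsed_lines: one label dict per matching log line in the window."""
    series = defaultdict(int)
    for labels in parsed_lines:
        # Lines with an identical label set fall into the same series.
        series[frozenset(labels.items())] += 1
    counts = defaultdict(int)
    for label_set in series:
        counts[dict(label_set)["job"]] += 1   # count by(job): one per series
    return dict(counts)


# Four heartbeat lines with different timestamps: four distinct label sets.
with_time = [
    {"job": "nomad", "component": "nomad.heartbeat", "time": f"21:39:0{i}"}
    for i in range(4)
]
# The same four lines without a per-line label collapse into one series.
without_time = [{"job": "nomad", "component": "nomad.heartbeat"} for _ in range(4)]

print(count_series_by_job(with_time))
print(count_series_by_job(without_time))
```

So the threshold of 3 is compared against the number of distinct series, which with a per-line time label is roughly the number of matching lines in the window.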

The final alerting rule is:

alert: Nomad HeartBeat Error
for: 1m
annotations:
  summary: Heartbeat anomaly between Nomad Server and Client.
  description: ''
  runbook_url: ''
labels:
  severity: 'warning'
expr: >-
  count by(job) (rate({unit="nomad.service", transport="stdout"} | regexp
  `(?P<time>\S+)\s+\[(?P<level>\w+)\]\s+(?P<component>\S+): (?P<message>.+)` | component =
  "nomad.heartbeat" | level_extracted =~ "WARN|ERROR" [5m])) > 3

Done 🎉🎉🎉

Leverage the Grafana UI for LogQL

The Grafana UI has excellent support for LogQL, with comprehensive hints/help and guides, as well as a Builder mode and Explain feature that are very suitable for users unfamiliar with LogQL syntax. Don’t be intimidated by the lengthy LogQL and YAML above — you can use Grafana directly to construct the log-based queries and alerts you need.

Specific Grafana feature enhancements include:

Syntax/Spelling Validation (Query Expression Validation): To speed up writing correct LogQL queries, Grafana 9.4 added a new feature: query expression validation. Intuitively, you can call it the “red squiggly” feature, because it uses the same wavy underline you see in word processors when you make a typo 🙂. With query validation, you no longer need to run a query to see if it’s correct. Instead, you get real-time feedback if the query is invalid. When this happens, the red squiggly shows exactly where the error is and which characters are incorrect. Query expression validation also supports multi-line queries.

Autocomplete: For example, you can see suggested parser types (such as logfmt, JSON) based on your query, helping you write more appropriate queries for your data. Additionally, if you use a parser in your query, all labels (including parser-extracted labels) are suggested in grouped range aggregations (such as sum by()).

Query History: Loki’s code editor now has query history directly integrated. Once you start writing a new query, your previously run queries are displayed. This feature is especially useful in Explore, where you typically don’t start from scratch but want to build on previous work.

Label Browser: Browse all labels directly and use them in your queries. This is very useful for quickly exploring and finding labels.

Log Samples: We know that many users running metric queries in Explore want to see sample log lines that contributed to that metric. This is exactly the new feature provided in Grafana 9.4! This helps with the debugging process, primarily by helping you narrow down the scope of metric queries through line filters or label filters based on log line content.

I was just starting to learn LogQL myself, and I did most of this work with Grafana’s help, as shown below:

Grafana + Loki

👍️👍️👍️

Summary

The above covers the basic workflow for implementing alerting with Loki. Before alerting, you typically need to parse and filter logs, and the specific implementation details can be adjusted according to your actual situation.

Finally, be sure to use the Grafana UI when working with LogQL — it makes writing and debugging LogQL much more convenient.

I hope this article has been helpful.

https://e-whisper.com/posts/44430/

Author

east4ming

Posted on

December 10, 2023
