How to Recover a Corrupted Postgres Database Running in a Container

This article was last updated on: June 29, 2026

Introduction

In "Deploying a Full Self-Hosted RSS Solution on K8S - RssHub + Tiny Tiny RSS", I described how to deploy RssHub + Tiny Tiny RSS into a K8s cluster. TTRSS stores its data in Postgres, which was also deployed as a container in K8s.

Recently, however, an operational mistake corrupted the WAL of the Postgres database, and the Postgres Pod went into CrashLoopBackOff. The specific errors were as follows:

Postgres shutdown exit code 1:

2023-09-27 02:32:17.127 UTC [1] LOG:  received fast shutdown request
2023-09-27 02:32:17.181 UTC [1] LOG:  aborting any active transactions
2023-09-27 02:32:17.434 UTC [1] LOG:  background worker "logical replication launcher" (PID 26) exited with exit code 1
2023-09-27 02:32:17.481 UTC [21] LOG:  shutting down
2023-09-27 02:32:17.880 UTC [1] LOG:  database system is shut down

On the next start, Postgres reports “invalid resource manager ID in primary checkpoint record” and “could not locate a valid checkpoint record”:

2023-09-27 02:33:23.189 UTC [1] LOG:  starting PostgreSQL 13.5 on x86_64-pc-linux-musl, compiled by gcc (Alpine 10.3.1_git20211027) 10.3.1 20211027, 64-bit
2023-09-27 02:33:23.190 UTC [1] LOG:  listening on IPv4 address "0.0.0.0", port 5432
2023-09-27 02:33:23.190 UTC [1] LOG:  listening on IPv6 address "::", port 5432
2023-09-27 02:33:23.199 UTC [1] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2023-09-27 02:33:23.210 UTC [21] LOG:  database system was shut down at 2023-09-27 02:32:22 UTC
2023-09-27 02:33:23.210 UTC [21] LOG:  invalid resource manager ID in primary checkpoint record
2023-09-27 02:33:23.210 UTC [21] PANIC:  could not locate a valid checkpoint record
2023-09-27 02:33:24.657 UTC [1] LOG:  startup process (PID 21) was terminated by signal 6: Aborted
2023-09-27 02:33:24.657 UTC [1] LOG:  aborting startup due to startup process failure
2023-09-27 02:33:24.659 UTC [1] LOG:  database system is shut down

As shown above, the WAL files are corrupted. How do we recover from this?
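
Before attempting any repair, it helps to confirm that the crash really is the checkpoint failure shown above and not some other startup error. The following is a minimal sketch, assuming the container logs have been saved to a file; the function name has_checkpoint_panic is hypothetical and not part of any Postgres or Kubernetes tooling.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit code 0) if the given log file contains
# the PANIC that Postgres emits when it cannot find a usable checkpoint record.
has_checkpoint_panic() {
  # Postgres pads the severity with spaces, so accept one or more after the colon.
  grep -Eq 'PANIC: +could not locate a valid checkpoint record' "$1"
}
```

For example, with the crashing Pod's name in $POD, the logs of the previous container run could be captured with `k3s kubectl logs "$POD" -n rsshub --previous > pg.log`, followed by `has_checkpoint_panic pg.log && echo "checkpoint record is unreadable"`.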

Recovery Steps

│ 🐾Warning:
│
│ The goal is to get Postgres started so the application can resume normal operation. Some data loss may occur.

This is a TTRSS feed application for my personal use only, so as long as it starts up, a little data loss is acceptable.

First, because the Postgres Pod is in CrashLoopBackOff, nothing can be done inside it, so the top priority is to keep the container running without starting Postgres. This can be achieved by overriding the container's command and args in the Deployment, as follows:

apiVersion: apps/v1
kind: Deployment
metadata:
  ...
spec:
  ...
  template:
    spec:
      containers:
      - image: postgres:13-alpine
        imagePullPolicy: IfNotPresent
        name: postgres
        command: ["sh"]
        args: ["-c", "tail -f /dev/null"]
...

As shown above, sh -c "tail -f /dev/null" keeps the container alive without starting Postgres. A similar command such as while true; do sleep 30; done has the same effect.
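
The same override can also be applied without editing the manifest by hand. The following is a rough sketch using a strategic merge patch. The Deployment name database-postgres is an assumption inferred from the Pod name used later in this article, and the function names are hypothetical; the container name postgres and the namespace rsshub come from the manifest and commands in this article.

```shell
#!/bin/sh
# Assumption: the Deployment is called "database-postgres"; adjust as needed.
DEPLOY="${DEPLOY:-database-postgres}"
NS="${NS:-rsshub}"

# Print a strategic merge patch that overrides the entrypoint of the
# container named "postgres" so it idles instead of starting Postgres.
keepalive_patch() {
  cat <<'EOF'
{"spec":{"template":{"spec":{"containers":[{"name":"postgres","command":["sh"],"args":["-c","tail -f /dev/null"]}]}}}}
EOF
}

# Apply the patch to the Deployment (needs access to the cluster).
apply_keepalive() {
  k3s kubectl patch deployment "$DEPLOY" -n "$NS" --type strategic -p "$(keepalive_patch)"
}
```

Calling apply_keepalive triggers a new rollout, so the Pod name changes afterwards. Printing the patch with keepalive_patch first makes it easy to review what will be sent.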

Once the Pod is running stably, exec into it using kubectl exec -it:

`k3s kubectl exec -it database-postgres-56cff865bb-92pcx -n rsshub -- /bin/sh`

Then switch to the postgres user:

`su - postgres`

│ 🐾Warning:
│
│ You must switch to the postgres user before running the following commands.
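
Because pg_resetwal clears the write-ahead log and rewrites control information, it may be worth copying the data directory first so that a failed attempt can be retried. The following is a minimal sketch, assuming there is enough free space at the destination and that the postgres user can write there; the function name backup_pgdata and the backup directory naming are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: copy a data directory into a timestamped directory
# under the given parent, then print the path of the copy.
backup_pgdata() {
  src="${1%/}"                                          # drop a trailing slash
  dest="${2%/}/pgdata-backup-$(date +%Y%m%d%H%M%S)"
  # -a preserves ownership, permissions, and timestamps.
  cp -a "$src" "$dest" && printf '%s\n' "$dest"
}
```

For example, `backup_pgdata /var/lib/postgresql/data /tmp` copies the directory used in this article. Note that /tmp normally lives in the container's writable layer, so such a copy disappears when the Pod is replaced and is only a short-term safety net.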

From here it is straightforward: use pg_resetwal to reset the WAL.

First, do a --dry-run to check the output:

`pg_resetwal --dry-run /var/lib/postgresql/data/`

If the result looks as expected, run it for real:

`pg_resetwal /var/lib/postgresql/data/`

On success, it prints: Write-ahead log reset

After success, exit the Pod. Remove the command and args from the Deployment, and Postgres should start normally:

2023-09-27 04:03:25.172 UTC [1] LOG:  starting PostgreSQL 13.5 on x86_64-pc-linux-musl, compiled by gcc (Alpine 10.3.1_git20211027) 10.3.1 20211027, 64-bit
2023-09-27 04:03:25.173 UTC [1] LOG:  listening on IPv4 address "0.0.0.0", port 5432
2023-09-27 04:03:25.173 UTC [1] LOG:  listening on IPv6 address "::", port 5432
2023-09-27 04:03:25.179 UTC [1] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2023-09-27 04:03:25.187 UTC [20] LOG:  database system was shut down at 2023-09-27 04:02:42 UTC
2023-09-27 04:03:25.210 UTC [1] LOG:  database system is ready to accept connections
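
A database that starts after pg_resetwal may still contain inconsistent data from partially committed transactions, and the PostgreSQL documentation recommends dumping the data, running initdb, and restoring afterwards. The following is a minimal sketch of the dump step only, assuming the superuser is postgres (the user may differ in other deployments); the function name dump_all and the output path are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: dump every database in the cluster to the given file.
# Fails if pg_dumpall reports an error or the resulting file is empty.
dump_all() {
  out="$1"
  # Assumption: the superuser is "postgres" unless PGUSER says otherwise.
  pg_dumpall -U "${PGUSER:-postgres}" > "$out" && [ -s "$out" ]
}
```

Run inside the Pod as, for example, `dump_all /tmp/all.sql`, and copy the file out of the container before doing anything further with the data directory.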

Done 🎉🎉🎉

CloudNative

#K8S #BestPractice #RSS #RssHub #TTRss #DB #Postgres

https://e-whisper.com/posts/53679/

Author

east4ming

Posted on

September 27, 2023
