Recently, however, an operational mistake corrupted the WAL of the Postgres database, and the Postgres Pod went into CrashLoopBackOff. The specific errors were as follows:
Postgres shutdown exit code 1:
2023-09-27 02:32:17.127 UTC [1] LOG: received fast shutdown request
2023-09-27 02:32:17.181 UTC [1] LOG: aborting any active transactions
2023-09-27 02:32:17.434 UTC [1] LOG: background worker "logical replication launcher" (PID 26) exited with exit code 1
2023-09-27 02:32:17.481 UTC [21] LOG: shutting down
2023-09-27 02:32:17.880 UTC [1] LOG: database system is shut down
Postgres startup failing with “invalid resource manager ID in primary checkpoint record” and “could not locate a valid checkpoint record”:
2023-09-27 02:33:23.189 UTC [1] LOG: starting PostgreSQL 13.5 on x86_64-pc-linux-musl, compiled by gcc (Alpine 10.3.1_git20211027) 10.3.1 20211027, 64-bit
2023-09-27 02:33:23.190 UTC [1] LOG: listening on IPv4 address "0.0.0.0", port 5432
2023-09-27 02:33:23.190 UTC [1] LOG: listening on IPv6 address "::", port 5432
2023-09-27 02:33:23.199 UTC [1] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2023-09-27 02:33:23.210 UTC [21] LOG: database system was shut down at 2023-09-27 02:32:22 UTC
2023-09-27 02:33:23.210 UTC [21] LOG: invalid resource manager ID in primary checkpoint record
2023-09-27 02:33:23.210 UTC [21] PANIC: could not locate a valid checkpoint record
2023-09-27 02:33:24.657 UTC [1] LOG: startup process (PID 21) was terminated by signal 6: Aborted
2023-09-27 02:33:24.657 UTC [1] LOG: aborting startup due to startup process failure
2023-09-27 02:33:24.659 UTC [1] LOG: database system is shut down
As shown above, the WAL files are corrupted. How do we recover from this?
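For reference, output like the above can usually be pulled from a crash-looping Pod by asking kubectl for the logs of the previous (terminated) container. The sketch below is only an illustration: the Deployment name postgres and the namespace default are assumptions, not values from this setup.

```shell
#!/bin/sh
# Hypothetical names: adjust to your own cluster.
DEPLOY="${DEPLOY:-postgres}"        # assumed Deployment name
NAMESPACE="${NAMESPACE:-default}"   # assumed namespace

# Succeeds if the logs of the previously crashed container contain
# either of the two checkpoint errors quoted above.
wal_checkpoint_damaged() {
  kubectl -n "$NAMESPACE" logs "deploy/$DEPLOY" --previous 2>/dev/null |
    grep -q -e 'invalid resource manager ID in primary checkpoint record' \
            -e 'could not locate a valid checkpoint record'
}
```

Calling wal_checkpoint_damaged && echo "WAL checkpoint record is damaged" gives a quick yes/no answer without reading the whole log.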
Recovery Steps
│ 🐾Warning:
│
│ The goal is to get Postgres started so the application can resume normal operation. Some data loss may occur.
This is a TTRSS feed application for my personal use only, so as long as it starts up, a little data loss is acceptable.
First, since the Postgres Pod is in CrashLoopBackOff and nothing can be done inside it, the top priority is to keep the Pod running instead of exiting. This can be achieved by overriding the container's command and args in the Deployment, as follows:
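The manifest itself is not reproduced here. As a minimal sketch, the same override can be applied with kubectl patch rather than by editing the YAML by hand. The Deployment name postgres, the namespace default, and the container index 0 are assumptions for illustration. The commands are wrapped in a function so that nothing runs until it is called.

```shell
#!/bin/sh
# Hypothetical names: adjust to your own cluster.
DEPLOY="${DEPLOY:-postgres}"        # assumed Deployment name
NAMESPACE="${NAMESPACE:-default}"   # assumed namespace

# JSON Patch that overrides the entrypoint of the first container
# (index 0 is an assumption) so the Pod idles instead of starting Postgres.
HOLD_PATCH='[
  {"op": "add", "path": "/spec/template/spec/containers/0/command", "value": ["sh", "-c"]},
  {"op": "add", "path": "/spec/template/spec/containers/0/args", "value": ["tail -f /dev/null"]}
]'

hold_pod_open() {
  kubectl -n "$NAMESPACE" patch deployment "$DEPLOY" --type=json -p "$HOLD_PATCH"
  # Wait for the replacement Pod to come up with the new command.
  kubectl -n "$NAMESPACE" rollout status deployment "$DEPLOY" --timeout=120s
}
```

Run hold_pod_open once. If the container defines liveness or readiness probes against Postgres, the rollout may not report ready while Postgres is not running.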
Overriding the entrypoint with sh -c "tail -f /dev/null" keeps the container alive without starting Postgres, so the Pod stays running. An equivalent command such as while true; do sleep 30; done; achieves the same effect.
Once the Pod is running stably, exec into it using kubectl exec -it:
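The commands run inside the Pod are not shown here. For the "invalid resource manager ID in primary checkpoint record" PANIC, a common last-resort repair is pg_resetwal, which discards the damaged WAL and writes a fresh checkpoint, possibly losing recent transactions. The sketch below assumes that this is the repair being applied, that the Deployment is named postgres in namespace default, that the image provides su and a postgres OS user, and that PGDATA is set in the container. pg_resetwal refuses to run as root, hence the su.

```shell
#!/bin/sh
# Hypothetical names: adjust to your own cluster.
DEPLOY="${DEPLOY:-postgres}"        # assumed Deployment name
NAMESPACE="${NAMESPACE:-default}"   # assumed namespace

# Run one command as the postgres OS user inside the idle Pod.
# $PGDATA is expanded inside the container, not on your workstation.
in_pod_as_postgres() {
  kubectl -n "$NAMESPACE" exec "deploy/$DEPLOY" -- su postgres -c "$1"
}

reset_wal() {
  # Dry run first: prints the values pg_resetwal would write, changes nothing.
  in_pod_as_postgres 'pg_resetwal -n "$PGDATA"' || return 1
  # The real reset. This discards WAL, so recent transactions may be lost.
  # If pg_resetwal refuses, -f forces it, at a higher risk of inconsistency.
  in_pod_as_postgres 'pg_resetwal "$PGDATA"'
}
```

The same two pg_resetwal commands can be typed by hand in an interactive shell opened with kubectl exec -it. Taking a copy of the data directory before calling reset_wal is a sensible precaution if the volume allows it.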
Once that succeeds, exit the Pod, then remove the command and args from the Deployment. Postgres should now start normally:
2023-09-27 04:03:25.172 UTC [1] LOG: starting PostgreSQL 13.5 on x86_64-pc-linux-musl, compiled by gcc (Alpine 10.3.1_git20211027) 10.3.1 20211027, 64-bit
2023-09-27 04:03:25.173 UTC [1] LOG: listening on IPv4 address "0.0.0.0", port 5432
2023-09-27 04:03:25.173 UTC [1] LOG: listening on IPv6 address "::", port 5432
2023-09-27 04:03:25.179 UTC [1] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2023-09-27 04:03:25.187 UTC [20] LOG: database system was shut down at 2023-09-27 04:02:42 UTC
2023-09-27 04:03:25.210 UTC [1] LOG: database system is ready to accept connections
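The final step described above (removing the command and args override, then confirming a clean start) could be scripted roughly as follows. This is a sketch under assumptions: the Deployment name postgres, the namespace default, and container index 0 are hypothetical, and the Deployment is assumed to have had no command or args of its own before the override was added.

```shell
#!/bin/sh
# Hypothetical names: adjust to your own cluster.
DEPLOY="${DEPLOY:-postgres}"        # assumed Deployment name
NAMESPACE="${NAMESPACE:-default}"   # assumed namespace

# JSON Patch that drops the temporary override from the first container,
# so the image's normal entrypoint starts Postgres again.
RELEASE_PATCH='[
  {"op": "remove", "path": "/spec/template/spec/containers/0/command"},
  {"op": "remove", "path": "/spec/template/spec/containers/0/args"}
]'

release_pod() {
  kubectl -n "$NAMESPACE" patch deployment "$DEPLOY" --type=json -p "$RELEASE_PATCH"
  kubectl -n "$NAMESPACE" rollout status deployment "$DEPLOY" --timeout=120s
}

# Succeeds once the startup log contains the final line shown above.
postgres_is_ready() {
  kubectl -n "$NAMESPACE" logs "deploy/$DEPLOY" |
    grep -q 'database system is ready to accept connections'
}
```

A typical call would be release_pod && postgres_is_ready && echo "Postgres is back".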