PostgreSQL cluster fails to start, no log of the problem, WAL redo message

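I'm having trouble with a system that I did not set up or configure myself, although most parts of it seem to live in more or less default locations. My current theory is that the PostgreSQL database on this system (Ubuntu 16.04.3 LTS) stopped working when the storage on the server filled up completely.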

The storage has since been increased and some old files have been deleted, but the PostgreSQL service/cluster refuses to start. See below:

ubuntu@ip-<my-ip-here>:~$ sudo pg_ctlcluster 11 main start
Job for postgresql@11-main.service failed because a configured resource limit was exceeded. See "systemctl status postgresql@11-main.service" and "journalctl -xe" for details.

ubuntu@ip-<my-ip-here>:~$ systemctl status postgresql@11-main.service
● postgresql@11-main.service - PostgreSQL Cluster 11-main
   Loaded: loaded (/lib/systemd/system/postgresql@.service; disabled; vendor preset: enabled)
   Active: failed (Result: resources) since Mon 2023-01-09 16:24:10 AEST; 7s ago
  Process: 2281 ExecStart=/usr/bin/pg_ctlcluster --skip-systemctl-redirect %i start (code=exited, status=1/FAILURE)

Jan 09 16:24:10 ip-172-31-23-196 postgresql@11-main[2281]: 2023-01-09 16:24:10.052 AEST [2289] CONTEXT:  WAL redo at 7F/200AA78 for XLOG/FPI:
Jan 09 16:24:10 ip-172-31-23-196 postgresql@11-main[2281]: 2023-01-09 16:24:10.053 AEST [2288] LOG:  startup process (PID 2289) exited with exit code 1
Jan 09 16:24:10 ip-172-31-23-196 postgresql@11-main[2281]: 2023-01-09 16:24:10.053 AEST [2288] LOG:  aborting startup due to startup process failure
Jan 09 16:24:10 ip-172-31-23-196 postgresql@11-main[2281]: 2023-01-09 16:24:10.059 AEST [2288] LOG:  database system is shut down
Jan 09 16:24:10 ip-172-31-23-196 postgresql@11-main[2281]: pg_ctl: could not start server
Jan 09 16:24:10 ip-172-31-23-196 postgresql@11-main[2281]: Examine the log output.
Jan 09 16:24:10 ip-172-31-23-196 systemd[1]: postgresql@11-main.service: Can't open PID file /run/postgresql/11-main.pid (yet?) after start: No such file or directory
Jan 09 16:24:10 ip-172-31-23-196 systemd[1]: Failed to start PostgreSQL Cluster 11-main.
Jan 09 16:24:10 ip-172-31-23-196 systemd[1]: postgresql@11-main.service: Unit entered failed state.
Jan 09 16:24:10 ip-172-31-23-196 systemd[1]: postgresql@11-main.service: Failed with result 'resources'.

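The "Can't open PID file /run/postgresql/11-main.pid (yet?) after start: No such file or directory" part looks like a red herring: a problem that only shows up after the real problem has already occurred. (Also, if I create that file and give it the permissions and ownership of the postgres user, it gets removed again, as expected.)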

So I think the real problem lies with "CONTEXT: WAL redo at 7F/200AA78 for XLOG/FPI:" and the lines that follow it, "LOG: startup process (PID 2289) exited with exit code 1" and "aborting startup due to startup process failure". Something evidently went wrong during the WAL redo.

The advice "Examine the log output." sounds sensible, but I can't find the log in any of the places I would expect:

ubuntu@ip-<my-ip-here>:~$ ls /etc/postgresql/11/main
conf.d  environment  pg_ctl.conf  pg_hba.conf  pg_ident.conf  postgresql.conf  start.conf

And postgresql.conf gives no clue about where to find the log either. So I assume PostgreSQL is unable to write it at all. Is that right?

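Any suggestions on how to get the server going again are welcome. I wouldn't mind losing whatever data it is trying to recover from the WAL redo; I'm mainly interested in getting the system back up and running.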

Answer 1

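As user @user1686 correctly pointed out, the logs for PostgreSQL on Ubuntu can be found in /var/log/postgresql. In my case the file is postgresql-11-main.log, since I'm interested in the main cluster and am running PostgreSQL 11.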
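For anyone else hunting for the log: the file name above follows the pattern postgresql-<version>-<cluster>.log. Here is a minimal sketch of that naming rule. The helper name pg_log_path is my own invention, and it assumes the packaged default directory /var/log/postgresql has not been changed.

```shell
#!/bin/sh
# Hypothetical helper (not part of PostgreSQL or postgresql-common):
# build the default log path for a Debian/Ubuntu-packaged cluster.
# Assumes the default log directory /var/log/postgresql is in use.
pg_log_path() {
    version=$1
    cluster=$2
    printf '/var/log/postgresql/postgresql-%s-%s.log\n' "$version" "$cluster"
}

# Example: show the last 50 lines of the log for cluster 11/main.
#   sudo tail -n 50 "$(pg_log_path 11 main)"
```

On Debian/Ubuntu, pg_lsclusters also prints the log file of every cluster, which is probably the quicker route when the location is not the default.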

The log for my cluster recorded the following:

2023-01-10 10:36:39.271 AEST [11564] LOG:  database system was interrupted while in recovery at 2023-01-09 16:24:10 AEST
2023-01-10 10:36:39.271 AEST [11564] HINT:  This probably means that some data is corrupted and you will have to use the last backup for recovery.
2023-01-10 10:36:39.315 AEST [11564] LOG:  could not open directory "pg_tblspc/17293/lost+found": Permission denied
2023-01-10 10:36:39.319 AEST [11564] LOG:  database system was not properly shut down; automatic recovery in progress
2023-01-10 10:36:39.323 AEST [11564] LOG:  redo starts at 7F/1FA68D0
2023-01-10 10:36:39.335 AEST [11564] FATAL:  could not extend file "pg_tblspc/17293/PG_11_201809051/16386/30550227": No space left on device

This explains why the redo failed: there is no space left on the device that holds pg_tblspc/17293.

After running the following:

sudo ls /var/lib/postgresql/11/main/pg_tblspc -alF
...
lrwxrwxrwx  1 postgres postgres   11 Sep  4  2019 17293 -> /trace_data/

and:

ubuntu@ip-<my-ip-here>:~$ df /trace_data/
Filesystem     1K-blocks      Used Available Use% Mounted on
/dev/xvdf      206293688 196609516         0 100% /trace_data
ubuntu@ip-<my-ip-here>:~$ lsblk
NAME    MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
xvda    202:0    0  100G  0 disk
└─xvda1 202:1    0  100G  0 part /
xvdf    202:80   0  200G  0 disk /trace_data
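The two checks above (resolving the tablespace symlink, then looking at the filesystem behind it) can be combined. The following is only a sketch: the function name tblspc_usage is hypothetical, and it relies on readlink -f and df -P behaving as they do on a typical Linux system.

```shell
#!/bin/sh
# Hypothetical helper: for every tablespace symlink in a pg_tblspc directory,
# print the link name (the tablespace OID), the resolved target directory,
# and the space still available on the filesystem holding that target.
tblspc_usage() {
    dir=$1
    for link in "$dir"/*; do
        [ -L "$link" ] || continue      # skip anything that is not a symlink
        target=$(readlink -f "$link")
        # df -P prints one POSIX-format line per filesystem;
        # the fourth column of that line is the available space in KB.
        avail=$(df -Pk "$target" | awk 'NR==2 {print $4}')
        printf '%s -> %s (%s KB available)\n' "$(basename "$link")" "$target" "$avail"
    done
}

# Example (run as a user that may read the data directory, e.g. root):
#   tblspc_usage /var/lib/postgresql/11/main/pg_tblspc
```

A tablespace whose line reports 0 KB available is the likely culprit for a "No space left on device" error during redo.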

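I realized that I had to increase the size of the xvdf volume. I did this by increasing the volume size on the virtual machine and then extending the filesystem with sudo resize2fs /dev/xvdf (since it is an ext4 filesystem). After that, I was able to start the PostgreSQL cluster again and move on to my other problems.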
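As a rough sketch of that sequence, with a guard so that the cluster is only started once the filesystem actually has room again: the helper has_free_kb and the 1 GiB threshold are my own assumptions, not something PostgreSQL requires.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the filesystem holding the given
# path has at least min_kb kilobytes available, as reported by df.
has_free_kb() {
    mount_path=$1
    min_kb=$2
    avail=$(df -Pk "$mount_path" | awk 'NR==2 {print $4}')
    [ "$avail" -ge "$min_kb" ]
}

# Sketch of the recovery steps, after growing the volume at the VM level:
#   sudo resize2fs /dev/xvdf               # grow the ext4 filesystem to fill the device
#   has_free_kb /trace_data 1048576 \
#       && sudo pg_ctlcluster 11 main start   # 1048576 KB = 1 GiB, an arbitrary threshold
```

The guard is optional; it just avoids another failed recovery attempt if the resize did not take effect.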
