Postgresql 11 逻辑复制 - 卡在“追赶”状态

2024-6-26 • tag-icon

我正在运行两个 postgresql 11 服务器 - 主服务器和从服务器（设置逻辑复制）。

我面临的问题是，今天经过数周不间断的工作，奴隶不再同步，并出现了以下错误消息：

2019-09-16 07:39:44.332 CEST [30117] ERROR:  could not send data to WAL stream: server closed the connection unexpectedly
                This probably means the server terminated abnormally
                before or while processing the request.
2019-09-16 07:39:44.539 CEST [12932] LOG:  logical replication apply worker for subscription "logical_from_master" has started
2019-09-16 07:39:44.542 CEST [27972] LOG:  background worker "logical replication worker" (PID 30117) exited with exit code 1

我之前确实看到过这个错误信息，我的过程是在 master 上增加wal_sender_timeout（更多详细信息请见此处：https://dba.stackexchange.com/questions/245855/logical-replication-in-postgresql-server-closed-the-connection-unexpectedly）

因此我想恢复复制，但是复制状态停留在追赶状态：

master=# select * from pg_stat_replication;
  pid  | usesysid | usename | application_name  |  client_addr  | client_hostname | client_port |         backend_start         | backend_xmin |  state  |   sent_lsn   |  write_lsn   |  flush_lsn   |  replay_lsn  |    write_lag    |    flush_lag    |   replay_lag    | sync_priority | sync_state
-------+----------+---------+-------------------+---------------+-----------------+-------------+-------------------------------+--------------+---------+--------------+--------------+--------------+--------------+-----------------+-----------------+-----------------+---------------+------------
 86864 |    16680 | my_user    | logical_from_master | 10.10.10.10 |                 |       46110 | 2019-09-16 12:45:56.491325+02 |              | catchup | D55/FA04D4B8 | D55/F9E74158 | D55/F9E44CD8 | D55/F9E74030 | 00:00:03.603104 | 00:00:03.603104 | 00:00:03.603104 |             0 | async
(1 row)

我尝试重启从服务器几次，使用不同的订阅启用和禁用组合 - 都无济于事，复制状态一直保持开启catchup。我可以看到sent_lsn值write_lsn在变化，所以有东西正在发送...

我不确定这是否相关，但在从属设备上我尝试以下操作来查看 wal 接收器状态：

slave=# select * from pg_stat_wal_receiver ;
 pid | status | receive_start_lsn | receive_start_tli | received_lsn | received_tli | last_msg_send_time | last_msg_receipt_time | latest_end_lsn | latest_end_time | slot_name | sender_host | sender_port | conninfo
-----+--------+-------------------+-------------------+--------------+--------------+--------------------+-----------------------+----------------+-----------------+-----------+-------------+-------------+----------
(0 rows)

但是当我做简单的事情时ps -aux | grep postgres：

postgres 26087  0.0  1.1 348788 46160 ?        S    12:45   0:00 /usr/lib/postgresql/11/bin/postgres -D /var/lib/postgresql/11/slave_logical -c config_file=/etc/postgresql/11/slave_logical/postgresql.conf
postgres 26089  0.0  0.2 349108 12080 ?        Ss   12:45   0:00 postgres: 11/slave_logical: checkpointer
postgres 26090  0.0  0.0 348996  3988 ?        Ss   12:45   0:00 postgres: 11/slave_logical: background writer
postgres 26091  0.0  0.1 348788  7204 ?        Ss   12:45   0:00 postgres: 11/slave_logical: walwriter
postgres 26092  0.0  0.1 349740  4396 ?        Ss   12:45   0:00 postgres: 11/slave_logical: autovacuum launcher
postgres 26093  0.0  0.0 170024  3028 ?        Ss   12:45   0:00 postgres: 11/slave_logical: stats collector
postgres 26094  0.0  0.1 349572  4516 ?        Ss   12:45   0:00 postgres: 11/slave_logical: logical replication launcher
postgres 26095  0.0  0.2 350164 10036 ?        Ss   12:45   0:00 postgres: 11/slave_logical: my_user db ::1(56086) idle
postgres 26125  4.5  0.9 359876 36884 ?        Ds   12:45   0:20 postgres: 11/slave_logical: logical replication worker for subscription 37614

因此，要么是逻辑复制没有产生任何结果pg_stat_wal_receiver，要么是出于某种原因，工作者详细信息没有进入系统视图。同样，进程似乎在那里（或者至少我在阅读ps -aux上面的输出后猜测）

这是我的从属配置：

wal_level=logical
max_replication_slots=2
max_logical_replication_workers=4

wal_receiver_timeout=1200000

这是我的主人：

wal_level=logical

max_replication_slots=10
max_wal_senders=10

# maximum wait time in milliseconds that the walsender process on the active master
# waits for a status message from the walreceiver process on the standby master.
wal_sender_timeout=1200000

我不知道该做什么（更糟糕的是，在这个阶段我不知道下一步该检查什么......）

你能帮助我理解我应该怎么做才能让我的从属设备赶上进度并回到原来的streaming状态吗？

编辑（12 小时后）

当我早上检查时，同步仍处于catchup状态

master=# select * from pg_stat_replication;
  pid  | usesysid | usename | application_name  |  client_addr  | client_hostname | client_port |         backend_start         | backend_xmin |  state  |   sent_lsn   |  write_lsn   |  flush_lsn   |  replay_lsn  | write_lag | flush_lag | replay_lag | sync_priority | sync_state
-------+----------+---------+-------------------+---------------+-----------------+-------------+-------------------------------+--------------+---------+--------------+--------------+--------------+--------------+-----------+-----------+------------+---------------+------------
 12965 |    16680 | my_user    | logical_from_master | 10.10.10.10 |                 |       46630 | 2019-09-17 06:40:18.801262+02 |              | catchup | D56/248E13A0 | D56/247E3908 | D56/247E3908 | D56/247E3908 |           |           |            |             0 | async
(1 row)

但当我 60 秒后再次检查时，结果集是空的......

日志现在显示同一错误的多个版本：

2019-09-16 22:43:33.841 CEST [20260] ERROR:  could not receive data from WAL stream: server closed the connection unexpectedly
                This probably means the server terminated abnormally
                before or while processing the request.
2019-09-16 22:43:33.959 CEST [26087] LOG:  background worker "logical replication worker" (PID 20260) exited with exit code 1
2019-09-16 22:43:34.112 CEST [3510] LOG:  logical replication apply worker for subscription "logical_from_master" has started
2019-09-16 23:12:01.919 CEST [3510] ERROR:  could not send data to WAL stream: server closed the connection unexpectedly
                This probably means the server terminated abnormally
                before or while processing the request.
2019-09-16 23:12:02.073 CEST [26087] LOG:  background worker "logical replication worker" (PID 3510) exited with exit code 1
2019-09-16 23:12:02.229 CEST [4467] LOG:  logical replication apply worker for subscription "logical_from_master" has started
2019-09-17 00:27:01.990 CEST [4467] ERROR:  could not send data to WAL stream: server closed the connection unexpectedly
                This probably means the server terminated abnormally
                before or while processing the request.
2019-09-17 00:27:02.131 CEST [26087] LOG:  background worker "logical replication worker" (PID 4467) exited with exit code 1
2019-09-17 00:27:02.177 CEST [6917] LOG:  logical replication apply worker for subscription "logical_from_master" has started
2019-09-17 01:05:35.121 CEST [6917] ERROR:  could not receive data from WAL stream: server closed the connection unexpectedly
                This probably means the server terminated abnormally
                before or while processing the request.
2019-09-17 01:05:35.220 CEST [26087] LOG:  background worker "logical replication worker" (PID 6917) exited with exit code 1
2019-09-17 01:05:35.252 CEST [8204] LOG:  logical replication apply worker for subscription "logical_from_master" has started
2019-09-17 01:49:08.388 CEST [8204] ERROR:  could not send data to WAL stream: server closed the connection unexpectedly
                This probably means the server terminated abnormally
                before or while processing the request.
2019-09-17 01:49:08.520 CEST [26087] LOG:  background worker "logical replication worker" (PID 8204) exited with exit code 1
2019-09-17 01:49:08.583 CEST [9549] LOG:  logical replication apply worker for subscription "logical_from_master" has started
2019-09-17 03:06:19.601 CEST [9549] ERROR:  could not receive data from WAL stream: server closed the connection unexpectedly
                This probably means the server terminated abnormally
                before or while processing the request.
2019-09-17 03:06:19.732 CEST [26087] LOG:  background worker "logical replication worker" (PID 9549) exited with exit code 1
2019-09-17 03:06:19.754 CEST [12120] LOG:  logical replication apply worker for subscription "logical_from_master" has started
2019-09-17 03:58:48.184 CEST [12120] ERROR:  could not receive data from WAL stream: server closed the connection unexpectedly
                This probably means the server terminated abnormally
                before or while processing the request.
2019-09-17 03:58:48.254 CEST [13781] LOG:  logical replication apply worker for subscription "logical_from_master" has started
2019-09-17 03:58:48.318 CEST [26087] LOG:  background worker "logical replication worker" (PID 12120) exited with exit code 1
2019-09-17 04:27:12.838 CEST [13781] ERROR:  could not receive data from WAL stream: server closed the connection unexpectedly
                This probably means the server terminated abnormally
                before or while processing the request.
2019-09-17 04:27:12.931 CEST [26087] LOG:  background worker "logical replication worker" (PID 13781) exited with exit code 1
2019-09-17 04:27:12.967 CEST [14736] LOG:  logical replication apply worker for subscription "logical_from_master" has started
2019-09-17 04:55:48.923 CEST [14736] ERROR:  could not receive data from WAL stream: server closed the connection unexpectedly
                This probably means the server terminated abnormally
                before or while processing the request.
2019-09-17 04:55:49.032 CEST [15686] LOG:  logical replication apply worker for subscription "logical_from_master" has started
2019-09-17 04:55:49.043 CEST [26087] LOG:  background worker "logical replication worker" (PID 14736) exited with exit code 1
2019-09-17 05:41:48.526 CEST [15686] ERROR:  could not receive data from WAL stream: server closed the connection unexpectedly
                This probably means the server terminated abnormally
                before or while processing the request.
2019-09-17 05:41:48.590 CEST [17164] LOG:  logical replication apply worker for subscription "logical_from_master" has started
2019-09-17 05:41:48.638 CEST [26087] LOG:  background worker "logical replication worker" (PID 15686) exited with exit code 1
2019-09-17 06:03:32.584 CEST [17164] ERROR:  could not receive data from WAL stream: server closed the connection unexpectedly
                This probably means the server terminated abnormally
                before or while processing the request.
2019-09-17 06:03:32.642 CEST [17849] LOG:  logical replication apply worker for subscription "logical_from_master" has started
2019-09-17 06:03:32.670 CEST [26087] LOG:  background worker "logical replication worker" (PID 17164) exited with exit code 1
2019-09-17 06:40:18.732 CEST [17849] ERROR:  could not receive data from WAL stream: server closed the connection unexpectedly
                This probably means the server terminated abnormally
                before or while processing the request.

为了使复制显示catchup在主服务器上，我现在必须先重新启动从服务器......

编辑 2 - 我认为变化的速度也许sent_lsn有助于确定我的服务器有多忙：

master=# select now(), * from pg_stat_replication ;
              now              |  pid   | usesysid | usename | application_name  |  client_addr  | client_hostname | client_port |        backend_start         | backend_xmin |  state  |   sent_lsn   |  write_lsn   |  flush_lsn   |  replay_lsn  | write_lag | flush_lag | replay_lag | sync_priority | sync_state
-------------------------------+--------+----------+---------+-------------------+---------------+-----------------+-------------+------------------------------+--------------+---------+--------------+--------------+--------------+--------------+-----------+-----------+------------+---------------+------------
 2019-09-17 08:31:02.547143+02 | 100194 |    16680 | my_user    | logical_from_master | 10.10.10.10 |                 |       46766 | 2019-09-17 08:23:05.01474+02 |              | catchup | D56/24B1BC88 | D56/24A39B58 | D56/24A39B58 | D56/24A39B58 |           |           |            |             0 | async
(1 row)

master=# select now(), * from pg_stat_replication ;
             now              |  pid   | usesysid | usename | application_name  |  client_addr  | client_hostname | client_port |        backend_start         | backend_xmin |  state  |   sent_lsn   |  write_lsn   |  flush_lsn   |  replay_lsn  | write_lag | flush_lag | replay_lag | sync_priority | sync_state
------------------------------+--------+----------+---------+-------------------+---------------+-----------------+-------------+------------------------------+--------------+---------+--------------+--------------+--------------+--------------+-----------+-----------+------------+---------------+------------
 2019-09-17 08:34:02.45418+02 | 100194 |    16680 | my_user    | logical_from_master | 10.10.10.10 |                 |       46766 | 2019-09-17 08:23:05.01474+02 |              | catchup | D56/24B54958 | D56/24A39B58 | D56/24A39B58 | D56/24A39B58 |           |           |            |             0 | async
(1 row)

master=# select now(), * from pg_stat_replication ;
              now              |  pid   | usesysid | usename | application_name  |  client_addr  | client_hostname | client_port |        backend_start         | backend_xmin |  state  |   sent_lsn   |  write_lsn   |  flush_lsn   |  replay_lsn  | write_lag | flush_lag | replay_lag | sync_priority | sync_state
-------------------------------+--------+----------+---------+-------------------+---------------+-----------------+-------------+------------------------------+--------------+---------+--------------+--------------+--------------+--------------+-----------+-----------+------------+---------------+------------
 2019-09-17 08:41:01.815997+02 | 100194 |    16680 | my_user    | logical_from_master | 10.10.10.10 |                 |       46766 | 2019-09-17 08:23:05.01474+02 |              | catchup | D56/24B778B0 | D56/24A39B58 | D56/24A39B58 | D56/24A39B58 |           |           |            |             0 | async
(1 row)

当我看到上述内容时，我担心这write_lsn不合逻辑sent_lsn——我不知道这是否正确。

检查点每 15 分钟发生一次，但我看不到太多 IO 通过iostat。

当我第四次执行时，select * from pg_stat_replication什么都没有出现（表明复制自行终止）。巧合的是，检查点在那个时候完成（我重新启动了从属复制并等到下一个检查点完成时间，这一次工作程序没有死亡）。

在寻找检查点时，我在日志中注意到一个令人担忧的迹象：

2019-09-17 08:49:02.155 CEST,,,35656,,5d382555.8b48,10873,,2019-07-24 11:31:01 CEST,,0,LOG,00000,"checkpoint starting: time",,,,,,,,,""
2019-09-17 08:56:32.086 CEST,,,35656,,5d382555.8b48,10874,,2019-07-24 11:31:01 CEST,,0,LOG,00000,"checkpoint complete: wrote 5671 buffers (0.1%); 0 WAL file(s) added, 0 removed, 1 recycled; write=449.927 s, sync=0.000 s, total=449.930 s; sync files=138, longest=0.000 s, average=0.000 s; distance=19690 kB, estimate=398335 kB",,,,,,,,,""
2019-09-17 09:04:02.186 CEST,,,35656,,5d382555.8b48,10875,,2019-07-24 11:31:01 CEST,,0,LOG,00000,"checkpoint starting: time",,,,,,,,,""
2019-09-17 09:04:51.376 CEST,,,35656,,5d382555.8b48,10876,,2019-07-24 11:31:01 CEST,,0,LOG,00000,"checkpoint complete: wrote 490 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=49.187 s, sync=0.000 s, total=49.190 s; sync files=41, longest=0.000 s, average=0.000 s; distance=3165 kB, estimate=358818 kB",,,,,,,,,""

session_start_time 列似乎指的是 7 月 24 日的日期。巧合的是，这正是我在主服务器上开始复制的时间...

相关内容