我正在运行两个 postgresql 11 服务器 - 主服务器和从服务器(设置逻辑复制)。
我面临的问题是,今天经过数周不间断的工作,奴隶不再同步,并出现了以下错误消息:
2019-09-16 07:39:44.332 CEST [30117] ERROR: could not send data to WAL stream: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
2019-09-16 07:39:44.539 CEST [12932] LOG: logical replication apply worker for subscription "logical_from_master" has started
2019-09-16 07:39:44.542 CEST [27972] LOG: background worker "logical replication worker" (PID 30117) exited with exit code 1
我之前确实看到过这个错误信息,我的过程是在 master 上增加wal_sender_timeout
(更多详细信息请见此处:https://dba.stackexchange.com/questions/245855/logical-replication-in-postgresql-server-closed-the-connection-unexpectedly)
因此我想恢复复制,但是复制状态停留在追赶状态:
master=# select * from pg_stat_replication;
pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | backend_xmin | state | sent_lsn | write_lsn | flush_lsn | replay_lsn | write_lag | flush_lag | replay_lag | sync_priority | sync_state
-------+----------+---------+-------------------+---------------+-----------------+-------------+-------------------------------+--------------+---------+--------------+--------------+--------------+--------------+-----------------+-----------------+-----------------+---------------+------------
86864 | 16680 | my_user | logical_from_master | 10.10.10.10 | | 46110 | 2019-09-16 12:45:56.491325+02 | | catchup | D55/FA04D4B8 | D55/F9E74158 | D55/F9E44CD8 | D55/F9E74030 | 00:00:03.603104 | 00:00:03.603104 | 00:00:03.603104 | 0 | async
(1 row)
我尝试重启从服务器几次,使用不同的订阅启用和禁用组合 - 都无济于事,复制状态一直保持开启catchup
。我可以看到sent_lsn
值write_lsn
在变化,所以有东西正在发送...
我不确定这是否相关,但在从属设备上我尝试以下操作来查看 wal 接收器状态:
slave=# select * from pg_stat_wal_receiver ;
pid | status | receive_start_lsn | receive_start_tli | received_lsn | received_tli | last_msg_send_time | last_msg_receipt_time | latest_end_lsn | latest_end_time | slot_name | sender_host | sender_port | conninfo
-----+--------+-------------------+-------------------+--------------+--------------+--------------------+-----------------------+----------------+-----------------+-----------+-------------+-------------+----------
(0 rows)
但是当我做简单的事情时ps -aux | grep postgres
:
postgres 26087 0.0 1.1 348788 46160 ? S 12:45 0:00 /usr/lib/postgresql/11/bin/postgres -D /var/lib/postgresql/11/slave_logical -c config_file=/etc/postgresql/11/slave_logical/postgresql.conf
postgres 26089 0.0 0.2 349108 12080 ? Ss 12:45 0:00 postgres: 11/slave_logical: checkpointer
postgres 26090 0.0 0.0 348996 3988 ? Ss 12:45 0:00 postgres: 11/slave_logical: background writer
postgres 26091 0.0 0.1 348788 7204 ? Ss 12:45 0:00 postgres: 11/slave_logical: walwriter
postgres 26092 0.0 0.1 349740 4396 ? Ss 12:45 0:00 postgres: 11/slave_logical: autovacuum launcher
postgres 26093 0.0 0.0 170024 3028 ? Ss 12:45 0:00 postgres: 11/slave_logical: stats collector
postgres 26094 0.0 0.1 349572 4516 ? Ss 12:45 0:00 postgres: 11/slave_logical: logical replication launcher
postgres 26095 0.0 0.2 350164 10036 ? Ss 12:45 0:00 postgres: 11/slave_logical: my_user db ::1(56086) idle
postgres 26125 4.5 0.9 359876 36884 ? Ds 12:45 0:20 postgres: 11/slave_logical: logical replication worker for subscription 37614
因此,要么是逻辑复制没有产生任何结果pg_stat_wal_receiver
,要么是出于某种原因,工作者详细信息没有进入系统视图。同样,进程似乎在那里(或者至少我在阅读ps -aux
上面的输出后猜测)
这是我的从属配置:
wal_level=logical
max_replication_slots=2
max_logical_replication_workers=4
wal_receiver_timeout=1200000
这是我的主人:
wal_level=logical
max_replication_slots=10
max_wal_senders=10
# maximum wait time in milliseconds that the walsender process on the active master
# waits for a status message from the walreceiver process on the standby master.
wal_sender_timeout=1200000
我不知道该做什么(更糟糕的是,在这个阶段我不知道下一步该检查什么......)
你能帮助我理解我应该怎么做才能让我的从属设备赶上进度并回到原来的streaming
状态吗?
编辑(12 小时后)
当我早上检查时,同步仍处于catchup
状态
master=# select * from pg_stat_replication;
pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | backend_xmin | state | sent_lsn | write_lsn | flush_lsn | replay_lsn | write_lag | flush_lag | replay_lag | sync_priority | sync_state
-------+----------+---------+-------------------+---------------+-----------------+-------------+-------------------------------+--------------+---------+--------------+--------------+--------------+--------------+-----------+-----------+------------+---------------+------------
12965 | 16680 | my_user | logical_from_master | 10.10.10.10 | | 46630 | 2019-09-17 06:40:18.801262+02 | | catchup | D56/248E13A0 | D56/247E3908 | D56/247E3908 | D56/247E3908 | | | | 0 | async
(1 row)
但当我 60 秒后再次检查时,结果集是空的......
日志现在显示同一错误的多个版本:
2019-09-16 22:43:33.841 CEST [20260] ERROR: could not receive data from WAL stream: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
2019-09-16 22:43:33.959 CEST [26087] LOG: background worker "logical replication worker" (PID 20260) exited with exit code 1
2019-09-16 22:43:34.112 CEST [3510] LOG: logical replication apply worker for subscription "logical_from_master" has started
2019-09-16 23:12:01.919 CEST [3510] ERROR: could not send data to WAL stream: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
2019-09-16 23:12:02.073 CEST [26087] LOG: background worker "logical replication worker" (PID 3510) exited with exit code 1
2019-09-16 23:12:02.229 CEST [4467] LOG: logical replication apply worker for subscription "logical_from_master" has started
2019-09-17 00:27:01.990 CEST [4467] ERROR: could not send data to WAL stream: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
2019-09-17 00:27:02.131 CEST [26087] LOG: background worker "logical replication worker" (PID 4467) exited with exit code 1
2019-09-17 00:27:02.177 CEST [6917] LOG: logical replication apply worker for subscription "logical_from_master" has started
2019-09-17 01:05:35.121 CEST [6917] ERROR: could not receive data from WAL stream: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
2019-09-17 01:05:35.220 CEST [26087] LOG: background worker "logical replication worker" (PID 6917) exited with exit code 1
2019-09-17 01:05:35.252 CEST [8204] LOG: logical replication apply worker for subscription "logical_from_master" has started
2019-09-17 01:49:08.388 CEST [8204] ERROR: could not send data to WAL stream: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
2019-09-17 01:49:08.520 CEST [26087] LOG: background worker "logical replication worker" (PID 8204) exited with exit code 1
2019-09-17 01:49:08.583 CEST [9549] LOG: logical replication apply worker for subscription "logical_from_master" has started
2019-09-17 03:06:19.601 CEST [9549] ERROR: could not receive data from WAL stream: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
2019-09-17 03:06:19.732 CEST [26087] LOG: background worker "logical replication worker" (PID 9549) exited with exit code 1
2019-09-17 03:06:19.754 CEST [12120] LOG: logical replication apply worker for subscription "logical_from_master" has started
2019-09-17 03:58:48.184 CEST [12120] ERROR: could not receive data from WAL stream: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
2019-09-17 03:58:48.254 CEST [13781] LOG: logical replication apply worker for subscription "logical_from_master" has started
2019-09-17 03:58:48.318 CEST [26087] LOG: background worker "logical replication worker" (PID 12120) exited with exit code 1
2019-09-17 04:27:12.838 CEST [13781] ERROR: could not receive data from WAL stream: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
2019-09-17 04:27:12.931 CEST [26087] LOG: background worker "logical replication worker" (PID 13781) exited with exit code 1
2019-09-17 04:27:12.967 CEST [14736] LOG: logical replication apply worker for subscription "logical_from_master" has started
2019-09-17 04:55:48.923 CEST [14736] ERROR: could not receive data from WAL stream: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
2019-09-17 04:55:49.032 CEST [15686] LOG: logical replication apply worker for subscription "logical_from_master" has started
2019-09-17 04:55:49.043 CEST [26087] LOG: background worker "logical replication worker" (PID 14736) exited with exit code 1
2019-09-17 05:41:48.526 CEST [15686] ERROR: could not receive data from WAL stream: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
2019-09-17 05:41:48.590 CEST [17164] LOG: logical replication apply worker for subscription "logical_from_master" has started
2019-09-17 05:41:48.638 CEST [26087] LOG: background worker "logical replication worker" (PID 15686) exited with exit code 1
2019-09-17 06:03:32.584 CEST [17164] ERROR: could not receive data from WAL stream: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
2019-09-17 06:03:32.642 CEST [17849] LOG: logical replication apply worker for subscription "logical_from_master" has started
2019-09-17 06:03:32.670 CEST [26087] LOG: background worker "logical replication worker" (PID 17164) exited with exit code 1
2019-09-17 06:40:18.732 CEST [17849] ERROR: could not receive data from WAL stream: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
为了使复制显示catchup
在主服务器上,我现在必须先重新启动从服务器......
编辑 2 - 我认为变化的速度也许sent_lsn
有助于确定我的服务器有多忙:
master=# select now(), * from pg_stat_replication ;
now | pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | backend_xmin | state | sent_lsn | write_lsn | flush_lsn | replay_lsn | write_lag | flush_lag | replay_lag | sync_priority | sync_state
-------------------------------+--------+----------+---------+-------------------+---------------+-----------------+-------------+------------------------------+--------------+---------+--------------+--------------+--------------+--------------+-----------+-----------+------------+---------------+------------
2019-09-17 08:31:02.547143+02 | 100194 | 16680 | my_user | logical_from_master | 10.10.10.10 | | 46766 | 2019-09-17 08:23:05.01474+02 | | catchup | D56/24B1BC88 | D56/24A39B58 | D56/24A39B58 | D56/24A39B58 | | | | 0 | async
(1 row)
master=# select now(), * from pg_stat_replication ;
now | pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | backend_xmin | state | sent_lsn | write_lsn | flush_lsn | replay_lsn | write_lag | flush_lag | replay_lag | sync_priority | sync_state
------------------------------+--------+----------+---------+-------------------+---------------+-----------------+-------------+------------------------------+--------------+---------+--------------+--------------+--------------+--------------+-----------+-----------+------------+---------------+------------
2019-09-17 08:34:02.45418+02 | 100194 | 16680 | my_user | logical_from_master | 10.10.10.10 | | 46766 | 2019-09-17 08:23:05.01474+02 | | catchup | D56/24B54958 | D56/24A39B58 | D56/24A39B58 | D56/24A39B58 | | | | 0 | async
(1 row)
master=# select now(), * from pg_stat_replication ;
now | pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | backend_xmin | state | sent_lsn | write_lsn | flush_lsn | replay_lsn | write_lag | flush_lag | replay_lag | sync_priority | sync_state
-------------------------------+--------+----------+---------+-------------------+---------------+-----------------+-------------+------------------------------+--------------+---------+--------------+--------------+--------------+--------------+-----------+-----------+------------+---------------+------------
2019-09-17 08:41:01.815997+02 | 100194 | 16680 | my_user | logical_from_master | 10.10.10.10 | | 46766 | 2019-09-17 08:23:05.01474+02 | | catchup | D56/24B778B0 | D56/24A39B58 | D56/24A39B58 | D56/24A39B58 | | | | 0 | async
(1 row)
当我看到上述内容时,我担心这write_lsn
不合逻辑sent_lsn
——我不知道这是否正确。
检查点每 15 分钟发生一次,但我看不到太多 IO 通过iostat
。
当我第四次执行时,select * from pg_stat_replication
什么都没有出现(表明复制自行终止)。巧合的是,检查点在那个时候完成(我重新启动了从属复制并等到下一个检查点完成时间,这一次工作程序没有死亡)。
在寻找检查点时,我在日志中注意到一个令人担忧的迹象:
2019-09-17 08:49:02.155 CEST,,,35656,,5d382555.8b48,10873,,2019-07-24 11:31:01 CEST,,0,LOG,00000,"checkpoint starting: time",,,,,,,,,""
2019-09-17 08:56:32.086 CEST,,,35656,,5d382555.8b48,10874,,2019-07-24 11:31:01 CEST,,0,LOG,00000,"checkpoint complete: wrote 5671 buffers (0.1%); 0 WAL file(s) added, 0 removed, 1 recycled; write=449.927 s, sync=0.000 s, total=449.930 s; sync files=138, longest=0.000 s, average=0.000 s; distance=19690 kB, estimate=398335 kB",,,,,,,,,""
2019-09-17 09:04:02.186 CEST,,,35656,,5d382555.8b48,10875,,2019-07-24 11:31:01 CEST,,0,LOG,00000,"checkpoint starting: time",,,,,,,,,""
2019-09-17 09:04:51.376 CEST,,,35656,,5d382555.8b48,10876,,2019-07-24 11:31:01 CEST,,0,LOG,00000,"checkpoint complete: wrote 490 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=49.187 s, sync=0.000 s, total=49.190 s; sync files=41, longest=0.000 s, average=0.000 s; distance=3165 kB, estimate=358818 kB",,,,,,,,,""
session_start_time 列似乎指的是 7 月 24 日的日期。巧合的是,这正是我在主服务器上开始复制的时间...