I had to upgrade a cluster from Ubuntu 16.04. It worked fine on 18.04 and 20.04, but now on 22.04 it will not fail over the DRBD devices. Putting the resources into maintenance mode and doing a manual drbdadm secondary/primary switchover works instantly and without problems. However, when putting a node into standby, the resource fails and gets fenced.
This happens on Ubuntu 22.04.2 LTS with Pacemaker 2.1.2 and Corosync 3.1.16. The DRBD kernel module is version 8.4.11, drbd-utils is 9.15.0. The DRBD and Corosync configuration files contain nothing of interest. I use crmsh to manage the cluster.
I was able to reduce the situation to a two-node setup with the following relevant settings:
node 103: server103
node 104: server104
primitive res_2-1_drbd_OC ocf:linbit:drbd \
params drbd_resource=OC \
op monitor interval=29s role=Master \
op monitor interval=31s role=Slave \
op start timeout=240s interval=0 \
op promote timeout=90s interval=0 \
op demote timeout=90s interval=0 \
op notify timeout=90s interval=0 \
op stop timeout=100s interval=0
ms ms_2-1_drbd_OC res_2-1_drbd_OC \
meta master-max=1 master-node-max=1 target-role=Master clone-max=2 clone-node-max=1 notify=true
location loc_ms_2-1_drbd_OC_server103 ms_2-1_drbd_OC 0: server103
location loc_ms_2-1_drbd_OC_server104pref ms_2-1_drbd_OC 1: server104
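The status snippets below are one-shot cluster status output; my assumption is they would be captured on this stack with something like:
crm_mon -1
(or equivalently: crm status)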
The cluster does reach a good state:
* Clone Set: ms_2-1_drbd_OC [res_2-1_drbd_OC] (promotable):
* Promoted: [ server104 ]
* Unpromoted: [ server103 ]
After that, I can manually issue the following sequence on server104 and server103 without any problem, and the cluster immediately returns to the previous state, producing the expected status messages (which have to be cleaned up):
resource maintenance ms_2-1_drbd_OC on
drbdadm secondary OC    (on server104)
drbdadm primary OC      (on server103)
resource maintenance ms_2-1_drbd_OC off
Failed Resource Actions:
* res_2-1_drbd_OC 31s-interval monitor on server103 returned 'promoted' at Tue Jun 20 17:40:36 2023 after 79ms
* res_2-1_drbd_OC 29s-interval monitor on server104 returned 'ok' at Tue Jun 20 17:40:36 2023 after 49ms
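These stale entries have to be cleared by hand; with crmsh that is, for example:
crm resource cleanup res_2-1_drbd_OC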
Changing the location constraints leads to an immediate switchover by the cluster. This works in both directions.
location loc_ms_2-1_drbd_OC_server103 ms_2-1_drbd_OC 10: server103
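One way to apply such a score change with crmsh (a sketch; crm configure edit works interactively as well):
crm configure delete loc_ms_2-1_drbd_OC_server103
crm configure location loc_ms_2-1_drbd_OC_server103 ms_2-1_drbd_OC 10: server103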
So far, so good — this is where I would expect everything to work. However, forcing a failover does not succeed. From the good state above, I issued node standby server104, with the following effects and some information from journalctl -fxb (OK-messages from pacemaker-controld omitted):
1.) Pacemaker tries to promote on server103:
* Clone Set: ms_2-1_drbd_OC [res_2-1_drbd_OC] (promotable):
* res_2-1_drbd_OC (ocf:linbit:drbd): Promoting server103
* Stopped: [ server104 server105 ]
journalctl on server104 (first second):
kernel: block drbd1: role( Primary -> Secondary )
kernel: block drbd1: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
kernel: drbd OC: peer( Secondary -> Unknown ) conn( Connected -> Disconnecting ) pdsk( UpToDate -> DUnknown )
kernel: drbd OC: ack_receiver terminated
kernel: drbd OC: Terminating drbd_a_OC
kernel: drbd OC: Connection closed
kernel: drbd OC: conn( Disconnecting -> StandAlone )
kernel: drbd OC: receiver terminated
kernel: drbd OC: Terminating drbd_r_OC
kernel: block drbd1: disk( UpToDate -> Failed )
kernel: block drbd1: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
kernel: block drbd1: disk( Failed -> Diskless )
kernel: drbd OC: Terminating drbd_w_OC
pacemaker-attrd[1336]: notice: Setting master-res_2-1_drbd_OC[server104]: 10000 -> (unset)
journalctl on server103 (first second, pacemaker-controld ok-messages omitted):
kernel: block drbd1: peer( Primary -> Secondary )
kernel: drbd OC: peer( Secondary -> Unknown ) conn( Connected -> TearDown ) pdsk( UpToDate -> DUnknown )
kernel: drbd OC: ack_receiver terminated
kernel: drbd OC: Terminating drbd_a_OC
kernel: drbd OC: Connection closed
kernel: drbd OC: conn( TearDown -> Unconnected )
kernel: drbd OC: receiver terminated
kernel: drbd OC: Restarting receiver thread
kernel: drbd OC: receiver (re)started
kernel: drbd OC: conn( Unconnected -> WFConnection )
pacemaker-attrd[1596]: notice: Setting master-res_2-1_drbd_OC[server104]: 10000 -> (unset)
crm-fence-peer.sh (...)
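The crm-fence-peer.sh call above comes from DRBD's fence-peer handler. For reference, the usual DRBD 8.4 stanza that wires up this handler looks roughly like the following (a generic example, not a verbatim copy of my config):
resource OC {
    disk {
        fencing resource-only;
    }
    handlers {
        fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
        after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
    }
}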
2.) After the 90-second promote timeout, Pacemaker fails the resource:
* Clone Set: ms_2-1_drbd_OC [res_2-1_drbd_OC] (promotable):
* res_2-1_drbd_OC (ocf:linbit:drbd): FAILED server103
* Stopped: [ server104 server105 ]
Failed Resource Actions:
* res_2-1_drbd_OC promote on server103 could not be executed (Timed Out) because 'Process did not exit within specified timeout'
journalctl on server104 (at second 90):
pacemaker-attrd[1336]: notice: Setting fail-count-res_2-1_drbd_OC#promote_0[server103]: (unset) -> 1
pacemaker-attrd[1336]: notice: Setting last-failure-res_2-1_drbd_OC#promote_0[server103]: (unset) -> 1687276647
pacemaker-attrd[1336]: notice: Setting master-res_2-1_drbd_OC[server103]: 1000 -> (unset)
pacemaker-attrd[1336]: notice: Setting master-res_2-1_drbd_OC[server103]: (unset) -> 10000
journalctl on server103 (seconds 90 to 93):
pacemaker-execd[1595]: warning: res_2-1_drbd_OC_promote_0[85862] timed out after 90000ms
pacemaker-controld[1598]: error: Result of promote operation for res_2-1_drbd_OC on server103: Timed Out after 1m30s (Process did not exit within specified timeout)
pacemaker-attrd[1596]: notice: Setting fail-count-res_2-1_drbd_OC#promote_0[server103]: (unset) -> 1
pacemaker-attrd[1596]: notice: Setting last-failure-res_2-1_drbd_OC#promote_0[server103]: (unset) -> 1687284510
crm-fence-peer.sh[85893]: INFO peer is not reachable, my disk is UpToDate: placed constraint 'drbd-fence-by-handler-OC-ms_2-1_drbd_OC'
kernel: drbd OC: helper command: /sbin/drbdadm fence-peer OC exit code 5 (0x500)
kernel: drbd OC: fence-peer helper returned 5 (peer is unreachable, assumed to be dead)
kernel: drbd OC: pdsk( DUnknown -> Outdated )
kernel: block drbd1: role( Secondary -> Primary )
kernel: block drbd1: new current UUID #1:2:3:4#
kernel: block drbd1: role( Primary -> Secondary )
kernel: block drbd1: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
kernel: drbd OC: conn( WFConnection -> Disconnecting )
kernel: drbd OC: Discarding network configuration.
kernel: drbd OC: Connection closed
kernel: drbd OC: conn( Disconnecting -> StandAlone )
kernel: drbd OC: receiver terminated
kernel: drbd OC: Terminating drbd_r_OC
kernel: block drbd1: disk( UpToDate -> Failed )
kernel: block drbd1: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
kernel: block drbd1: disk( Failed -> Diskless )
kernel: drbd OC: Terminating drbd_w_OC
pacemaker-attrd[1596]: notice: Setting master-res_2-1_drbd_OC[server103]: 1000 -> (unset)
pacemaker-controld[1598]: notice: Result of stop operation for res_2-1_drbd_OC on server103: ok
pacemaker-controld[1598]: notice: Requesting local execution of start operation for res_2-1_drbd_OC on server103
systemd-udevd[86992]: drbd1: Process '/usr/bin/unshare -m /usr/bin/snap auto-import --mount=/dev/drbd1' failed with exit code 1.
kernel: drbd OC: Starting worker thread (from drbdsetup-84 [87112])
kernel: block drbd1: disk( Diskless -> Attaching )
kernel: drbd OC: Method to ensure write ordering: flush
kernel: block drbd1: max BIO size = 1048576
kernel: block drbd1: drbd_bm_resize called with capacity == 3906131464
kernel: block drbd1: resync bitmap: bits=488266433 words=7629164 pages=14901
kernel: drbd1: detected capacity change from 0 to 3906131464
kernel: block drbd1: size = 1863 GB (1953065732 KB)
kernel: block drbd1: recounting of set bits took additional 3 jiffies
kernel: block drbd1: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
kernel: block drbd1: disk( Attaching -> UpToDate ) pdsk( DUnknown -> Outdated )
kernel: block drbd1: attached to UUIDs #1:2:3:4#
kernel: drbd OC: conn( StandAlone -> Unconnected )
kernel: drbd OC: Starting receiver thread (from drbd_w_OC [87114])
kernel: drbd OC: receiver (re)started
kernel: drbd OC: conn( Unconnected -> WFConnection )
3.) Surprisingly, Pacemaker recovers and does promote the resource:
journalctl on server104 (at second 113):
pacemaker-attrd[1336]: notice: Setting master-res_2-1_drbd_OC[server103]: (unset) -> 10000
journalctl on server103 (seconds 98 to 114):
drbd(res_2-1_drbd_OC)[87174]: INFO: OC: Called drbdsetup wait-connect /dev/drbd_OC --wfc-timeout=5 --degr-wfc-timeout=5 --outdated-wfc-timeout=5
drbd(res_2-1_drbd_OC)[87178]: INFO: OC: Exit code 5
drbd(res_2-1_drbd_OC)[87182]: INFO: OC: Command output:
drbd(res_2-1_drbd_OC)[87186]: INFO: OC: Command stderr:
drbd(res_2-1_drbd_OC)[87217]: INFO: OC: Called drbdsetup wait-connect /dev/drbd_NC --wfc-timeout=5 --degr-wfc-timeout=5 --outdated-wfc-timeout=5
drbd(res_2-1_drbd_OC)[87221]: INFO: OC: Exit code 5
drbd(res_2-1_drbd_OC)[87225]: INFO: OC: Command output:
drbd(res_2-1_drbd_OC)[87229]: INFO: OC: Command stderr:
pacemaker-attrd[1596]: notice: Setting master-res_2-1_drbd_OC[server103]: (unset) -> 10000
kernel: block drbd1: role( Secondary -> Primary )
kernel: block drbd3: role( Secondary -> Primary )
kernel: block drbd3: new current UUID #1:2:3:4#
In the end, the resource did fail over, but errors are left behind:
* Clone Set: ms_2-1_drbd_OC [res_2-1_drbd_OC] (promotable):
* Promoted: [ server103 ]
* Stopped: [ server104 server105 ]
Failed Resource Actions:
* res_2-1_drbd_OC promote on server103 could not be executed (Timed Out) because 'Process did not exit within specified timeout' at Tue Jun 20 20:07:00 2023 after 1m30.001s
Additionally, a location constraint restricting the promoted role to server103 was added to the configuration:
location drbd-fence-by-handler-OC-ms_2-1_drbd_OC ms_2-1_drbd_OC rule $role=Master -inf: #uname ne server103
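To get back to a clean state, both the failure record and the fencing constraint have to be removed by hand (normally crm-unfence-peer.sh drops the constraint after a successful resync), for example:
crm resource cleanup res_2-1_drbd_OC
crm configure delete drbd-fence-by-handler-OC-ms_2-1_drbd_OC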
Summary and things I have tried
The resource does fail over automatically, but it has to sit through a seemingly unnecessary timeout and leaves behind errors that must be cleaned up manually.
Reducing the demote timeout of the DRBD resource to 30 seconds makes the whole procedure fail without self-recovery. This makes no sense to me, since the manual switchover happens instantly. It looks as if the resource is not made secondary before the promote switches it to primary. Yet I seem to be the only one running into this strange behavior.
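For reference, that variant amounts to changing the operation definition of the primitive (e.g. via crm configure edit res_2-1_drbd_OC) to:
op demote timeout=30s interval=0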
I have gone through all the information I could find, including material on Heartbeat and on different versions of Corosync, Pacemaker and DRBD. There were considerable problems with network connectivity while upgrading the systems, and in jumping across three LTS releases I may well have missed a crucial point. Also, I am not deeply familiar with the HA stack.
I would be very grateful for a pointer in the right direction. Sorry for the long post! I hope you can skim it and pick out the relevant information.
Answer 1
Well, this is very, very strange! It also explains why nobody but me runs into this problem. Thanks to everyone who looked at it!
To answer my own question: it must have been some obscure network problem. As mentioned, my network setup is rather involved, with one big server (103) and two small ones (104 and 105 — the latter omitted above for simplicity). Each small server is connected to server103 by two back-to-back cables and communicates over a bond of the two in a balance-rr scheme. All of this works, and I cannot explain what exactly made the difference.
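For illustration, such a bond looks roughly like this in Netplan (a hypothetical sketch; interface names and the address are made up):
network:
  version: 2
  ethernets:
    ens1f0: {}
    ens1f1: {}
  bonds:
    bond0:
      interfaces: [ens1f0, ens1f1]
      parameters:
        mode: balance-rr
      addresses: [10.0.0.104/24]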
The only thing I did to fix it was to (re?)apply the Netplan configuration on server103 and reboot. That is very odd, because the configuration is applied on every boot anyway. But it did the trick, and failover now happens almost instantly. It was a complete shot in the dark, and pure luck that it hit the target. I had been looking for any kind of asymmetry and did somehow suspect a communication problem; across three LTS upgrades, the transition from ifupdown to Netplan had not been smooth.
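Concretely, that was nothing more than (as root):
netplan apply
reboot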
Afterwards I also changed the Netplan configuration files on server10{4,5}, in which each interface (under "ethernets") had been configured with "activation-mode: manual". I switched them to the empty configuration "{}". Failover is now perfectly smooth and survives reboots. After losing two full working days to this oddity, I am now taking great pleasure in putting nodes into and out of standby, purely for fun.
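The change amounts to the following, shown for a single interface (the name is made up):
# before, on server10{4,5}:
network:
  version: 2
  ethernets:
    ens1f0:
      activation-mode: manual
# after:
network:
  version: 2
  ethernets:
    ens1f0: {}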