Pacemaker cluster won't cleanly fail over DRBD resources (but doing it manually works)

I had to upgrade a cluster from Ubuntu 16.04. It worked fine on 18.04 and 20.04, but on 22.04 it no longer fails over the DRBD devices. Putting the resource into maintenance mode and doing a manual drbdadm secondary/primary works instantly and without problems. When putting a node into standby, however, the resource fails and gets fenced.

This happens on Ubuntu 22.04.2 LTS with Pacemaker 2.1.2 and Corosync 3.1.16. The DRBD kernel module is version 8.4.11 and drbd-utils is 9.15.0. The DRBD and Corosync configuration files contain nothing interesting. I use crmsh to manage the cluster.
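
For completeness, this is how I would verify the component versions mentioned above (a sketch; these are the standard tools shipped with the respective packages):

cat /proc/drbd          # kernel module version (8.4.11 here)
drbdadm --version       # drbd-utils version
corosync -v             # Corosync version
pacemakerd --version    # Pacemaker version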

I was able to reduce the situation to a two-node setup with the following relevant settings.

node 103: server103
node 104: server104
primitive res_2-1_drbd_OC ocf:linbit:drbd \
        params drbd_resource=OC \
        op monitor interval=29s role=Master \
        op monitor interval=31s role=Slave \
        op start timeout=240s interval=0 \
        op promote timeout=90s interval=0 \
        op demote timeout=90s interval=0 \
        op notify timeout=90s interval=0 \
        op stop timeout=100s interval=0
ms ms_2-1_drbd_OC res_2-1_drbd_OC \
        meta master-max=1 master-node-max=1 target-role=Master clone-max=2 clone-node-max=1 notify=true
location loc_ms_2-1_drbd_OC_server103 ms_2-1_drbd_OC 0: server103
location loc_ms_2-1_drbd_OC_server104pref ms_2-1_drbd_OC 1: server104
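
The snippet above can be loaded with crmsh; a sketch, assuming it is saved in a file named drbd_oc.crm (hypothetical name):

crm configure load update drbd_oc.crm   # merge the definitions into the CIB
crm configure show                      # verify the resulting configuration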

The cluster does reach a good state:

  * Clone Set: ms_2-1_drbd_OC [res_2-1_drbd_OC] (promotable):
    * Promoted: [ server104 ]
    * Unpromoted: [ server103 ]

After that, I can manually issue resource maintenance ms_2-1_drbd_OC on, then drbdadm secondary OC on server104 and drbdadm primary OC on server103 without any problem, and after resource maintenance ms_2-1_drbd_OC off it instantly reverts to the former state, producing the expected status messages (which have to be cleaned up):

Failed Resource Actions:
  * res_2-1_drbd_OC 31s-interval monitor on server103 returned 'promoted' at Tue Jun 20 17:40:36 2023 after 79ms
  * res_2-1_drbd_OC 29s-interval monitor on server104 returned 'ok' at Tue Jun 20 17:40:36 2023 after 49ms
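
For reference, the manual sequence described above, as a sketch (crmsh resource sub-commands plus drbdadm, names as in the configuration above):

crm resource maintenance ms_2-1_drbd_OC on    # let me manage DRBD by hand
drbdadm secondary OC                          # on server104: step down
drbdadm primary OC                            # on server103: take over
crm resource maintenance ms_2-1_drbd_OC off   # hand control back to Pacemaker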

Changing the location constraint makes the cluster switch over immediately. This works in both directions.

location loc_ms_2-1_drbd_OC_server103 ms_2-1_drbd_OC 10: server103
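
One way to apply such a score change with crmsh, as a sketch (an interactive alternative is crm configure edit loc_ms_2-1_drbd_OC_server103):

crm configure delete loc_ms_2-1_drbd_OC_server103
crm configure location loc_ms_2-1_drbd_OC_server103 ms_2-1_drbd_OC 10: server103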

So far so good, and I would expect everything to just work. Forcing a failover, however, does not succeed. Starting from the good state above, I issued node standby server104, with the following effects and some output from journalctl -fxb (pacemaker-controld OK messages omitted):

1.) Pacemaker tries to promote on server103:

  * Clone Set: ms_2-1_drbd_OC [res_2-1_drbd_OC] (promotable):
    * res_2-1_drbd_OC   (ocf:linbit:drbd):       Promoting server103
    * Stopped: [ server104 server105 ]

journalctl on server104 (first second):

kernel: block drbd1: role( Primary -> Secondary ) 
kernel: block drbd1: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
kernel: drbd OC: peer( Secondary -> Unknown ) conn( Connected -> Disconnecting ) pdsk( UpToDate -> DUnknown ) 
kernel: drbd OC: ack_receiver terminated
kernel: drbd OC: Terminating drbd_a_OC
kernel: drbd OC: Connection closed
kernel: drbd OC: conn( Disconnecting -> StandAlone ) 
kernel: drbd OC: receiver terminated
kernel: drbd OC: Terminating drbd_r_OC
kernel: block drbd1: disk( UpToDate -> Failed ) 
kernel: block drbd1: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
kernel: block drbd1: disk( Failed -> Diskless ) 
kernel: drbd OC: Terminating drbd_w_OC
pacemaker-attrd[1336]:  notice: Setting master-res_2-1_drbd_OC[server104]: 10000 -> (unset)

journalctl on server103 (first second, pacemaker-controld OK messages omitted):

kernel: block drbd1: peer( Primary -> Secondary ) 
kernel: drbd OC: peer( Secondary -> Unknown ) conn( Connected -> TearDown ) pdsk( UpToDate -> DUnknown )
kernel: drbd OC: ack_receiver terminated
kernel: drbd OC: Terminating drbd_a_OC
kernel: drbd OC: Connection closed
kernel: drbd OC: conn( TearDown -> Unconnected ) 
kernel: drbd OC: receiver terminated
kernel: drbd OC: Restarting receiver thread
kernel: drbd OC: receiver (re)started
kernel: drbd OC: conn( Unconnected -> WFConnection ) 
pacemaker-attrd[1596]:  notice: Setting master-res_2-1_drbd_OC[server104]: 10000 -> (unset)
crm-fence-peer.sh (...)

2.) After the 90-second promote timeout, Pacemaker fails the resource:

  * Clone Set: ms_2-1_drbd_OC [res_2-1_drbd_OC] (promotable):
    * res_2-1_drbd_OC   (ocf:linbit:drbd):       FAILED server103
    * Stopped: [ server104 server105 ]
Failed Resource Actions:
  * res_2-1_drbd_OC promote on server103 could not be executed (Timed Out) because 'Process did not exit within specified timeout'

journalctl on server104 (at second 90):

pacemaker-attrd[1336]:  notice: Setting fail-count-res_2-1_drbd_OC#promote_0[server103]: (unset) -> 1
pacemaker-attrd[1336]:  notice: Setting last-failure-res_2-1_drbd_OC#promote_0[server103]: (unset) -> 1687276647
pacemaker-attrd[1336]:  notice: Setting master-res_2-1_drbd_OC[server103]: 1000 -> (unset)
pacemaker-attrd[1336]:  notice: Setting master-res_2-1_drbd_OC[server103]: (unset) -> 10000

journalctl on server103 (seconds 90 to 93):

pacemaker-execd[1595]:  warning: res_2-1_drbd_OC_promote_0[85862] timed out after 90000ms
pacemaker-controld[1598]:  error: Result of promote operation for res_2-1_drbd_OC on server103: Timed Out after 1m30s (Process did not exit within specified timeout)
pacemaker-attrd[1596]:  notice: Setting fail-count-res_2-1_drbd_OC#promote_0[server103]: (unset) -> 1
pacemaker-attrd[1596]:  notice: Setting last-failure-res_2-1_drbd_OC#promote_0[server103]: (unset) -> 1687284510
crm-fence-peer.sh[85893]: INFO peer is not reachable, my disk is UpToDate: placed constraint 'drbd-fence-by-handler-OC-ms_2-1_drbd_OC'
kernel: drbd OC: helper command: /sbin/drbdadm fence-peer OC exit code 5 (0x500)
kernel: drbd OC: fence-peer helper returned 5 (peer is unreachable, assumed to be dead)
kernel: drbd OC: pdsk( DUnknown -> Outdated ) 
kernel: block drbd1: role( Secondary -> Primary ) 
kernel: block drbd1: new current UUID #1:2:3:4#
kernel: block drbd1: role( Primary -> Secondary ) 
kernel: block drbd1: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
kernel: drbd OC: conn( WFConnection -> Disconnecting ) 
kernel: drbd OC: Discarding network configuration.
kernel: drbd OC: Connection closed
kernel: drbd OC: conn( Disconnecting -> StandAlone ) 
kernel: drbd OC: receiver terminated
kernel: drbd OC: Terminating drbd_r_OC
kernel: block drbd1: disk( UpToDate -> Failed ) 
kernel: block drbd1: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
kernel: block drbd1: disk( Failed -> Diskless ) 
kernel: drbd OC: Terminating drbd_w_OC
pacemaker-attrd[1596]:  notice: Setting master-res_2-1_drbd_OC[server103]: 1000 -> (unset)
pacemaker-controld[1598]:  notice: Result of stop operation for res_2-1_drbd_OC on server103: ok
pacemaker-controld[1598]:  notice: Requesting local execution of start operation for res_2-1_drbd_OC on server103
systemd-udevd[86992]: drbd1: Process '/usr/bin/unshare -m /usr/bin/snap auto-import --mount=/dev/drbd1' failed with exit code 1.
kernel: drbd OC: Starting worker thread (from drbdsetup-84 [87112])
kernel: block drbd1: disk( Diskless -> Attaching ) 
kernel: drbd OC: Method to ensure write ordering: flush
kernel: block drbd1: max BIO size = 1048576
kernel: block drbd1: drbd_bm_resize called with capacity == 3906131464
kernel: block drbd1: resync bitmap: bits=488266433 words=7629164 pages=14901
kernel: drbd1: detected capacity change from 0 to 3906131464
kernel: block drbd1: size = 1863 GB (1953065732 KB)
kernel: block drbd1: recounting of set bits took additional 3 jiffies
kernel: block drbd1: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
kernel: block drbd1: disk( Attaching -> UpToDate ) pdsk( DUnknown -> Outdated ) 
kernel: block drbd1: attached to UUIDs #1:2:3:4#
kernel: drbd OC: conn( StandAlone -> Unconnected ) 
kernel: drbd OC: Starting receiver thread (from drbd_w_OC [87114])
kernel: drbd OC: receiver (re)started
kernel: drbd OC: conn( Unconnected -> WFConnection ) 

3.) Surprisingly, Pacemaker recovers and does promote the resource:

journalctl on server104 (at second 113):

pacemaker-attrd[1336]:  notice: Setting master-res_2-1_drbd_OC[server103]: (unset) -> 10000

journalctl on server103 (seconds 98 to 114):

drbd(res_2-1_drbd_OC)[87174]: INFO: OC: Called drbdsetup wait-connect /dev/drbd_OC --wfc-timeout=5 --degr-wfc-timeout=5 --outdated-wfc-timeout=5
drbd(res_2-1_drbd_OC)[87178]: INFO: OC: Exit code 5
drbd(res_2-1_drbd_OC)[87182]: INFO: OC: Command output:
drbd(res_2-1_drbd_OC)[87186]: INFO: OC: Command stderr:
drbd(res_2-1_drbd_OC)[87217]: INFO: OC: Called drbdsetup wait-connect /dev/drbd_NC --wfc-timeout=5 --degr-wfc-timeout=5 --outdated-wfc-timeout=5
drbd(res_2-1_drbd_OC)[87221]: INFO: OC: Exit code 5
drbd(res_2-1_drbd_OC)[87225]: INFO: OC: Command output:
drbd(res_2-1_drbd_OC)[87229]: INFO: OC: Command stderr:
pacemaker-attrd[1596]:  notice: Setting master-res_2-1_drbd_OC[server103]: (unset) -> 10000
kernel: block drbd1: role( Secondary -> Primary ) 
kernel: block drbd3: role( Secondary -> Primary ) 
kernel: block drbd3: new current UUID #1:2:3:4#

In the end the resource does fail over, but it leaves an error behind:

  * Clone Set: ms_2-1_drbd_OC [res_2-1_drbd_OC] (promotable):
    * Promoted: [ server103 ]
    * Stopped: [ server104 server105 ]
Failed Resource Actions:
  * res_2-1_drbd_OC promote on server103 could not be executed (Timed Out) because 'Process did not exit within specified timeout' at Tue Jun 20 20:07:00 2023 after 1m30.001s

In addition, a location constraint targeting server103 was added to the configuration:

location drbd-fence-by-handler-OC-ms_2-1_drbd_OC ms_2-1_drbd_OC rule $role=Master -inf: #uname ne server103
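
For context, this constraint is placed by DRBD's fence-peer handler (crm-fence-peer.sh, visible in the logs above). The usual DRBD 8.4 setup that wires this handler up looks roughly like the following sketch; this is the generic form, not a copy of my actual resource file:

resource OC {
  disk {
    fencing resource-only;
  }
  handlers {
    fence-peer          "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }
}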

Summary and things I have tried

The resource does fail over automatically, but only after waiting out a seemingly unnecessary timeout, and it leaves behind errors that have to be cleaned up manually.
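
The manual cleanup after such a failover boils down to something like this sketch:

crm resource cleanup ms_2-1_drbd_OC                            # clear the failed promote action
crm configure delete drbd-fence-by-handler-OC-ms_2-1_drbd_OC   # drop the fence constraint
                                                               # (normally removed by crm-unfence-peer.sh after resync)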

Reducing the demote timeout of the DRBD resource to 30 seconds makes the whole process fail without recovering by itself. This makes no sense to me, since a manual switchover happens instantly. It looks as if the resource is not made secondary before the promote command tries to switch it to primary. Yet I seem to be the only one running into this strange behaviour.
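
The timeout reduction I tried corresponds to changing the demote op of the primitive, e.g. via crm configure edit res_2-1_drbd_OC, so that the line reads (sketch):

op demote timeout=30s interval=0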

I have gone through all the information I could find, including material on Heartbeat and on different versions of Corosync, Pacemaker and DRBD. There were considerable problems with the network connections while upgrading the system, and in jumping across three LTS releases I may well have missed a crucial point. I am also not deeply familiar with HA technology.

I would be very grateful for a pointer in the right direction. Sorry for the long post! I hope you can skim it and pick out the relevant information.

Answer 1

Well, this is very, very strange! It also explains why I am the only one with this problem. Thanks to everyone who looked into it!

To answer my own question: it must have been some obscure networking problem. As mentioned, my network setup is rather involved, with one big server (103) and two small ones (104 and 105, the latter omitted above for simplicity). Each small server is connected to server103 by two back-to-back cables, and communication runs over a bond combining the two in a balanced round-robin (balance-rr) scheme. All of this works, and I cannot explain what exactly made the difference.

The only thing that solved the problem was (re?)applying the Netplan configuration on server103 and rebooting. This is very strange, because that should effectively happen on every boot anyway. Nevertheless it did the trick, and the failover now happens almost instantly. This was a complete shot in the dark, and hitting the target was pure luck. I had been looking for any kind of asymmetry and did somehow suspect a communication problem; the transition from ifupdown to Netplan across three LTS upgrades had not been smooth.
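
What finally helped on server103, sketched as commands:

netplan generate    # regenerate the backend configuration from /etc/netplan
netplan apply       # apply it to the running system
reboot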

Afterwards I also changed the Netplan configuration files on server10{4,5}, where each interface (under "ethernets") had been configured with "activation-mode: manual". I switched them to the empty configuration "{}". The failover is now completely smooth and still works after reboots. After wasting two full working days on this weird problem, I am now happily putting nodes into and out of standby just for fun.
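
The change on server10{4,5}, sketched with a hypothetical interface name (enp1s0); only the relevant part of the Netplan YAML is shown:

# before
network:
  version: 2
  ethernets:
    enp1s0:
      activation-mode: manual

# after
network:
  version: 2
  ethernets:
    enp1s0: {}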
