Setup:
- OS: CentOS 7 with the latest versions of Corosync, Pacemaker and pcs
- Two-node active/active cluster with a virtual IP
- Exim is running on both nodes to deliver mail (SMTP), nothing special in its configuration
- When Exim fails on one of the nodes, that node should no longer answer on the virtual IP until Exim is up and running again
How I tried to make this work (see the sketch below):
- A cloned ocf:heartbeat:IPaddr2 resource for the virtual IP
- A cloned systemd:exim resource to monitor Exim, using the on-fail="standby" option
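For reference, a rough sketch of how the resources were created (the IP address and netmask are placeholders; the globally-unique clone options correspond to the "(unique)" clone sets visible in the pcs status output further down):

# IP and netmask below are placeholders, not the real values
pcs resource create virtual_ip ocf:heartbeat:IPaddr2 ip=192.168.0.100 cidr_netmask=24 op monitor interval=30s
pcs resource clone virtual_ip clone-max=2 clone-node-max=2 globally-unique=true
# 30s matches the exim:0_monitor_30000 operation seen in the logs
pcs resource create exim systemd:exim op monitor interval=30s on-fail=standby
pcs resource clone exim clone-max=2 clone-node-max=2 globally-unique=true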
The problem: initially everything runs as expected. When one of the nodes can no longer run Exim, Exim is correctly stopped there and that node stops taking part in the virtual IP. The trouble is that after stopping and starting one of the nodes, Exim comes back up (as it should), but the monitor operation returns "not running". With the exim resource configured without on-fail="standby", everything behaves as expected and I can start/stop Exim and either of the nodes as much as I like.
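To be precise, by stopping and starting a node I mean restarting the cluster stack on it, roughly:

pcs cluster stop testvm102
pcs cluster start testvm102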
Messages from the log:
Jan 28 16:17:30 testvm101 crmd[14183]: notice: process_lrm_event: LRM operation exim:0_monitor_30000 (call=141, rc=7, cib-update=211, confirmed=false) not running
Jan 28 16:17:30 testvm101 crmd[14183]: warning: status_from_rc: Action 20 (exim:0_monitor_30000) on testvm101 failed (target: 0 vs. rc: 7): Error
Jan 28 16:17:30 testvm101 crmd[14183]: warning: update_failcount: Updating failcount for exim:0 on testvm101 after failed monitor: rc=7 (update=value++, time=1422458250)
Output of pcs status:
[root@testvm101 ~]# pcs status
Cluster name: smtp_cluster
Last updated: Wed Jan 28 16:31:44 2015
Last change: Wed Jan 28 16:17:13 2015 via cibadmin on testvm101
Stack: corosync
Current DC: testvm101 (1) - partition with quorum
Version: 1.1.10-32.el7_0.1-368c726
2 Nodes configured
4 Resources configured
Node testvm101 (1): standby (on-fail)
Online: [ testvm102 ]
Full list of resources:
Clone Set: virtual_ip-clone [virtual_ip] (unique)
virtual_ip:0 (ocf::heartbeat:IPaddr2): Started testvm102
virtual_ip:1 (ocf::heartbeat:IPaddr2): Started testvm102
Clone Set: exim-clone [exim] (unique)
exim:0 (systemd:exim): Started testvm102
exim:1 (systemd:exim): Started testvm102
Failed actions:
exim:0_monitor_30000 on testvm101 'not running' (7): call=141, status=complete, last-rc-change='Wed Jan 28 16:17:30 2015', queued=6ms, exec=15002ms
As far as I can tell, at the moment these messages appear Exim is running and healthy according to systemd. I have already tried specifying the start-delay option on the monitor operation, hoping it would make a difference (it did not).
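For completeness, this is the kind of change I tried for the start-delay (10s is just an example value; on-fail is repeated since pcs may rewrite the whole monitor op):

pcs resource update exim op monitor interval=30s on-fail=standby start-delay=10s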
When I run pcs resource cleanup exim-clone, the failcount is cleared and everything works until the monitor operation fires for the first time, at which point the node marked standby simply trades places with the other one...
Example: status after the Exim monitor has failed on node testvm102:
[root@testvm101 ~]# pcs status
...
Node testvm102 (2): standby (on-fail)
Online: [ testvm101 ]
Full list of resources:
Clone Set: virtual_ip-clone [virtual_ip] (unique)
virtual_ip:0 (ocf::heartbeat:IPaddr2): Started testvm101
virtual_ip:1 (ocf::heartbeat:IPaddr2): Started testvm101
Clone Set: exim-clone [exim] (unique)
exim:0 (systemd:exim): Started testvm101
exim:1 (systemd:exim): Started testvm101
Failed actions:
exim:0_monitor_30000 on testvm102 'not running' (7): call=150, status=complete, last-rc-change='Wed Jan 28 16:33:59 2015', queued=5ms, exec=15004ms
I run a resource cleanup on the exim clone to reset the failcount:
[root@testvm101 ~]# pcs resource cleanup exim-clone
Resource: exim-clone successfully cleaned up
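(As a sanity check, the failcount itself can also be inspected directly, e.g.:

[root@testvm101 ~]# pcs resource failcount show exim

)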
After a short while the status looks good (and everything really does work fine):
[root@testvm101 ~]# pcs status
...
Online: [ testvm101 testvm102 ]
Full list of resources:
Clone Set: virtual_ip-clone [virtual_ip] (unique)
virtual_ip:0 (ocf::heartbeat:IPaddr2): Started testvm101
virtual_ip:1 (ocf::heartbeat:IPaddr2): Started testvm102
Clone Set: exim-clone [exim] (unique)
exim:0 (systemd:exim): Started testvm101
exim:1 (systemd:exim): Started testvm102
The next time the monitor operation runs, the check fails on the other node:
[root@testvm101 ~]# pcs status
...
Node testvm101 (1): standby (on-fail)
Online: [ testvm102 ]
Full list of resources:
Clone Set: virtual_ip-clone [virtual_ip] (unique)
virtual_ip:0 (ocf::heartbeat:IPaddr2): Started testvm102
virtual_ip:1 (ocf::heartbeat:IPaddr2): Started testvm102
Clone Set: exim-clone [exim] (unique)
exim:0 (systemd:exim): Started testvm102
exim:1 (systemd:exim): Started testvm102
Failed actions:
exim:0_monitor_30000 on testvm101 'not running' (7): call=176, status=complete, last-rc-change='Wed Jan 28 16:37:10 2015', queued=0ms, exec=0ms
Maybe I am overlooking something?
Any help is appreciated.