Setup:
- OS: CentOS 7 with the latest versions of Corosync, Pacemaker and pcs
- Two-node active/active cluster with a virtual IP
- Exim is running on both nodes to deliver mail (SMTP), nothing special in its configuration
- When Exim fails on one of the nodes, that node should no longer answer on the virtual IP until Exim is up and running again
How I tried to make this work (see the sketch below):
- A cloned ocf:heartbeat:IPaddr2 resource for the virtual IP
- A cloned systemd:exim resource to monitor Exim, using the on-fail="standby" option
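For reference, a rough sketch of how the resources were created (the IP address and netmask are placeholders; the globally-unique clone options correspond to the "(unique)" clone sets visible in the pcs status output further down):

# IP and netmask below are placeholders, not the real values
pcs resource create virtual_ip ocf:heartbeat:IPaddr2 ip=192.168.0.100 cidr_netmask=24 op monitor interval=30s
pcs resource clone virtual_ip clone-max=2 clone-node-max=2 globally-unique=true
# 30s matches the exim:0_monitor_30000 operation seen in the logs
pcs resource create exim systemd:exim op monitor interval=30s on-fail=standby
pcs resource clone exim clone-max=2 clone-node-max=2 globally-unique=true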
The problem: initially everything runs as expected. When one of the nodes can no longer run Exim, Exim is correctly stopped there and that node stops taking part in the virtual IP. The trouble is that after stopping and starting one of the nodes, Exim comes back up (as it should), but the monitor operation returns "not running". With the exim resource configured without on-fail="standby", everything behaves as expected and I can start/stop Exim and either of the nodes as much as I like.
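To be precise, by stopping and starting a node I mean restarting the cluster stack on it, roughly:

pcs cluster stop testvm102
pcs cluster start testvm102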
Messages from the log:
Jan 28 16:17:30 testvm101 crmd[14183]: notice: process_lrm_event: LRM operation exim:0_monitor_30000 (call=141, rc=7, cib-update=211, confirmed=false) not running
Jan 28 16:17:30 testvm101 crmd[14183]: warning: status_from_rc: Action 20 (exim:0_monitor_30000) on testvm101 failed (target: 0 vs. rc: 7): Error
Jan 28 16:17:30 testvm101 crmd[14183]: warning: update_failcount: Updating failcount for exim:0 on testvm101 after failed monitor: rc=7 (update=value++, time=1422458250)
Output of pcs status:
[root@testvm101 ~]# pcs status
Cluster name: smtp_cluster
Last updated: Wed Jan 28 16:31:44 2015
Last change: Wed Jan 28 16:17:13 2015 via cibadmin on testvm101
Stack: corosync
Current DC: testvm101 (1) - partition with quorum
Version: 1.1.10-32.el7_0.1-368c726
2 Nodes configured
4 Resources configured
Node testvm101 (1): standby (on-fail)
Online: [ testvm102 ]
Full list of resources:
Clone Set: virtual_ip-clone [virtual_ip] (unique)
virtual_ip:0 (ocf::heartbeat:IPaddr2): Started testvm102
virtual_ip:1 (ocf::heartbeat:IPaddr2): Started testvm102
Clone Set: exim-clone [exim] (unique)
exim:0 (systemd:exim): Started testvm102
exim:1 (systemd:exim): Started testvm102
Failed actions:
exim:0_monitor_30000 on testvm101 'not running' (7): call=141, status=complete, last-rc-change='Wed Jan 28 16:17:30 2015', queued=6ms, exec=15002ms
As far as I can tell, at the moment these messages appear Exim is running and healthy according to systemd. I have already tried specifying the start-delay option on the monitor operation, hoping it would make a difference (it did not).
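For completeness, this is the kind of change I tried for the start-delay (10s is just an example value; on-fail is repeated since pcs may rewrite the whole monitor op):

pcs resource update exim op monitor interval=30s on-fail=standby start-delay=10s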
When I run pcs resource cleanup exim-clone, the failcount is cleared and everything works until the monitor operation fires for the first time, at which point the node marked standby simply trades places with the other one...
Example: status after the Exim monitor has failed on node testvm102:
[root@testvm101 ~]# pcs status
...
Node testvm102 (2): standby (on-fail)
Online: [ testvm101 ]
Full list of resources:
Clone Set: virtual_ip-clone [virtual_ip] (unique)
virtual_ip:0 (ocf::heartbeat:IPaddr2): Started testvm101
virtual_ip:1 (ocf::heartbeat:IPaddr2): Started testvm101
Clone Set: exim-clone [exim] (unique)
exim:0 (systemd:exim): Started testvm101
exim:1 (systemd:exim): Started testvm101
Failed actions:
exim:0_monitor_30000 on testvm102 'not running' (7): call=150, status=complete, last-rc-change='Wed Jan 28 16:33:59 2015', queued=5ms, exec=15004ms
I run a resource cleanup on the exim clone to reset the failcount:
[root@testvm101 ~]# pcs resource cleanup exim-clone
Resource: exim-clone successfully cleaned up
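(As a sanity check, the failcount itself can also be inspected directly, e.g.:

[root@testvm101 ~]# pcs resource failcount show exim

)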
After a short while the status looks good (and everything really does work fine):
[root@testvm101 ~]# pcs status
...
Online: [ testvm101 testvm102 ]
Full list of resources:
Clone Set: virtual_ip-clone [virtual_ip] (unique)
virtual_ip:0 (ocf::heartbeat:IPaddr2): Started testvm101
virtual_ip:1 (ocf::heartbeat:IPaddr2): Started testvm102
Clone Set: exim-clone [exim] (unique)
exim:0 (systemd:exim): Started testvm101
exim:1 (systemd:exim): Started testvm102
The next time the monitor operation runs, the check fails on the other node:
[root@testvm101 ~]# pcs status
...
Node testvm101 (1): standby (on-fail)
Online: [ testvm102 ]
Full list of resources:
Clone Set: virtual_ip-clone [virtual_ip] (unique)
virtual_ip:0 (ocf::heartbeat:IPaddr2): Started testvm102
virtual_ip:1 (ocf::heartbeat:IPaddr2): Started testvm102
Clone Set: exim-clone [exim] (unique)
exim:0 (systemd:exim): Started testvm102
exim:1 (systemd:exim): Started testvm102
Failed actions:
exim:0_monitor_30000 on testvm101 'not running' (7): call=176, status=complete, last-rc-change='Wed Jan 28 16:37:10 2015', queued=0ms, exec=0ms
Maybe I am overlooking something?
Any help is appreciated.