我在 CentOS 上安装了 2 台 Asterisk 服务器,它们位于带有 peacemaker 的 HA 组中。集群处于主动被动模式,并配置了资源组。问题是,从某个时间开始,服务器从主服务器切换到辅助服务器,而我却没有明显的原因。日志以此开头:
PBX1 attrd[2503]: notice: attrd_perform_update: Sent update 327: pingd=0
随后停止资源组并
Oct 3 19:30:39 PBX1 Filesystem(DrbdFS)[14168]: INFO: Trying to unmount /drbd
Oct 3 19:30:39 PBX1 Filesystem(DrbdFS)[14168]: INFO: unmounted /drbd successfully
Oct 3 19:30:39 PBX1 crmd[2505]: notice: process_lrm_event: LRM operation DrbdFS_stop_0 (call=2046, rc=0, cib-update=2735, confirmed=true) ok
Oct 3 19:30:39 PBX1 crmd[2505]: notice: te_rsc_command: Initiating action 32: demote drbd_drbd0_demote_0 on pbx1(local)
Oct 3 19:30:39 PBX1 kernel: block drbd0: role( Primary -> Secondary )
Oct 3 19:30:39 PBX1 kernel: block drbd0: bitmap WRITE of 0 pages took 0 jiffies
Oct 3 19:30:39 PBX1 kernel: block drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
Oct 3 19:30:39 PBX1 crmd[2505]: notice: process_lrm_event: LRM operation drbd_drbd0_demote_0 (call=2050, rc=0, cib-update=2736, confirmed=true) ok
Oct 3 19:30:39 PBX1 crmd[2505]: notice: te_rsc_command: Initiating action 159: notify drbd_drbd0_post_notify_demote_0 on pbx2
Oct 3 19:30:39 PBX1 crmd[2505]: notice: te_rsc_command: Initiating action 161: notify drbd_drbd0_post_notify_demote_0 on pbx1 (local)
Oct 3 19:30:39 PBX1 crmd[2505]: notice: process_lrm_event: LRM operation drbd_drbd0_notify_0 (call=2053, rc=0, cib-update=0, confirmed=true) ok
Oct 3 19:30:39 PBX1 crmd[2505]: notice: te_rsc_command: Initiating action 156: notify drbd_drbd0_pre_notify_stop_0 on pbx2
Oct 3 19:30:39 PBX1 crmd[2505]: notice: te_rsc_command: Initiating action 157: notify drbd_drbd0_pre_notify_stop_0 on pbx1 (local)
Oct 3 19:30:39 PBX1 crmd[2505]: notice: process_lrm_event: LRM operation drbd_drbd0_notify_0 (call=2056, rc=0, cib-update=0, confirmed=true) ok
Oct 3 19:30:39 PBX1 crmd[2505]: notice: te_rsc_command: Initiating action 31: stop drbd_drbd0_stop_0 on pbx2
Oct 3 19:30:39 PBX1 crmd[2505]: notice: te_rsc_command: Initiating action 33: stop drbd_drbd0_stop_0 on pbx1 (local)
Oct 3 19:30:39 PBX1 kernel: block drbd0: peer( Secondary -> Unknown ) conn( Connected -> Disconnecting ) pdsk( UpToDate -> DUnknown )
Oct 3 19:30:39 PBX1 kernel: block drbd0: asender terminated
Oct 3 19:30:39 PBX1 kernel: block drbd0: Terminating drbd0_asender
Oct 3 19:30:39 PBX1 kernel: block drbd0: Connection closed
Oct 3 19:30:39 PBX1 kernel: block drbd0: conn( Disconnecting -> StandAlone )
Oct 3 19:30:39 PBX1 kernel: block drbd0: receiver terminated
Oct 3 19:30:39 PBX1 kernel: block drbd0: Terminating drbd0_receiver
Oct 3 19:30:39 PBX1 kernel: block drbd0: disk( UpToDate -> Failed )
Oct 3 19:30:39 PBX1 kernel: block drbd0: bitmap WRITE of 0 pages took 0 jiffies
Oct 3 19:30:39 PBX1 kernel: block drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
Oct 3 19:30:39 PBX1 kernel: block drbd0: disk( Failed -> Diskless )
Oct 3 19:30:39 PBX1 kernel: block drbd0: drbd_bm_resize called with capacity == 0
Oct 3 19:30:39 PBX1 kernel: block drbd0: worker terminated
Oct 3 19:30:39 PBX1 kernel: block drbd0: Terminating drbd0_worker
Oct 3 19:30:39 PBX1 attrd[2503]: notice: attrd_trigger_update: Sending flush op to all hosts for: master-drbd_drbd0 (<null>)
Oct 3 19:30:39 PBX1 attrd[2503]: notice: attrd_perform_update: Sent delete 329: node=pbx1, attr=master-drbd_drbd0, id=<n/a>, set=(null), section=status
Oct 3 19:30:39 PBX1 crmd[2505]: notice: process_lrm_event: LRM operation drbd_drbd0_stop_0 (call=2059, rc=0, cib-update=2737, confirmed=true) ok
Oct 3 19:30:39 PBX1 crmd[2505]: notice: run_graph: Transition 2112 (Complete=63, Pending=0, Fired=0, Skipped=1, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-3294.bz2): Stopped
Oct 3 19:30:39 PBX1 pengine[2504]: notice: unpack_config: On loss of CCM Quorum: Ignore
Oct 3 19:30:39 PBX1 pengine[2504]: notice: process_pe_message: Calculated Transition 2113: /var/lib/pacemaker/pengine/pe-input-3295.bz2
Oct 3 19:30:39 PBX1 crmd[2505]: notice: run_graph: Transition 2113 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-3295.bz2): Complete
Oct 3 19:30:39 PBX1 crmd[2505]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Oct 3 19:30:41 PBX1 attrd[2503]: notice: attrd_perform_update: Sent update 333: pingd=1000
Oct 3 19:30:41 PBX1 crmd[2505]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Oct 3 19:30:41 PBX1 pengine[2504]: notice: unpack_config: On loss of CCM Quorum: Ignore
Oct 3 19:30:41 PBX1 pengine[2504]: notice: LogActions: Start drbd_drbd0:0#011(pbx2)
Oct 3 19:30:41 PBX1 pengine[2504]: notice: process_pe_message: Calculated Transition 2114: /var/lib/pacemaker/pengine/pe-input-3296.bz2
Oct 3 19:30:41 PBX1 pengine[2504]: notice: unpack_config: On loss of CCM Quorum: Ignore
Oct 3 19:30:41 PBX1 pengine[2504]: notice: LogActions: Start drbd_drbd0:0#011(pbx1)
Oct 3 19:30:41 PBX1 pengine[2504]: notice: LogActions: Start drbd_drbd0:1#011(pbx2)
Oct 3 19:30:41 PBX1 pengine[2504]: notice: process_pe_message: Calculated Transition 2115: /var/lib/pacemaker/pengine/pe-input-3297.bz2
Oct 3 19:30:41 PBX1 crmd[2505]: notice: te_rsc_command: Initiating action 7: start drbd_drbd0_start_0 on pbx1 (local)
Oct 3 19:30:41 PBX1 crmd[2505]: notice: te_rsc_command: Initiating action 9: start drbd_drbd0:1_start_0 on pbx2
Oct 3 19:30:41 PBX1 kernel: block drbd0: Starting worker thread (from cqueue [2046])
Oct 3 19:30:41 PBX1 kernel: block drbd0: disk( Diskless -> Attaching )
Oct 3 19:30:41 PBX1 kernel: block drbd0: Found 4 transactions (192 active extents) in activity log.
Oct 3 19:30:41 PBX1 kernel: block drbd0: Method to ensure write ordering: flush
Oct 3 19:30:41 PBX1 kernel: block drbd0: drbd_bm_resize called with capacity == 28411864
Oct 3 19:30:41 PBX1 kernel: block drbd0: resync bitmap: bits=3551483 words=55492 pages=109
Oct 3 19:30:41 PBX1 kernel: block drbd0: size = 14 GB (14205932 KB)
Oct 3 19:30:41 PBX1 kernel: block drbd0: bitmap READ of 109 pages took 13 jiffies
Oct 3 19:30:41 PBX1 kernel: block drbd0: recounting of set bits took additional 1 jiffies
Oct 3 19:30:41 PBX1 kernel: block drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
Oct 3 19:30:41 PBX1 kernel: block drbd0: disk( Attaching -> UpToDate )
Oct 3 19:30:41 PBX1 kernel: block drbd0: attached to UUIDs 760BE94B0D6CDF8A:0000000000000000:819257FD75678FC9:819157FD75678FC9
Oct 3 19:30:41 PBX1 kernel: block drbd0: conn( StandAlone -> Unconnected )
Oct 3 19:30:41 PBX1 kernel: block drbd0: Starting receiver thread (from drbd0_worker [14408])
Oct 3 19:30:41 PBX1 kernel: block drbd0: receiver (re)started
Oct 3 19:30:41 PBX1 kernel: block drbd0: conn( Unconnected -> WFConnection )
Oct 3 19:30:41 PBX1 attrd[2503]: notice: attrd_trigger_update: Sending flush op to all hosts for: master-drbd_drbd0 (1000)
Oct 3 19:30:41 PBX1 attrd[2503]: notice: attrd_perform_update: Sent update 336: master-drbd_drbd0=1000
Oct 3 19:30:41 PBX1 attrd[2503]: notice: attrd_perform_update: Sent update 338: master-drbd_drbd0=1000
Oct 3 19:30:41 PBX1 crmd[2505]: notice: process_lrm_event: LRM operation drbd_drbd0_start_0 (call=2062, rc=0, cib-update=2741, confirmed=true) ok
Oct 3 19:30:41 PBX1 crmd[2505]: notice: te_rsc_command: Initiating action 106: notify drbd_drbd0_post_notify_start_0 on pbx1 (local)
Oct 3 19:30:41 PBX1 crmd[2505]: notice: te_rsc_command: Initiating action 107: notify drbd_drbd0:1_post_notify_start_0 on pbx2
Oct 3 19:30:41 PBX1 crmd[2505]: notice: process_lrm_event: LRM operation drbd_drbd0_notify_0 (call=2065, rc=0, cib-update=0, confirmed=true) ok
Oct 3 19:30:41 PBX1 crmd[2505]: notice: run_graph: Transition 2115 (Complete=10, Pending=0, Fired=0, Skipped=2, Incompl
pcs 配置中用于 ping 的部分是
Attributes: drbd_resource=drbd0
Operations: monitor interval=10s (drbd_drbd0-monitor-interval-10s)
Clone: Connectivity
Resource: p_ping (class=ocf provider=pacemaker type=ping)
Attributes: host_list="10.66.9.4 10.66.9.5 10.66.9.11 10.66.9.252 10.66.9.253 10.66.9.254" multiplier=1000 dampen=5s
Operations: monitor interval=1s (p_ping-monitor-interval-1s)
重要的信息是,由于一年前进行的重新设计,这些地址中只有 10.66.9.11 还有效(我打算清除无效地址,但这是另一个故事),但问题最多持续一两个月。可能是因为这个原因,或者如果您有其他想法,请告诉我。 PS 电缆良好,并且根据为监控配置的 cacti,到 10.66.9.11 接口的 icmp 没有丢失
Oct 3 19:18:31 PBX1 asterisk(asterisk)[1338]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct 3 19:19:01 PBX1 asterisk(asterisk)[1960]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct 3 19:19:14 PBX1 Xinetd(Xinetd)[2200]: INFO: checking "disable" in /etc/xinetd.d/tftp
Oct 3 19:19:32 PBX1 asterisk(asterisk)[2347]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct 3 19:20:02 PBX1 asterisk(asterisk)[2994]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct 3 19:20:28 PBX1 dhcpd: isc-dhcpd-4.1.1-P1
Oct 3 19:20:32 PBX1 asterisk(asterisk)[3369]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct 3 19:21:02 PBX1 asterisk(asterisk)[3970]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct 3 19:21:08 PBX1 crmd[2505]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]
Oct 3 19:21:08 PBX1 pengine[2504]: notice: unpack_config: On loss of CCM Quorum: Ignore
Oct 3 19:21:08 PBX1 pengine[2504]: notice: process_pe_message: Calculated Transition 2110: /var/lib/pacemaker/pengine/pe-input-3292.bz2
Oct 3 19:21:08 PBX1 crmd[2505]: notice: run_graph: Transition 2110 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-3292.bz2): Complete
Oct 3 19:21:08 PBX1 crmd[2505]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Oct 3 19:21:14 PBX1 Xinetd(Xinetd)[4176]: INFO: checking "disable" in /etc/xinetd.d/tftp
Oct 3 19:21:32 PBX1 asterisk(asterisk)[4318]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct 3 19:22:02 PBX1 asterisk(asterisk)[4989]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct 3 19:22:28 PBX1 dhcpd: isc-dhcpd-4.1.1-P1
Oct 3 19:22:32 PBX1 asterisk(asterisk)[5352]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct 3 19:23:02 PBX1 asterisk(asterisk)[5939]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct 3 19:23:14 PBX1 Xinetd(Xinetd)[6160]: INFO: checking "disable" in /etc/xinetd.d/tftp
Oct 3 19:23:32 PBX1 asterisk(asterisk)[6302]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct 3 19:24:02 PBX1 asterisk(asterisk)[6933]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct 3 19:24:28 PBX1 dhcpd: isc-dhcpd-4.1.1-P1
Oct 3 19:24:32 PBX1 asterisk(asterisk)[7276]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct 3 19:25:02 PBX1 asterisk(asterisk)[7895]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct 3 19:25:14 PBX1 Xinetd(Xinetd)[8118]: INFO: checking "disable" in /etc/xinetd.d/tftp
Oct 3 19:25:32 PBX1 asterisk(asterisk)[8259]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct 3 19:25:37 PBX1 named[27202]: error (network unreachable) resolving 'ns.isc.afilias-nst.info/A/IN': 2a01:8840:6::1#53
Oct 3 19:25:37 PBX1 named[27202]: error (network unreachable) resolving 'ns.isc.afilias-nst.info/AAAA/IN': 2a01:8840:6::1#53
Oct 3 19:25:37 PBX1 named[27202]: error (network unreachable) resolving 'ns1.isc.ultradns.net/A/IN': 2001:dcd:4::2#53
Oct 3 19:25:37 PBX1 named[27202]: error (network unreachable) resolving 'ns2.isc.ultradns.net/A/IN': 2001:dcd:4::2#53
Oct 3 19:25:37 PBX1 named[27202]: error (network unreachable) resolving 'ns1.isc.ultradns.net/AAAA/IN': 2001:dcd:4::2#53
Oct 3 19:25:37 PBX1 named[27202]: error (network unreachable) resolving 'ns2.isc.ultradns.net/AAAA/IN': 2001:dcd:4::2#53
Oct 3 19:26:02 PBX1 asterisk(asterisk)[8873]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct 3 19:26:28 PBX1 dhcpd: isc-dhcpd-4.1.1-P1
Oct 3 19:26:32 PBX1 asterisk(asterisk)[9233]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct 3 19:27:02 PBX1 asterisk(asterisk)[9841]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct 3 19:27:14 PBX1 Xinetd(Xinetd)[10088]: INFO: checking "disable" in /etc/xinetd.d/tftp
Oct 3 19:27:32 PBX1 asterisk(asterisk)[10230]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct 3 19:28:02 PBX1 asterisk(asterisk)[10862]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct 3 19:28:28 PBX1 dhcpd: isc-dhcpd-4.1.1-P1
Oct 3 19:28:32 PBX1 asterisk(asterisk)[11224]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct 3 19:29:03 PBX1 asterisk(asterisk)[11810]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct 3 19:29:14 PBX1 Xinetd(Xinetd)[12031]: INFO: checking "disable" in /etc/xinetd.d/tftp
Oct 3 19:29:33 PBX1 asterisk(asterisk)[12174]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct 3 19:30:03 PBX1 asterisk(asterisk)[12818]: INFO: 0 active channels 0 active calls 5999 calls processed
Oct 3 19:30:23 PBX1 attrd[2503]: notice: attrd_perform_update: Sent update 327: pingd=0
Oct 3 19:30:23 PBX1 crmd[2505]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Oct 3 19:30:23 PBX1 pengine[2504]: notice: unpack_config: On loss of CCM Quorum: Ignore
Oct 3 19:30:23 PBX1 pengine[2504]: notice: LogActions: Stop drbd_drbd0:0#011(pbx2..bg)
Oct 3 19:30:23 PBX1 pengine[2504]: notice: process_pe_message: Calculated Transition 2111: /var/lib/pacemaker/pengine/pe-input-3293.bz2
Oct 3 19:30:23 PBX1 pengine[2504]: notice: unpack_config: On loss of CCM Quorum: Ignore
Oct 3 19:30:23 PBX1 pengine[2504]: notice: LogActions: Stop drbd_drbd0:0#011(pbx2..bg)
Oct 3 19:30:23 PBX1 pengine[2504]: notice: LogActions: Demote drbd_drbd0:1#011(Master -> Stopped pbx1)
Oct 3 19:30:23 PBX1 pengine[2504]: notice: LogActions: Stop ClusPBXIP#011(pbx1)
Oct 3 19:30:23 PBX1 pengine[2504]: notice: LogActions: Stop ClusNetIP#011(pbx1)
Oct 3 19:30:23 PBX1 pengine[2504]: notice: LogActions: Stop ClusPBXIP-R#011(pbx1)
Oct 3 19:30:23 PBX1 pengine[2504]: notice: LogActions: Stop ClusPBXIP-R2#011(pbx1)
Oct 3 19:30:23 PBX1 pengine[2504]: notice: LogActions: Stop DSKPBXIP-R#011(pbx1)
Oct 3 19:30:23 PBX1 pengine[2504]: notice: LogActions: Stop Fixr_ClusPBXIP#011(pbx1)
Oct 3 19:30:23 PBX1 pengine[2504]: notice: LogActions: Stop DSKPBXIP-R5#011(pbx1)
Oct 3 19:30:23 PBX1 pengine[2504]: notice: LogActions: Stop DSKPBXIP-R6#011(pbx1)
Oct 3 19:30:23 PBX1 pengine[2504]: notice: LogActions: Stop DSKPBXIP-R7#011(pbx1)
答案1
这PBX1 attrd[2503]: notice: attrd_perform_update: Sent update 327: pingd=0
有点暴露了。这里的 pingd 分数是 0(没有 ping)。然后它显示它停止了所有资源(10 月 3 日 19:30:23)。
稍后我们看到 pingd 分数达到 1000,然后服务再次重新启动:PBX1 attrd[2503]: notice: attrd_perform_update: Sent update 333: pingd=1000
(10 月 3 日 19:30:41)。
我没有完整的 Pacemaker CIB,但我怀疑您有一些位置/主机托管规则,规定资源无法在 pingd 分数等于 0 的地方运行。因此,当您的 pingd 分数达到零时,资源将在整个集群范围内停止,然后在 ping 再次返回时重新启动。虽然资源可能在对等节点上启动,但我认为这不是故障转移的问题,而是由于 pingd 资源和规则导致服务停止和重新启动。
虽然您确实提到 Cacti 中的 ping 没有问题:
PS 电缆良好,并且根据为监控配置的 cacti,到 10.66.9.11 接口的 icmp 没有丢失
cacti 多久监控一次?如果 ping 连续 5 秒未返回,Pacemaker 就会做出反应。此外,cacti 数据可能只是从 cacti 服务器到 10.66.9.11 服务器的 ping。这并不一定意味着 PBX 集群和 10.66.9.11 系统之间没有问题。
我建议要么向 host_list 添加更多 ping 节点,要么增加 p_ping 的监控间隔和阻尼时间,要么完全取消 p_ping 资源和规则。