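I have primary-server failover set up for HAProxy using Heartbeat on two servers. It has been running smoothly for some time. Today our service went down for a few minutes because the secondary server decided the primary was down. It tried to take over the shared IP but could not, because the primary was still using that IP. According to the logs, however, the primary appears to have been communicating with the secondary about the takeover the whole time, so this makes no sense.
After fixing the problem by restarting Heartbeat on the primary, I noticed that the primary's clock differed from the secondary's by about 5 minutes. Does Heartbeat use time to decide whether a resource is down?
ServerNode1 syslog: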
Jun 8 14:25:51 serverNode heartbeat: [15461]: ERROR: Both machines own our resources!
Jun 8 14:25:52 serverNode heartbeat: [15461]: ERROR: Both machines own our resources!
Jun 8 14:25:56 serverNode heartbeat: [15461]: info: Received shutdown notice from 'serverNode2'.
Jun 8 14:25:56 serverNode heartbeat: [15461]: info: Resources being acquired from serverNode2.
Jun 8 14:25:56 serverNode heartbeat: [15461]: debug: StartNextRemoteRscReq(): child count 1
Jun 8 14:25:56 serverNode heartbeat: [18058]: info: acquire local HA resources (standby).
Jun 8 14:25:56 serverNode ResourceManager[18087]: info: Acquiring resource group: serverNode xxx.xxx.xxx.88
Jun 8 14:25:56 serverNode IPaddr[18124]: INFO: Running OK
Jun 8 14:25:56 serverNode IPaddr[18138]: INFO: Running OK
Jun 8 14:25:56 serverNode heartbeat: [18059]: info: Local Resource acquisition completed.
Jun 8 14:25:56 serverNode heartbeat: [18058]: info: local HA resource acquisition completed (standby).
Jun 8 14:25:56 serverNode heartbeat: [15461]: info: Standby resource acquisition done [foreign].
Jun 8 14:25:56 serverNode heartbeat: [15461]: debug: StartNextRemoteRscReq(): child count 1
Jun 8 14:25:56 serverNode heartbeat: [18184]: debug: notify_world: setting SIGCHLD Handler to SIG_DFL
Jun 8 14:25:56 serverNode harc[18184]: info: Running /etc/ha.d//rc.d/status status
Jun 8 14:25:56 serverNode mach_down[18199]: info: /usr/share/heartbeat/mach_down: nice_failback: foreign resources acquired
Jun 8 14:25:56 serverNode mach_down[18199]: info: mach_down takeover complete for node serverNode2.
Jun 8 14:25:56 serverNode heartbeat: [15461]: info: mach_down takeover complete.
ServerNode2 syslog:
Jun 8 14:31:33 serverNode2 heartbeat: [1407]: WARN: node serverNode: is dead
Jun 8 14:31:33 serverNode2 heartbeat: [1407]: WARN: No STONITH device configured.
Jun 8 14:31:33 serverNode2 heartbeat: [1407]: WARN: Shared disks are not protected.
Jun 8 14:31:33 serverNode2 heartbeat: [1407]: info: Resources being acquired from serverNode.
Jun 8 14:31:33 serverNode2 heartbeat: [1407]: info: Link serverNode:eth0 dead.
Jun 8 14:31:33 serverNode2 heartbeat: [30881]: debug: notify_world: setting SIGCHLD Handler to SIG_DFL
Jun 8 14:31:33 serverNode2 harc[30881]: info: Running /etc/ha.d//rc.d/status status
Jun 8 14:31:33 serverNode2 heartbeat: [30882]: info: No local resources [/usr/share/heartbeat/ResourceManager listkeys serverNode2] to acquire.
Jun 8 14:31:33 serverNode2 heartbeat: [1407]: debug: StartNextRemoteRscReq(): child count 1
Jun 8 14:31:33 serverNode2 mach_down[30909]: info: Taking over resource group xxx.xxx.xxx.88
Jun 8 14:31:33 serverNode2 ResourceManager[30934]: info: Acquiring resource group: serverNode xxx.xxx.xxx.88
Jun 8 14:31:33 serverNode2 IPaddr[30961]: INFO: Resource is stopped
Jun 8 14:31:33 serverNode2 ResourceManager[30934]: info: Running /etc/ha.d/resource.d/IPaddr xxx.xxx.xxx.88 start
Jun 8 14:31:33 serverNode2 IPaddr[31019]: INFO: Using calculated nic for xxx.xxx.xxx.88: eth0
Jun 8 14:31:33 serverNode2 IPaddr[31019]: INFO: Using calculated netmask for xxx.xxx.xxx.88: 255.255.255.0
Jun 8 14:31:33 serverNode2 IPaddr[31019]: INFO: eval ifconfig eth0:0 xxx.xxx.xxx.88 netmask 255.255.255.0 broadcast xxx.xxx.xxx.255
Jun 8 14:31:33 serverNode2 IPaddr[31007]: INFO: Success
Jun 8 14:31:33 serverNode2 mach_down[30909]: info: /usr/share/heartbeat/mach_down: nice_failback: foreign resources acquired
Jun 8 14:31:33 serverNode2 heartbeat: [1407]: info: mach_down takeover complete.
Jun 8 14:31:33 serverNode2 mach_down[30909]: info: mach_down takeover complete for node serverNode.
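Answer 1
No, inaccurate clocks will not break the relationship between the nodes. If the clock changes drastically, however, you will see a message in the log like this: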
heartbeat: 2004/11/10_21:08:49 info: Clock jumped backwards. Compensating.
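But it will not take down the primary node.
It looks like communication between the servers was interrupted. Specifically, it looks like server 1 was no longer able to send data, or server 2 was not receiving it properly. This could be caused by a buffer problem. Are you tracking network buffer space (via snmp or netstat)? Or it could be a network problem somewhere, such as a bad switch port.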
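One rough way to check the buffer question on a Linux host is to read the UDP error counters that `netstat -s` reports. The sketch below parses the `Udp:` lines of `/proc/net/snmp`; it assumes a Linux kernel that exposes `InErrors`, `RcvbufErrors` and `SndbufErrors` there, and that Heartbeat is running over UDP. The function names are made up for illustration.

```python
from pathlib import Path


def parse_udp_counters(snmp_text):
    """Return the Udp counters from /proc/net/snmp-style text as a dict."""
    # The file holds a header line and a value line, both prefixed "Udp:".
    # "UdpLite:" lines do not match this prefix and are skipped.
    udp_lines = [line.split() for line in snmp_text.splitlines()
                 if line.startswith("Udp:")]
    if len(udp_lines) < 2:
        raise ValueError("no Udp header/value pair found")
    header, values = udp_lines[0], udp_lines[1]
    return {name: int(value) for name, value in zip(header[1:], values[1:])}


def udp_drop_summary(snmp_text):
    """Pick out the counters that hint at dropped datagrams."""
    counters = parse_udp_counters(snmp_text)
    wanted = ("InErrors", "RcvbufErrors", "SndbufErrors")
    return {name: counters.get(name, 0) for name in wanted}


if __name__ == "__main__":
    snmp_file = Path("/proc/net/snmp")  # Linux only
    if snmp_file.exists():
        print(udp_drop_summary(snmp_file.read_text()))
```

These counters are cumulative since boot, so a single reading says little. Sampling them periodically on both nodes and looking for a jump around the time of the outage would be more telling.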
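When you say the site went down, do you have monitoring checks for the service on each serverX, run against that server's own IP? Apart from the VIP not working, did monitoring show either server as down during that time? Do traffic graphs or error/drop counters show anything interesting for that period?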
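A per-address check of this kind can be sketched as a plain TCP connect test run against each server's own IP and the VIP separately. This is only a minimal illustration, not a replacement for a real monitoring system; the function names are hypothetical, and the addresses and port you pass in are whatever your HAProxy setup actually uses.

```python
import socket


def tcp_check(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def check_all(targets, port, timeout=2.0):
    """Check every named address; targets maps a label to an IP or hostname."""
    return {name: tcp_check(addr, port, timeout)
            for name, addr in targets.items()}
```

For example, `check_all({"node1": "<node1 IP>", "node2": "<node2 IP>", "vip": "<shared IP>"}, 80)` would show whether only the VIP failed while both real addresses kept answering, which is what distinguishes a failover problem from a service outage.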
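Answer 2
More information is needed.
- Physical topology. How are these hosts physically connected to each other?
- The Heartbeat configuration (ha.cf) and iptables rules for each host. Specifically, whether you are using broadcast (bcast), multicast (mcast), or unicast (ucast). Also, please state the Heartbeat version.
I suspect something is filtering traffic between your Heartbeat nodes. Iptables is one possibility. Depending on your physical topology, other devices may also be suspect.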
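To collect the ha.cf details asked for above, a small script can pull out the communication-related directives so they can be compared across both nodes and against the firewall rules. The sketch below is an illustration only: the function name is made up, it handles just the `bcast`, `mcast`, `ucast`, `udpport` and timing directives, and it assumes Heartbeat's documented default of UDP port 694 when `udpport` is not set.

```python
def heartbeat_comm_settings(ha_cf_text):
    """Extract media, UDP port and timing directives from ha.cf-style text."""
    media = []      # list of (directive, [arguments]) in file order
    timing = {}     # keepalive / warntime / deadtime / initdead as written
    udpport = 694   # assumed default when ha.cf has no udpport line
    for raw in ha_cf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        parts = line.split()
        key, args = parts[0], parts[1:]
        if key in ("bcast", "mcast", "ucast"):
            media.append((key, args))
        elif key == "udpport" and args:
            udpport = int(args[0])
        elif key in ("keepalive", "warntime", "deadtime", "initdead") and args:
            timing[key] = args[0]
    return {"media": media, "udpport": udpport, "timing": timing}
```

Running it over `/etc/ha.d/ha.cf` on each node shows which interface and port carry the heartbeats. Any iptables rule or intermediate device that drops UDP on that port between the two hosts would be a candidate for the "node is dead" warning in the ServerNode2 log.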