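I have an active-passive Heartbeat cluster running Apache, MySQL, ActiveMQ, and DRBD.
Today I wanted to perform hardware maintenance on the secondary node (node04), so I stopped the heartbeat service on it before shutting it down.
The primary node (node03) then received a shutdown notice from the secondary node (node04).
This log is from the primary node, node03: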
heartbeat[4458]: 2010/03/08_08:52:56 info: Received shutdown notice from 'node04.companydomain.nl'.
heartbeat[4458]: 2010/03/08_08:52:56 info: Resources being acquired from node04.companydomain.nl.
harc[27522]: 2010/03/08_08:52:56 info: Running /etc/ha.d/rc.d/status status
heartbeat[27523]: 2010/03/08_08:52:56 info: Local Resource acquisition completed.
mach_down[27567]: 2010/03/08_08:52:56 info: /usr/share/heartbeat/mach_down: nice_failback: foreign resources acquired
mach_down[27567]: 2010/03/08_08:52:56 info: mach_down takeover complete for node node04.companydomain.nl.
heartbeat[4458]: 2010/03/08_08:52:56 info: mach_down takeover complete.
harc[27620]: 2010/03/08_08:52:56 info: Running /etc/ha.d/rc.d/ip-request-resp ip-request-resp
ip-request-resp[27620]: 2010/03/08_08:52:56 received ip-request-resp drbddisk OK yes
ResourceManager[27645]: 2010/03/08_08:52:56 info: Acquiring resource group: node03.companydomain.nl drbddisk Filesystem::/dev/drbd0::/data::ext3 mysql apache::/etc/httpd/conf/httpd.conf LVSSyncDaemonSwap::master monitor activemq tivoli-cluster MailTo::[email protected]::DRBDFailureAcc MailTo::[email protected]::DRBDFailureAcc 1.2.3.212
ResourceManager[27645]: 2010/03/08_08:52:56 info: Running /etc/ha.d/resource.d/drbddisk start
Filesystem[27700]: 2010/03/08_08:52:57 INFO: Running OK
ResourceManager[27645]: 2010/03/08_08:52:57 info: Running /etc/ha.d/resource.d/mysql start
mysql[27783]: 2010/03/08_08:52:57 Starting MySQL[ OK ]
apache[27853]: 2010/03/08_08:52:57 INFO: Running OK
ResourceManager[27645]: 2010/03/08_08:52:57 info: Running /etc/ha.d/resource.d/monitor start
monitor[28160]: 2010/03/08_08:52:58
ResourceManager[27645]: 2010/03/08_08:52:58 info: Running /etc/ha.d/resource.d/activemq start
activemq[28210]: 2010/03/08_08:52:58 Starting ActiveMQ Broker... ActiveMQ Broker is already running.
ResourceManager[27645]: 2010/03/08_08:52:58 ERROR: Return code 1 from /etc/ha.d/resource.d/activemq
ResourceManager[27645]: 2010/03/08_08:52:58 CRIT: Giving up resources due to failure of activemq
ResourceManager[27645]: 2010/03/08_08:52:58 info: Releasing resource group: node03.companydomain.nl drbddisk Filesystem::/dev/drbd0::/data::ext3 mysql apache::/etc/httpd/conf/httpd.conf LVSSyncDaemonSwap::master monitor activemq tivoli-cluster MailTo::[email protected]::DRBDFailureAcc MailTo::[email protected]::DRBDFailureAcc 1.2.3.212
ResourceManager[27645]: 2010/03/08_08:52:58 info: Running /etc/ha.d/resource.d/IPaddr 1.2.3.212 stop
IPaddr[28329]: 2010/03/08_08:52:58 INFO: ifconfig eth0:0 down
IPaddr[28312]: 2010/03/08_08:52:58 INFO: Success
ResourceManager[27645]: 2010/03/08_08:52:58 info: Running /etc/ha.d/resource.d/MailTo [email protected] DRBDFailureAcc stop
MailTo[28378]: 2010/03/08_08:52:58 INFO: Success
ResourceManager[27645]: 2010/03/08_08:52:58 info: Running /etc/ha.d/resource.d/MailTo [email protected] DRBDFailureAcc stop
MailTo[28433]: 2010/03/08_08:52:58 INFO: Success
ResourceManager[27645]: 2010/03/08_08:52:58 info: Running /etc/ha.d/resource.d/tivoli-cluster stop
ResourceManager[27645]: 2010/03/08_08:52:58 info: Running /etc/ha.d/resource.d/activemq stop
activemq[28503]: 2010/03/08_08:53:01 Stopping ActiveMQ Broker... Stopped ActiveMQ Broker.
ResourceManager[27645]: 2010/03/08_08:53:01 info: Running /etc/ha.d/resource.d/monitor stop
monitor[28681]: 2010/03/08_08:53:01
ResourceManager[27645]: 2010/03/08_08:53:01 info: Running /etc/ha.d/resource.d/LVSSyncDaemonSwap master stop
LVSSyncDaemonSwap[28714]: 2010/03/08_08:53:02 info: ipvs_syncmaster down
LVSSyncDaemonSwap[28714]: 2010/03/08_08:53:02 info: ipvs_syncbackup up
LVSSyncDaemonSwap[28714]: 2010/03/08_08:53:02 info: ipvs_syncmaster released
ResourceManager[27645]: 2010/03/08_08:53:02 info: Running /etc/ha.d/resource.d/apache /etc/httpd/conf/httpd.conf stop
apache[28782]: 2010/03/08_08:53:03 INFO: Killing apache PID 18390
apache[28782]: 2010/03/08_08:53:03 INFO: apache stopped.
apache[28771]: 2010/03/08_08:53:03 INFO: Success
ResourceManager[27645]: 2010/03/08_08:53:03 info: Running /etc/ha.d/resource.d/mysql stop
mysql[28851]: 2010/03/08_08:53:24 Shutting down MySQL.....................[ OK ]
ResourceManager[27645]: 2010/03/08_08:53:24 info: Running /etc/ha.d/resource.d/Filesystem /dev/drbd0 /data ext3 stop
Filesystem[29010]: 2010/03/08_08:53:25 INFO: Running stop for /dev/drbd0 on /data
Filesystem[29010]: 2010/03/08_08:53:25 INFO: Trying to unmount /data
Filesystem[29010]: 2010/03/08_08:53:25 ERROR: Couldn't unmount /data; trying cleanup with SIGTERM
Filesystem[29010]: 2010/03/08_08:53:25 INFO: Some processes on /data were signalled
Filesystem[29010]: 2010/03/08_08:53:27 INFO: unmounted /data successfully
Filesystem[28999]: 2010/03/08_08:53:27 INFO: Success
ResourceManager[27645]: 2010/03/08_08:53:27 info: Running /etc/ha.d/resource.d/drbddisk stop
heartbeat[4458]: 2010/03/08_08:53:29 WARN: node node04.companydomain.nl: is dead
heartbeat[4458]: 2010/03/08_08:53:29 info: Dead node node04.companydomain.nl gave up resources.
heartbeat[4458]: 2010/03/08_08:53:29 info: Link node04.companydomain.nl:eth0 dead.
heartbeat[4458]: 2010/03/08_08:53:29 info: Link node04.companydomain.nl:eth1 dead.
hb_standby[29193]: 2010/03/08_08:53:57 Going standby [foreign].
heartbeat[4458]: 2010/03/08_08:53:57 info: node03.companydomain.nl wants to go standby [foreign]
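So... what just happened here?
- Heartbeat was stopped on node04, which notified node03, the active node at the time.
- For some reason, node03 decided to start the cluster processes that were already running. (For non-critical processes I always return 0 from the start script, so that a failure in a non-essential part does not bring down the whole cluster.)
- When ActiveMQ was started, it returned status 1 because it was already running.
- This caused the node to fail and shut everything down. Since heartbeat was not running on the secondary node, it could not fail over to it.
When I tried to run ha_takeover to bring the resources back up, nothing happened.
The resources only started again after I restarted heartbeat on the primary node (after a 2-minute delay).
These are my questions:
- Why did heartbeat on the primary node try to start the cluster processes again?
- Why didn't ha_takeover work?
- What can I do to prevent this from happening?
Server configuration:
DRBD: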
version: 8.3.7 (api:88/proto:86-91)
GIT-hash: ea9e28dbff98e331a62bcbcc63a6135808fe2917 build by [email protected], 2010-01-20 09:14:48
0: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate B r----
ns:0 nr:6459432 dw:6459432 dr:0 al:0 bm:301 lo:0 pe:0 ua:0 ap:0 ep:1 wo:d oos:0
uname -a
Linux node04 2.6.18-164.11.1.el5 #1 SMP Wed Jan 6 13:26:04 EST 2010 x86_64 x86_64 x86_64 GNU/Linux
haresources
node03.companydomain.nl \
drbddisk \
Filesystem::/dev/drbd0::/data::ext3 \
mysql \
apache::/etc/httpd/conf/httpd.conf \
LVSSyncDaemonSwap::master \
monitor \
activemq \
tivoli-cluster \
MailTo::[email protected]::DRBDFailureAcc \
MailTo::[email protected]::DRBDFailureAcc \
1.2.3.212
ha.cf
debugfile /var/log/ha-debug
logfile /var/log/ha-log
keepalive 500ms
deadtime 30
warntime 10
initdead 120
udpport 694
mcast eth0 225.0.0.3 694 1 0
mcast eth1 225.0.0.4 694 1 0
auto_failback off
node node03.companydomain.nl
node node04.companydomain.nl
respawn hacluster /usr/lib64/heartbeat/dopd
apiauth dopd gid=haclient uid=hacluster
Many thanks in advance,
Ger Apeldoorn
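Answer 1
For what it's worth, I feel your pain. It seems heartbeat treats the loss of the passive node the same as a takeover from the passive node, so it starts its services. When a start script fails and there is no other node to fail over to, heartbeat remains on the primary node and shuts down all services. When that happens, the only way to get running again is to restart heartbeat.
We worked around this by writing a script that starts the cluster services (IP, FS mount, ipvsadm, Apache, etc.) only if they are not already running. We make sure the "all-in-one" init script returns non-zero only when a start actually fails (not for warnings such as "already running"), to avoid this kind of problem.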
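A minimal sketch of that idea, not our actual script. It assumes the wrapped init script supports a `status` action that exits 0 only while the service is running (the LSB convention). The function name `ha_resource` and the path in the closing comment are hypothetical:

```shell
#!/bin/sh
# Sketch of an idempotent wrapper around a vendor init script.
# Assumption: "$init_script status" exits 0 only when the service is running.

ha_resource() {
    init_script=$1   # command or path of the real init script
    action=$2        # start | stop | anything else is passed through

    case "$action" in
        start)
            # "Already running" counts as success, not as a failure.
            if "$init_script" status >/dev/null 2>&1; then
                return 0
            fi
            # Not running: a non-zero exit code here is a real start failure.
            "$init_script" start
            return $?
            ;;
        stop)
            # "Already stopped" counts as success as well.
            if ! "$init_script" status >/dev/null 2>&1; then
                return 0
            fi
            "$init_script" stop
            return $?
            ;;
        *)
            # Pass every other action (status, restart, ...) straight through.
            "$init_script" "$action"
            return $?
            ;;
    esac
}

# Hypothetical use inside a resource script such as /etc/ha.d/resource.d/activemq:
#   ha_resource /etc/init.d/activemq "$1"; exit $?
```

With a wrapper like this, the "ActiveMQ Broker is already running" case from the log would exit 0 and heartbeat would carry on with the rest of the resource group.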
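Answer 2
This is not a heartbeat bug. It is a common bug in some init scripts. Read the manual; the standard states:
- Stopping a resource that is already stopped is not an error
- Starting a resource that is already started is not an error
So what went wrong? ActiveMQ was already up and running.
That is not an error! But the script returned 1 (error) instead of 0 (OK), so heartbeat concluded that there was an error and stopped the entire resource group.
So if you use init scripts with heartbeat, make sure they are LSB-compliant.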
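One rough way to check those two rules is to call the script twice in each direction and look at the exit codes. The sketch below is only an illustration: the function name `check_lsb_idempotent` and the example path are hypothetical, and it really starts and stops the service, so try it on a test node rather than on the live cluster:

```shell
#!/bin/sh
# Sketch: check that an init script treats a repeated start/stop as success.
# WARNING: this actually starts and stops the service it is given.

check_lsb_idempotent() {
    script=$1   # command or path of the init script under test

    "$script" start >/dev/null 2>&1 \
        || { echo "FAIL: first start returned non-zero"; return 1; }
    # Starting an already started resource must not be an error.
    "$script" start >/dev/null 2>&1 \
        || { echo "FAIL: start while running returned non-zero"; return 1; }
    "$script" stop >/dev/null 2>&1 \
        || { echo "FAIL: first stop returned non-zero"; return 1; }
    # Stopping an already stopped resource must not be an error.
    "$script" stop >/dev/null 2>&1 \
        || { echo "FAIL: stop while stopped returned non-zero"; return 1; }

    echo "OK"
}

# Hypothetical usage:
#   check_lsb_idempotent /etc/init.d/activemq
```

Going by the log in the question, the activemq script would fail at the second start, which is exactly the exit code 1 that made heartbeat give up the resource group.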