I wrote a bash loop and saved it as freud.sh.
freud.sh:
#!/bin/bash
set -x
while :
do
sleep 30
systemctl stop pacemaker
sleep 30
systemctl start pacemaker
done
Here is my unit file:
[Unit]
Description=Freud SplitBrain Service
[Service]
ExecStart=/root/utils/freud.sh
[Install]
WantedBy=multi-user.target
The first time the bash script gets to the point of starting the pacemaker service (systemctl start pacemaker), systemd restarts my service instead, as shown by "journalctl -fu freud":
Jul 08 15:19:52 ENG_QA-HA2 systemd[1]: Started Freud SplitBrain Service.
Jul 08 15:19:53 ENG_QA-HA2 freud.sh[5460]: + :
Jul 08 15:19:53 ENG_QA-HA2 freud.sh[5460]: + sleep 30
Jul 08 15:20:23 ENG_QA-HA2 freud.sh[5460]: + systemctl stop pacemaker
Jul 08 15:20:26 ENG_QA-HA2 freud.sh[5460]: + sleep 30
Jul 08 15:20:39 ENG_QA-HA2 systemd[1]: Stopping Freud SplitBrain Service...
Jul 08 15:20:39 ENG_QA-HA2 systemd[1]: Stopped Freud SplitBrain Service.
Jul 08 15:20:39 ENG_QA-HA2 systemd[1]: Started Freud SplitBrain Service.
Jul 08 15:20:39 ENG_QA-HA2 freud.sh[6897]: + :
Jul 08 15:20:39 ENG_QA-HA2 freud.sh[6897]: + sleep 30
Jul 08 15:21:09 ENG_QA-HA2 freud.sh[6897]: + systemctl stop pacemaker
Jul 08 15:21:09 ENG_QA-HA2 freud.sh[6897]: + sleep 30
Jul 08 15:21:39 ENG_QA-HA2 freud.sh[6897]: + systemctl start pacemaker
Jul 08 15:21:39 ENG_QA-HA2 freud.sh[6897]: + :
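On subsequent passes it works, as the log above shows. However, my "real" script has logic inside the while loop that decides when pacemaker needs to be stopped and started, and the restart leaves the pacemaker service in a stopped state.
When I run the script as a plain bash script in an ssh session or on the console, it works as expected. Does anyone know how to troubleshoot why my service is restarted on the first attempt?
Let me know if you need more information.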
Edit:
[Unit]
Description=Pacemaker High Availability Cluster Manager
Documentation=man:pacemakerd
Documentation=https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained/index.html
# DefaultDependencies takes care of sysinit.target,
# basic.target, and shutdown.target
# We need networking to bind to a network address. It is recommended not to
# use Wants or Requires with network.target, and not to use
# network-online.target for server daemons.
After=network.target
# Time syncs can make the clock jump backward, which messes with logging
# and failure timestamps, so wait until it's done.
After=time-sync.target
# Managing systemd resources requires DBus.
After=dbus.service
Wants=dbus.service
# Some OCF resources may have dependencies that aren't managed by the cluster;
# these must be started before Pacemaker and stopped after it. The
# resource-agents package provides this target, which lets system adminstrators
# add drop-ins for those dependencies.
After=resource-agents-deps.target
Wants=resource-agents-deps.target
After=syslog.service
After=rsyslog.service
After=corosync.service
Requires=corosync.service
[Install]
WantedBy=multi-user.target
[Service]
Type=simple
KillMode=process
NotifyAccess=main
EnvironmentFile=-/etc/sysconfig/pacemaker
EnvironmentFile=-/etc/sysconfig/sbd
SuccessExitStatus=100
ExecStart=/usr/sbin/pacemakerd -f
# Uncomment TasksMax if your systemd version supports it.
# Only systemd v227 and above support this option.
#TasksMax=infinity
# If pacemakerd doesn't stop, it's probably waiting on a cluster
# resource. Sending -KILL will just get the node fenced
SendSIGKILL=no
# If we ever hit the StartLimitInterval/StartLimitBurst limit and the
# admin wants to stop the cluster while pacemakerd is not running, it
# might be a good idea to enable the ExecStopPost directive below.
#
# Although the node will likely end up being fenced as a result so it's
# not on by default
#
# ExecStopPost=/usr/bin/killall -TERM crmd attrd stonithd cib pengine lrmd
# If you want Corosync to stop whenever Pacemaker is stopped,
# uncomment the next line too:
#
# ExecStopPost=/bin/sh -c 'pidof crmd || killall -TERM corosync'
# Uncomment this for older versions of systemd that didn't support
# TimeoutStopSec
# TimeoutSec=30min
# Pacemaker can only exit after all managed services have shut down
# A HA database could conceivably take even longer than this
TimeoutStopSec=30min
TimeoutStartSec=60s
# Restart options include: no, on-success, on-failure, on-abort or always
Restart=on-failure
# crm_perror() writes directly to stderr, so ignore it here
# to avoid double-logging with the wrong format
StandardError=null
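I have included the contents of the Pacemaker service unit file above. Interestingly, running the command "pcs node maintenance $HOSTNAME" produces the same behavior, so I am starting to think you are on to something: it may be the pacemakerd process, rather than systemd, that is causing this.
Any idea how to troubleshoot it?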
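Further edit:
After investigating further with strace, it appears that pacemaker is forking and causing the systemd service to stop. At least, that is how a developer explained it to me. Thanks again for any help.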
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: write(2, "+ pcs node maintenance ENG_QA-HA"..., 34+ pcs node maintenance ENG_QA-HA2
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: ) = 34
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: stat(".", {st_mode=S_IFDIR|0555, st_size=4096, ...}) = 0
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: stat("/usr/local/sbin/pcs", 0x7fffec2c6530) = -1 ENOENT (No such file or directory)
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: stat("/usr/local/bin/pcs", 0x7fffec2c6530) = -1 ENOENT (No such file or directory)
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: stat("/usr/sbin/pcs", {st_mode=S_IFREG|0755, st_size=292, ...}) = 0
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: stat("/usr/sbin/pcs", {st_mode=S_IFREG|0755, st_size=292, ...}) = 0
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: geteuid() = 0
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: getegid() = 0
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: getuid() = 0
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: getgid() = 0
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: access("/usr/sbin/pcs", X_OK) = 0
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: stat("/usr/sbin/pcs", {st_mode=S_IFREG|0755, st_size=292, ...}) = 0
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: geteuid() = 0
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: getegid() = 0
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: getuid() = 0
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: getgid() = 0
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: access("/usr/sbin/pcs", R_OK) = 0
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: stat("/usr/sbin/pcs", {st_mode=S_IFREG|0755, st_size=292, ...}) = 0
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: stat("/usr/sbin/pcs", {st_mode=S_IFREG|0755, st_size=292, ...}) = 0
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: geteuid() = 0
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: getegid() = 0
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: getuid() = 0
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: getgid() = 0
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: access("/usr/sbin/pcs", X_OK) = 0
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: stat("/usr/sbin/pcs", {st_mode=S_IFREG|0755, st_size=292, ...}) = 0
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: geteuid() = 0
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: getegid() = 0
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: getuid() = 0
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: getgid() = 0
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: access("/usr/sbin/pcs", R_OK) = 0
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: rt_sigprocmask(SIG_BLOCK, [INT CHLD], [], 8) = 0
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: rt_sigprocmask(SIG_BLOCK, [CHLD], [INT CHLD], 8) = 0
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: rt_sigprocmask(SIG_SETMASK, [INT CHLD], NULL, 8) = 0
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f188e823a10) = 32205
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: rt_sigaction(SIGINT, {0x43e860, [], SA_RESTORER, 0x7f188de513b0}, {SIG_DFL, [], SA_RESTORER, 0x7f188de513b0}, 8) = 0
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 32205
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=32205, si_uid=0, si_status=0, si_utime=11, si_stime=2} ---
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: wait4(-1, 0x7fffec2c5f50, WNOHANG, NULL) = -1 ECHILD (No child processes)
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: rt_sigreturn({mask=[]}) = 0
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: rt_sigaction(SIGINT, {SIG_DFL, [], SA_RESTORER, 0x7f188de513b0}, {0x43e860, [], SA_RESTORER, 0x7f188de513b0}, 8) = 0
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: write(2, "+ sleep 90\n", 11+ sleep 90
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: ) = 11
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: rt_sigprocmask(SIG_BLOCK, [INT CHLD], [], 8) = 0
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f188e823a10) = 32411
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: rt_sigaction(SIGINT, {0x43e860, [], SA_RESTORER, 0x7f188de513b0}, {SIG_DFL, [], SA_RESTORER, 0x7f188de513b0}, 8) = 0
Jul 15 10:43:26 ENG_QA-HA2 systemd[1]: Stopping Freud SplitBrain Service...
Thanks!
Answer 1
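I worked around this by writing a sentinel file and stopping pacemaker only when my conditions are met. When systemd restarts the service, the script checks for that file at startup and, if it finds it, runs a start on pacemaker. There is probably a better way to handle this, but the problem is technically solved and I can move on until I find a better solution. Thanks, everyone, for the help!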
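For anyone who wants to try the same workaround, here is a minimal sketch of the idea, not my exact script. The sentinel path, the SYSTEMCTL override and the FREUD_RUN guard are assumptions made for illustration, and the fixed 30-second sleeps from freud.sh stand in for the real stop/start conditions.

```shell
#!/bin/bash
# Sketch of the sentinel-file workaround. The sentinel path and the
# SYSTEMCTL / FREUD_RUN variables are hypothetical, not from the real script.
SENTINEL="${SENTINEL:-/run/freud.pacemaker-stopped}"
SYSTEMCTL="${SYSTEMCTL:-systemctl}"   # overridable so the logic can be dry-run

# Called once at startup: if an earlier instance was stopped after it had
# stopped pacemaker, the sentinel is still present, so start pacemaker again.
recover_if_interrupted() {
    if [ -e "$SENTINEL" ]; then
        "$SYSTEMCTL" start pacemaker && rm -f "$SENTINEL"
    fi
}

# Write the sentinel before stopping, so an interruption right after the stop
# can be detected by the next instance.
stop_pacemaker() {
    touch "$SENTINEL"
    "$SYSTEMCTL" stop pacemaker
}

# Remove the sentinel only after pacemaker has been started successfully.
start_pacemaker() {
    "$SYSTEMCTL" start pacemaker && rm -f "$SENTINEL"
}

# The loop runs only when FREUD_RUN=1, so the functions above can also be
# sourced on their own.
if [ "${FREUD_RUN:-0}" = "1" ]; then
    recover_if_interrupted
    while :
    do
        sleep 30          # placeholder for the real "should I stop?" check
        stop_pacemaker
        sleep 30          # placeholder for the real "should I start?" check
        start_pacemaker
    done
fi
```

With this layout the unit would need FREUD_RUN=1 in its environment (for example via Environment=FREUD_RUN=1 in the [Service] section). The sentinel is written before the stop and removed only after a successful start, so an interruption anywhere in between is picked up by the next instance.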