Bash script running as a systemd service has a problem the first time it restarts another systemd service

2024-9-14

I wrote a bash loop and saved it as freud.sh.

freud.sh:

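#!/bin/bash
set -x  # trace each command so it shows up in the journal

# Stop and start pacemaker forever, pausing 30 seconds between steps.
while :
do
    sleep 30
    systemctl stop pacemaker
    sleep 30
    systemctl start pacemaker
done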

Here is my unit file:

[Unit]
Description=Freud SplitBrain Service

[Service]
ExecStart=/root/utils/freud.sh

[Install]
WantedBy=multi-user.target

systemd restarts my service the first time the bash script gets to the point of starting the pacemaker service (systemctl start pacemaker), as seen here in "journalctl -fu freud":

Jul 08 15:19:52 ENG_QA-HA2 systemd[1]: Started Freud SplitBrain Service.
Jul 08 15:19:53 ENG_QA-HA2 freud.sh[5460]: + :
Jul 08 15:19:53 ENG_QA-HA2 freud.sh[5460]: + sleep 30
Jul 08 15:20:23 ENG_QA-HA2 freud.sh[5460]: + systemctl stop pacemaker
Jul 08 15:20:26 ENG_QA-HA2 freud.sh[5460]: + sleep 30
Jul 08 15:20:39 ENG_QA-HA2 systemd[1]: Stopping Freud SplitBrain Service...
Jul 08 15:20:39 ENG_QA-HA2 systemd[1]: Stopped Freud SplitBrain Service.
Jul 08 15:20:39 ENG_QA-HA2 systemd[1]: Started Freud SplitBrain Service.
Jul 08 15:20:39 ENG_QA-HA2 freud.sh[6897]: + :
Jul 08 15:20:39 ENG_QA-HA2 freud.sh[6897]: + sleep 30
Jul 08 15:21:09 ENG_QA-HA2 freud.sh[6897]: + systemctl stop pacemaker
Jul 08 15:21:09 ENG_QA-HA2 freud.sh[6897]: + sleep 30
Jul 08 15:21:39 ENG_QA-HA2 freud.sh[6897]: + systemctl start pacemaker
Jul 08 15:21:39 ENG_QA-HA2 freud.sh[6897]: + :

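As the log above shows, it works on subsequent iterations. In my "real" script, however, the logic that decides whether pacemaker needs to be stopped and started has already run inside the while loop by the time the service is restarted, so the pacemaker service is left in a stopped state.

When I run the script as a plain bash script in an ssh session or on the console, it works as expected. Does anyone know how to troubleshoot why my service gets restarted on the first attempt?

Let me know if you need more information.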
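One way to start narrowing this down might be to read the journals of both services side by side rather than only the one for freud, so that whatever happens around the unexpected "Stopping Freud SplitBrain Service..." line is visible in one timeline. The following is a minimal sketch, not a definitive procedure: the function name `unit_timeline` and the `JOURNALCTL` override are hypothetical and exist only for illustration; the `journalctl` options used (`-u`, `--since`, `-o short-precise`, `--no-pager`) are standard.

```shell
#!/bin/bash
# Hypothetical helper; JOURNALCTL is overridable only so the sketch can be dry-run.
JOURNALCTL="${JOURNALCTL:-journalctl}"

# Print one merged, time-ordered log for the given units since a start time.
# Usage: unit_timeline <since> <unit>...
unit_timeline() {
    local since="$1"; shift
    local args=(--no-pager -o short-precise --since "$since")
    local unit
    for unit in "$@"; do
        args+=(-u "$unit")   # each -u adds another unit to the same view
    done
    "$JOURNALCTL" "${args[@]}"
}
```

For example, `unit_timeline today freud pacemaker corosync` would interleave the three units' messages, which may make it easier to see what was logged just before the stop.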

Edit:

[Unit]
Description=Pacemaker High Availability Cluster Manager
Documentation=man:pacemakerd
Documentation=https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained/index.html

# DefaultDependencies takes care of sysinit.target,
# basic.target, and shutdown.target

# We need networking to bind to a network address. It is recommended not to
# use Wants or Requires with network.target, and not to use
# network-online.target for server daemons.
After=network.target

# Time syncs can make the clock jump backward, which messes with logging
# and failure timestamps, so wait until it's done.
After=time-sync.target

# Managing systemd resources requires DBus.
After=dbus.service
Wants=dbus.service

# Some OCF resources may have dependencies that aren't managed by the cluster;
# these must be started before Pacemaker and stopped after it. The
# resource-agents package provides this target, which lets system administrators
# add drop-ins for those dependencies.
After=resource-agents-deps.target
Wants=resource-agents-deps.target

After=syslog.service
After=rsyslog.service
After=corosync.service
Requires=corosync.service


[Install]
WantedBy=multi-user.target


[Service]
Type=simple
KillMode=process
NotifyAccess=main
EnvironmentFile=-/etc/sysconfig/pacemaker
EnvironmentFile=-/etc/sysconfig/sbd
SuccessExitStatus=100

ExecStart=/usr/sbin/pacemakerd -f

# Uncomment TasksMax if your systemd version supports it.
# Only systemd v227 and above support this option.
#TasksMax=infinity

# If pacemakerd doesn't stop, it's probably waiting on a cluster
# resource.  Sending -KILL will just get the node fenced
SendSIGKILL=no

# If we ever hit the StartLimitInterval/StartLimitBurst limit and the
# admin wants to stop the cluster while pacemakerd is not running, it
# might be a good idea to enable the ExecStopPost directive below.
#
# Although the node will likely end up being fenced as a result so it's
# not on by default
#
# ExecStopPost=/usr/bin/killall -TERM crmd attrd stonithd cib pengine lrmd

# If you want Corosync to stop whenever Pacemaker is stopped,
# uncomment the next line too:
#
# ExecStopPost=/bin/sh -c 'pidof crmd || killall -TERM corosync'

# Uncomment this for older versions of systemd that didn't support
# TimeoutStopSec
# TimeoutSec=30min

# Pacemaker can only exit after all managed services have shut down
# A HA database could conceivably take even longer than this
TimeoutStopSec=30min
TimeoutStartSec=60s

# Restart options include: no, on-success, on-failure, on-abort or always
Restart=on-failure

# crm_perror() writes directly to stderr, so ignore it here
# to avoid double-logging with the wrong format
StandardError=null

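I have included the contents of the unit file for the Pacemaker service above. Interestingly, running the command "pcs node maintenance $HOSTNAME" produces the same behavior, so I am starting to think you are onto something: it may be the pacemakerd process, rather than systemd, that is causing this.

Any idea how to troubleshoot this?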
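Since the Pacemaker unit declares Requires= and Wants= relationships, one thing that could be checked is whether any dependency links freud.service to pacemaker or corosync, because a stop or start can propagate along such links. This is only a sketch under that assumption and may well turn up nothing: the function name `unit_relations` and the `SYSTEMCTL` override are hypothetical, while `systemctl show` with `-p` is the standard way to print selected unit properties.

```shell
#!/bin/bash
# Hypothetical helper; SYSTEMCTL is overridable only so the sketch can be dry-run.
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

# Show the dependency properties through which stopping or starting one unit
# can propagate to another.
# Usage: unit_relations <unit>
unit_relations() {
    local unit="$1"
    "$SYSTEMCTL" show "$unit" --no-pager \
        -p Requires -p Wants -p BindsTo -p PartOf -p Conflicts \
        -p RequiredBy -p WantedBy -p BoundBy
}
```

Running `unit_relations freud.service` and `unit_relations pacemaker.service` and comparing the two outputs would show whether either unit names the other.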

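More edits:

Further investigation with strace shows that pacemaker is forking and causing the systemd service to be stopped. Or at least that is how a developer explained it to me. Again, any help is appreciated.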

Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: write(2, "+ pcs node maintenance ENG_QA-HA"..., 34+ pcs node maintenance ENG_QA-HA2
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: ) = 34
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: stat(".", {st_mode=S_IFDIR|0555, st_size=4096, ...}) = 0
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: stat("/usr/local/sbin/pcs", 0x7fffec2c6530) = -1 ENOENT (No such file or directory)
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: stat("/usr/local/bin/pcs", 0x7fffec2c6530) = -1 ENOENT (No such file or directory)
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: stat("/usr/sbin/pcs", {st_mode=S_IFREG|0755, st_size=292, ...}) = 0
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: stat("/usr/sbin/pcs", {st_mode=S_IFREG|0755, st_size=292, ...}) = 0
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: geteuid()                               = 0
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: getegid()                               = 0
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: getuid()                                = 0
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: getgid()                                = 0
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: access("/usr/sbin/pcs", X_OK)           = 0
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: stat("/usr/sbin/pcs", {st_mode=S_IFREG|0755, st_size=292, ...}) = 0
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: geteuid()                               = 0
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: getegid()                               = 0
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: getuid()                                = 0
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: getgid()                                = 0
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: access("/usr/sbin/pcs", R_OK)           = 0
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: stat("/usr/sbin/pcs", {st_mode=S_IFREG|0755, st_size=292, ...}) = 0
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: stat("/usr/sbin/pcs", {st_mode=S_IFREG|0755, st_size=292, ...}) = 0
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: geteuid()                               = 0
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: getegid()                               = 0
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: getuid()                                = 0
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: getgid()                                = 0
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: access("/usr/sbin/pcs", X_OK)           = 0
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: stat("/usr/sbin/pcs", {st_mode=S_IFREG|0755, st_size=292, ...}) = 0
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: geteuid()                               = 0
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: getegid()                               = 0
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: getuid()                                = 0
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: getgid()                                = 0
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: access("/usr/sbin/pcs", R_OK)           = 0
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: rt_sigprocmask(SIG_BLOCK, [INT CHLD], [], 8) = 0
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: rt_sigprocmask(SIG_BLOCK, [CHLD], [INT CHLD], 8) = 0
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: rt_sigprocmask(SIG_SETMASK, [INT CHLD], NULL, 8) = 0
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f188e823a10) = 32205
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: rt_sigaction(SIGINT, {0x43e860, [], SA_RESTORER, 0x7f188de513b0}, {SIG_DFL, [], SA_RESTORER, 0x7f188de513b0}, 8) = 0
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 32205
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=32205, si_uid=0, si_status=0, si_utime=11, si_stime=2} ---
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: wait4(-1, 0x7fffec2c5f50, WNOHANG, NULL) = -1 ECHILD (No child processes)
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: rt_sigreturn({mask=[]})                 = 0
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: rt_sigaction(SIGINT, {SIG_DFL, [], SA_RESTORER, 0x7f188de513b0}, {0x43e860, [], SA_RESTORER, 0x7f188de513b0}, 8) = 0
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: write(2, "+ sleep 90\n", 11+ sleep 90
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: )            = 11
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: rt_sigprocmask(SIG_BLOCK, [INT CHLD], [], 8) = 0
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f188e823a10) = 32411
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
Jul 15 10:43:13 ENG_QA-HA2 strace[27232]: rt_sigaction(SIGINT, {0x43e860, [], SA_RESTORER, 0x7f188de513b0}, {SIG_DFL, [], SA_RESTORER, 0x7f188de513b0}, 8) = 0
Jul 15 10:43:26 ENG_QA-HA2 systemd[1]: Stopping Freud SplitBrain Service...

Thanks!

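Answer 1

I worked around this by writing a sentinel file and only stopping pacemaker when a condition is met. When systemd restarts the service, the script checks for that file at startup and, if it finds it, runs a start on pacemaker. There is probably a better way to handle this, but the problem is technically solved and I can move on until I find a better solution. Thanks, everyone, for the help!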
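For anyone who wants to try the same workaround, here is a rough sketch of what the sentinel-file approach could look like. It is not the actual script: the sentinel path, the `SYSTEMCTL` override, and the function names are all hypothetical, chosen only to illustrate the idea of recording "pacemaker was stopped by me" before the stop and clearing it after a successful start.

```shell
#!/bin/bash
# Sketch only: the sentinel path, SYSTEMCTL override and function names
# are hypothetical and not taken from the real script.
SENTINEL="${SENTINEL:-/run/freud.pacemaker-stopped}"
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

# Call once at startup: if a previous instance stopped pacemaker and was
# restarted before it could start it again, finish the job now.
recover_if_needed() {
    if [ -e "$SENTINEL" ]; then
        "$SYSTEMCTL" start pacemaker && rm -f "$SENTINEL"
    fi
}

# Record the intent before stopping, so it survives a restart of this service.
stop_pacemaker() {
    touch "$SENTINEL"
    "$SYSTEMCTL" stop pacemaker
}

# Clear the sentinel only after the start command succeeded.
start_pacemaker() {
    "$SYSTEMCTL" start pacemaker && rm -f "$SENTINEL"
}
```

In a script like freud.sh, `recover_if_needed` would be called before the while loop, and the loop would call `stop_pacemaker` and `start_pacemaker` instead of invoking systemctl directly. The sentinel is created before the stop so that an interruption at any later point still leaves a marker behind.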
