在依赖失败时重新启动 systemd 服务

在依赖失败时重新启动 systemd 服务

如果服务的依赖项之一在启动时失败(但重试后成功),处理重新启动服务的正确方法是什么。

这是一个人为的重现,以使问题更加清晰。

a.服务(模拟第一次尝试失败,第二次尝试成功)

[Unit]
Description=A

[Service]
ExecStartPre=/bin/sh -x -c "[ -f /tmp/success ] || (touch /tmp/success && sleep 10)"
ExecStart=/bin/true
TimeoutStartSec=5
Restart=on-failure
RestartSec=5
RemainAfterExit=yes

b.服务(A开始后就成功了)

[Unit]
Description=B
After=a.service
Requires=a.service

[Service]
ExecStart=/bin/true
RemainAfterExit=yes
Restart=on-failure
RestartSec=5

让我们开始b:

# systemctl start b
A dependency job for b.service failed. See 'journalctl -xe' for details.

日志:

Jun 30 21:34:54 debug systemd[1]: Starting A...
Jun 30 21:34:54 debug sh[1308]: + '[' -f /tmp/success ']'
Jun 30 21:34:54 debug sh[1308]: + touch /tmp/success
Jun 30 21:34:54 debug sh[1308]: + sleep 10
Jun 30 21:34:59 debug systemd[1]: a.service start-pre operation timed out. Terminating.
Jun 30 21:34:59 debug systemd[1]: Failed to start A.
Jun 30 21:34:59 debug systemd[1]: Dependency failed for B.
Jun 30 21:34:59 debug systemd[1]: Job b.service/start failed with result 'dependency'.
Jun 30 21:34:59 debug systemd[1]: Unit a.service entered failed state.
Jun 30 21:34:59 debug systemd[1]: a.service failed.
Jun 30 21:35:04 debug systemd[1]: a.service holdoff time over, scheduling restart.
Jun 30 21:35:04 debug systemd[1]: Starting A...
Jun 30 21:35:04 debug systemd[1]: Started A.
Jun 30 21:35:04 debug sh[1314]: + '[' -f /tmp/success ']'

A 已成功启动,但 B 处于失败状态并且不会重试。

编辑

我向这两个服务添加了以下内容,现在当 A 启动时 B 成功启动,但我无法解释原因。

[Install]
WantedBy=multi-user.target

为什么这会影响A和B之间的关系?

编辑2

上面的“修复”在 systemd 220 中不起作用。

systemd 219 调试日志

systemd219 systemd[1]: Trying to enqueue job b.service/start/replace
systemd219 systemd[1]: Installed new job b.service/start as 3454
systemd219 systemd[1]: Installed new job a.service/start as 3455
systemd219 systemd[1]: Enqueued job b.service/start as 3454
systemd219 systemd[1]: About to execute: /bin/sh -x -c '[ -f /tmp/success ] || (touch oldcoreos
systemd219 systemd[1]: Forked /bin/sh as 1502
systemd219 systemd[1]: a.service changed dead -> start-pre
systemd219 systemd[1]: Starting A...
systemd219 systemd[1502]: Executing: /bin/sh -x -c '[ -f /tmp/success ] || (touch /tmpoldcoreos
systemd219 sh[1502]: + '[' -f /tmp/success ']'
systemd219 sh[1502]: + touch /tmp/success
systemd219 sh[1502]: + sleep 10
systemd219 systemd[1]: a.service start-pre operation timed out. Terminating.
systemd219 systemd[1]: a.service changed start-pre -> final-sigterm
systemd219 systemd[1]: Child 1502 belongs to a.service
systemd219 systemd[1]: a.service: control process exited, code=killed status=15
systemd219 systemd[1]: a.service got final SIGCHLD for state final-sigterm
systemd219 systemd[1]: a.service changed final-sigterm -> failed
systemd219 systemd[1]: Job a.service/start finished, result=failed
systemd219 systemd[1]: Failed to start A.
systemd219 systemd[1]: Job b.service/start finished, result=dependency
systemd219 systemd[1]: Dependency failed for B.
systemd219 systemd[1]: Job b.service/start failed with result 'dependency'.
systemd219 systemd[1]: Unit a.service entered failed state.
systemd219 systemd[1]: a.service failed.
systemd219 systemd[1]: a.service changed failed -> auto-restart
systemd219 systemd[1]: a.service: cgroup is empty
systemd219 systemd[1]: a.service: cgroup is empty
systemd219 systemd[1]: a.service holdoff time over, scheduling restart.
systemd219 systemd[1]: Trying to enqueue job a.service/restart/fail
systemd219 systemd[1]: Installed new job a.service/restart as 3718
systemd219 systemd[1]: Installed new job b.service/restart as 3803
systemd219 systemd[1]: Enqueued job a.service/restart as 3718
systemd219 systemd[1]: a.service scheduled restart job.
systemd219 systemd[1]: Job b.service/restart finished, result=done
systemd219 systemd[1]: Converting job b.service/restart -> b.service/start
systemd219 systemd[1]: a.service changed auto-restart -> dead
systemd219 systemd[1]: Job a.service/restart finished, result=done
systemd219 systemd[1]: Converting job a.service/restart -> a.service/start
systemd219 systemd[1]: About to execute: /bin/sh -x -c '[ -f /tmp/success ] || (touch oldcoreos
systemd219 systemd[1]: Forked /bin/sh as 1558
systemd219 systemd[1]: a.service changed dead -> start-pre
systemd219 systemd[1]: Starting A...
systemd219 systemd[1]: Child 1558 belongs to a.service
systemd219 systemd[1]: a.service: control process exited, code=exited status=0
systemd219 systemd[1]: a.service got final SIGCHLD for state start-pre
systemd219 systemd[1]: About to execute: /bin/true
systemd219 systemd[1]: Forked /bin/true as 1561
systemd219 systemd[1]: a.service changed start-pre -> running
systemd219 systemd[1]: Job a.service/start finished, result=done
systemd219 systemd[1]: Started A.
systemd219 systemd[1]: Child 1561 belongs to a.service
systemd219 systemd[1]: a.service: main process exited, code=exited, status=0/SUCCESS
systemd219 systemd[1]: a.service changed running -> exited
systemd219 systemd[1]: a.service: cgroup is empty
systemd219 systemd[1]: About to execute: /bin/true
systemd219 systemd[1]: Forked /bin/true as 1563
systemd219 systemd[1]: b.service changed dead -> running
systemd219 systemd[1]: Job b.service/start finished, result=done
systemd219 systemd[1]: Started B.
systemd219 systemd[1]: Starting B...
systemd219 systemd[1]: Child 1563 belongs to b.service
systemd219 systemd[1]: b.service: main process exited, code=exited, status=0/SUCCESS
systemd219 systemd[1]: b.service changed running -> exited
systemd219 systemd[1]: b.service: cgroup is empty
systemd219 sh[1558]: + '[' -f /tmp/success ']'

systemd 220 调试日志

systemd220 systemd[1]: b.service: Trying to enqueue job b.service/start/replace
systemd220 systemd[1]: a.service: Installed new job a.service/start as 4846
systemd220 systemd[1]: b.service: Installed new job b.service/start as 4761
systemd220 systemd[1]: b.service: Enqueued job b.service/start as 4761
systemd220 systemd[1]: a.service: About to execute: /bin/sh -x -c '[ -f /tmp/success ] || (touch /tmp/success && sleep 10)'
systemd220 systemd[1]: a.service: Forked /bin/sh as 2032
systemd220 systemd[1]: a.service: Changed dead -> start-pre
systemd220 systemd[1]: Starting A...
systemd220 systemd[2032]: a.service: Executing: /bin/sh -x -c '[ -f /tmp/success ] || (touch /tmp/success && sleep 10)'
systemd220 sh[2032]: + '[' -f /tmp/success ']'
systemd220 sh[2032]: + touch /tmp/success
systemd220 sh[2032]: + sleep 10
systemd220 systemd[1]: a.service: Start-pre operation timed out. Terminating.
systemd220 systemd[1]: a.service: Changed start-pre -> final-sigterm
systemd220 systemd[1]: a.service: Child 2032 belongs to a.service
systemd220 systemd[1]: a.service: Control process exited, code=killed status=15
systemd220 systemd[1]: a.service: Got final SIGCHLD for state final-sigterm.
systemd220 systemd[1]: a.service: Changed final-sigterm -> failed
systemd220 systemd[1]: a.service: Job a.service/start finished, result=failed
systemd220 systemd[1]: Failed to start A.
systemd220 systemd[1]: b.service: Job b.service/start finished, result=dependency
systemd220 systemd[1]: Dependency failed for B.
systemd220 systemd[1]: b.service: Job b.service/start failed with result 'dependency'.
systemd220 systemd[1]: a.service: Unit entered failed state.
systemd220 systemd[1]: a.service: Failed with result 'timeout'.
systemd220 systemd[1]: a.service: Changed failed -> auto-restart
systemd220 systemd[1]: a.service: cgroup is empty
systemd220 systemd[1]: a.service: Failed to send unit change signal for a.service: Transport endpoint is not connected
systemd220 systemd[1]: a.service: Service hold-off time over, scheduling restart.
systemd220 systemd[1]: a.service: Trying to enqueue job a.service/restart/fail
systemd220 systemd[1]: a.service: Installed new job a.service/restart as 5190
systemd220 systemd[1]: a.service: Enqueued job a.service/restart as 5190
systemd220 systemd[1]: a.service: Scheduled restart job.
systemd220 systemd[1]: a.service: Changed auto-restart -> dead
systemd220 systemd[1]: a.service: Job a.service/restart finished, result=done
systemd220 systemd[1]: a.service: Converting job a.service/restart -> a.service/start
systemd220 systemd[1]: a.service: About to execute: /bin/sh -x -c '[ -f /tmp/success ] || (touch /tmp/success && sleep 10)'
systemd220 systemd[1]: a.service: Forked /bin/sh as 2132
systemd220 systemd[1]: a.service: Changed dead -> start-pre
systemd220 systemd[1]: Starting A...
systemd220 systemd[1]: a.service: Child 2132 belongs to a.service
systemd220 systemd[1]: a.service: Control process exited, code=exited status=0
systemd220 systemd[1]: a.service: Got final SIGCHLD for state start-pre.
systemd220 systemd[1]: a.service: About to execute: /bin/true
systemd220 systemd[1]: a.service: Forked /bin/true as 2136
systemd220 systemd[1]: a.service: Changed start-pre -> running
systemd220 systemd[1]: a.service: Job a.service/start finished, result=done
systemd220 systemd[1]: Started A.
systemd220 systemd[1]: a.service: Child 2136 belongs to a.service
systemd220 systemd[1]: a.service: Main process exited, code=exited, status=0/SUCCESS
systemd220 systemd[1]: a.service: Changed running -> exited
systemd220 systemd[1]: a.service: cgroup is empty
systemd220 systemd[1]: a.service: cgroup is empty
systemd220 systemd[1]: a.service: cgroup is empty
systemd220 systemd[1]: a.service: cgroup is empty
systemd220 sh[2132]: + '[' -f /tmp/success ']'

答案1

我将尝试总结我对此问题的发现,以防有人遇到此问题,因为有关此主题的信息很少。

  • Restart=on-failure仅适用于进程失败(不适用于由于依赖失败而导致的失败)
  • 当依赖项成功重启时,依赖失败的单元会在某些条件下重启,这是 systemd <220 中的一个错误: http://lists.freedesktop.org/archives/systemd-devel/2015-July/033513.html
  • 如果依赖项在启动时失败的可能性很小,并且您关心弹性,请不要使用Before/After而是对依赖项产生的某些工件执行检查

例如

ExecStartPre=/usr/bin/test -f /some/thing
Restart=on-failure
RestartSec=5s

你甚至可以使用systemctl is-active <dependecy>.

非常hacky,但我还没有找到更好的选择。

在我看来,没有办法处理依赖失败是 systemd 的一个缺陷。

答案2

这似乎是一种可以很容易地编写脚本并放入 cronjob 的东西。基本逻辑是这样的

  1. 检查服务 a 和 b 以及依赖项是否正在运行/处于有效状态。您将知道检查一切是否正常工作的最佳方法
  2. 如果一切正常,则不执行任何操作或记录一切正常。日志记录的优点是允许您查找以前的日志条目。
  3. 如果出现问题,请重新启动服务并跳回执行服务和依赖项状态检查的脚本开头。仅当您有信心重新启动服务且依赖项很有可能正常工作时,才应发生跳转,否则可能会出现循环。
  4. 让 cron 稍后再次运行该脚本

一旦设置了脚本,cron 就是测试它的好地方,如果 cron 效率低下,那么脚本将是尝试编写低级系统服务的良好起点,该系统服务可以检查某些其他服务的状态并根据需要重新启动它们。根据您想要投入的精力,脚本甚至可以设置为根据结果向您发送电子邮件(当然,除非所涉及的服务是网络服务)。

答案3

After并且Before仅设置服务启动的顺序,您的服务文件显示“如果 A 和 B 将启动,则 A 必须在 B 之前启动”。

Requires意味着如果要启动此服务,则必须首先启动该服务,在您的示例中“如果 B 已启动且 A 未运行,则启动 A”

当您添加时,WantedBy=multi-user.target您现在告诉系统必须在系统初始化时启动服务multi-user.target,大概这意味着一旦添加它,您就让系统启动服务而不是手动启动它们?

我不确定为什么这在版本 220 中不起作用,可能值得尝试 222。当我有机会时,我会挖出一个虚拟机并尝试您的服务。

答案4

https://github.com/systemd/systemd/issues/1312#issuecomment-1139234399

出于兴趣,是否会Upholds=在提交中引入新的0bc488cPR #19322 的内容,对这里有帮助吗?

具体来说,假设b.service配置有Wants=a.serviceAfter=a.serviceRestart=on-failure。然后a.service在启动时超时,因此b.service也会失败并显示b.service/start failed with result 'dependency'。接下来,a.service重新启动后成功,但b.service从未启动,因为显然,依赖项失败不被模式视为触发器Restart=on-failure

(这正是已经链接的 Stack Exchange 中报告的情况问题其中一个答案评论也链接回此问题。)

遇到这个长期存在的问题后,我想知道Upholds=如果指定该指令a.service并配置了Upholds=b.service? ,该指令是否会产生影响?当最终出现时,这会导致b.service启动吗?a.service

如果你可以将 systemd 升级到v254试用 RestartMode=direct:https://www.freedesktop.org/software/systemd/man/latest/systemd.service.html#RestartMode=

相关内容