Service OnFailure is triggered only after the burst limit is reached

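I use a systemd unit file to control a Python process running on a server (systemd v247).

This process must be restarted 60 seconds after it exits, whether it failed or succeeded, unless it has failed 5 times within 600 seconds.

The unit file links to another service that sends an email notification on failure.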

/etc/systemd/system/python-test.service

[Unit]
After=network.target
OnFailure=mailer@%n.service

[Service]
Type=simple

ExecStart=/home/debian/tmp.py

# Any exit status different than 0 is considered as an error
SuccessExitStatus=0

StandardOutput=append:/var/log/python-test.log
StandardError=append:/var/log/python-test.log

# Always restart service 60sec after exit
Restart=always
RestartSec=60

# Stop restarting service after 5 consecutive fail in 600sec interval
StartLimitInterval=600
StartLimitBurst=5

[Install]
WantedBy=multi-user.target

/etc/systemd/system/mailer@.service

[Unit]
After=network.target

[Service]
Type=oneshot

ExecStart=/home/debian/mailer.py --to "[email protected]" --subject "Systemd service %I failed" --message "A systemd service failed %I on %H"

[Install]
WantedBy=multi-user.target

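During basic testing, OnFailure was triggered as expected. However, once I added the following lines to the unit file, OnFailure is triggered only after 5 consecutive failures.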

StartLimitInterval=600
StartLimitBurst=5

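This is not the behavior I want: I would like to be notified every time the process fails, even if the burst limit has not been reached yet.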


When checking the process status, the output is different when the burst limit has not been reached:

● python-test.service
     Loaded: loaded (/etc/systemd/system/python-test.service; disabled; vendor preset: enabled)
     Active: activating (auto-restart) (Result: exit-code) since Thu 2022-12-22 19:51:23 UTC; 2s ago
    Process: 1421600 ExecStart=/home/debian/tmp.py (code=exited, status=1/FAILURE)
   Main PID: 1421600 (code=exited, status=1/FAILURE)
        CPU: 31ms

Dec 22 19:51:23 test-vps systemd[1]: python-test.service: Failed with result 'exit-code'.

than when it has been reached:

● python-test.service
     Loaded: loaded (/etc/systemd/system/python-test.service; disabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Thu 2022-12-22 19:52:02 UTC; 24s ago
    Process: 1421609 ExecStart=/home/debian/tmp.py (code=exited, status=1/FAILURE)
   Main PID: 1421609 (code=exited, status=1/FAILURE)
        CPU: 31ms

Dec 22 19:51:56 test-vps systemd[1]: python-test.service: Failed with result 'exit-code'.
Dec 22 19:52:02 test-vps systemd[1]: python-test.service: Scheduled restart job, restart counter is at 5.
Dec 22 19:52:02 test-vps systemd[1]: Stopped python-test.service.
Dec 22 19:52:02 test-vps systemd[1]: python-test.service: Start request repeated too quickly.
Dec 22 19:52:02 test-vps systemd[1]: python-test.service: Failed with result 'exit-code'.
Dec 22 19:52:02 test-vps systemd[1]: Failed to start python-test.service.
Dec 22 19:52:02 test-vps systemd[1]: python-test.service: Triggering OnFailure= dependencies.

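I could not find anything that explains how to change when OnFailure is triggered from within the unit file.

Is there a way to send a mail notification every time the process fails while still keeping the burst limit?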

Answer 1

To get the behavior you want from the systemd service, you should do several things (the changes go in /etc/systemd/system/python-test.service).

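  1. Change Restart=always to Restart=on-failure.
  2. StartLimitInterval=600 and StartLimitBurst=5 still seem to be supported, but you should put them in the [Unit] section. Once StartLimitInterval is in [Unit], rename it to StartLimitIntervalSec (man systemd.unit uses StartLimitIntervalSec).
  3. Add RemainAfterExit=no to the [Service] section.
  4. Add this line to the [Service] section: TimeoutStopSec=infinity
  5. Use the EXIT_STATUS environment variable in the script to determine whether the process exited successfully.
  6. Change OnFailure=mailer@%n.service to OnFailure=mailer@%N.service. The difference between the two is that %N removes the unit type suffix.
  7. Install and start the atd service (sudo systemctl start atd.service) so that you can use the at command. Alternatively, if you do not want to use at, you can write another systemd service that restarts the main service (in this example, I used relaunch.service).
  8. Use the same value for sleep as for RestartSec. In your case, since RestartSec is 60, the sleep in this line must also be 60:
 echo "sleep 60; sudo systemctl start ${1}.service" | at now
  9. Use ExecStopPost= to get the exit status of your main process, /home/debian/tmp.py. Do not use ExecStop=; from man systemd.service:

ExecStop=

Note that the commands specified in ExecStop= are only executed when the service started successfully first. They are not invoked if the service was never started at all, or in case its start-up failed, for example because any of the commands specified in ExecStart=, ExecStartPre= or ExecStartPost= failed (and weren't prefixed with "-", see above) or timed out. Use ExecStopPost= to invoke commands when a service failed to start up correctly and is shut down again.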
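On the %n versus %N point in the list above: %n expands to the full unit name (python-test.service), while %N drops the type suffix (python-test). The sketch below only mimics that with shell parameter expansion so you can see which unit names the script ends up building; strip_unit_suffix is a hypothetical helper for illustration, not something systemd provides.

```shell
#!/bin/bash
# Hypothetical helper: mimic what %N passes to the script, i.e. the unit
# name with its type suffix (".service", ".timer", ...) removed.
strip_unit_suffix() {
    echo "${1%.*}"
}

name=$(strip_unit_suffix "python-test.service")

# These are the two unit names the checkSuccess script builds from ${1}.
echo "main unit:   ${name}.service"
echo "mailer unit: mailer@${name}.service"
```

Passing the suffix-free name lets the same argument be reused for both the main unit and the mailer@ instance.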


The service file /etc/systemd/system/python-test.service should look like this:

[Unit]
After=network.target
OnFailure=mailer@%N.service

StartLimitBurst=5
StartLimitIntervalSec=600

[Service]
Type=simple
TimeoutStopSec=infinity
ExecStart=/home/debian/tmp.py
ExecStopPost=/bin/bash -c 'echo The Service %n has exited with values: $$EXIT_STATUS,$$SERVICE_RESULT,$$EXIT_CODE'
ExecStopPost=/home/debian/bin/checkSuccess "%N"
# Any exit status different than 0 is considered as an error
SuccessExitStatus=0
StandardOutput=append:/tmp/python-out-test.log
StandardError=append:/tmp/python-err-test.log
# Restart service 60sec after a failed exit (clean exits are relaunched by checkSuccess)
Restart=on-failure
RestartSec=60
RemainAfterExit=no

[Install]
WantedBy=multi-user.target
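To sanity-check the numbers in this unit (StartLimitBurst=5, StartLimitIntervalSec=600, RestartSec=60), here is a small model of a start rate limit. It is a sketch under assumptions: the window opens at the first start and allows at most "burst" starts until "interval" seconds have passed, and the run time of the process itself is ignored. simulate_starts is a hypothetical helper and is not how systemd is implemented internally.

```shell
#!/bin/bash
# Simplified model of a start rate limit (illustration only).
# Usage: simulate_starts BURST INTERVAL START_TIME...
simulate_starts() {
    local burst="$1" interval="$2"
    shift 2
    local begin=-1 count=0 t
    for t in "$@"; do
        if [ "$begin" -lt 0 ] || [ $((t - begin)) -gt "$interval" ]; then
            # No window yet, or the old one has expired: open a new one.
            begin=$t
            count=1
            echo "$t allowed"
        elif [ "$count" -lt "$burst" ]; then
            count=$((count + 1))
            echo "$t allowed"
        else
            echo "$t refused"
        fi
    done
}

# Starts every 60 seconds, as with RestartSec=60 and an instantly failing process.
simulate_starts 5 600 0 60 120 180 240 300
```

In this model five starts fit into the 600 second window and the sixth is refused, which is consistent with the "restart counter is at 5" and "Start request repeated too quickly" lines in the question's log.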

/home/debian/bin/checkSuccess should contain the following:

Solution 1: using the at command:

#!/bin/bash

if [ "$EXIT_STATUS" -eq 0 ]
then
   echo "sleep 60; sudo systemctl start ${1}.service" | at now
   exit 0
else
   systemctl start "mailer@${1}.service"
   exit 0
fi

Solution 2: using another systemd service:

#!/bin/bash

if [ "$EXIT_STATUS" -eq 0 ]
then
   systemctl start relaunch.service
else
   systemctl start "mailer@${1}.service"
fi
exit 0

And relaunch.service should contain:

[Unit]
Description=Relaunch Python Test Service

[Service]
Type=simple
RemainAfterExit=no
ExecStart=/bin/bash -c 'echo Delay; sleep 60 ; systemctl start python-test.service'

The "$EXIT_STATUS" variable is set by systemd and holds the exit status of /home/debian/tmp.py.

${1} stands for the unit name, python-test; it is passed to the script by the line /home/debian/bin/checkSuccess "%N"
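One caveat with the numeric test on "$EXIT_STATUS": according to man systemd.exec, the variable holds a signal name (for example TERM) instead of a number when the process was killed by a signal, and in that case [ "$EXIT_STATUS" -eq 0 ] reports an error and falls through to the else branch. A possible alternative is to look at $SERVICE_RESULT and $EXIT_CODE as well. The following is only a sketch: decide_action and its three action names are hypothetical, and it assumes that a service stopped on request shows up with SERVICE_RESULT=success and EXIT_CODE=killed.

```shell
#!/bin/bash
# Hypothetical helper: decide what an ExecStopPost= hook should do, based on
# the variables systemd exports to it. The action names are made up.
decide_action() {
    local result="${SERVICE_RESULT:-}" code="${EXIT_CODE:-}" status="${EXIT_STATUS:-}"
    if [ "$result" != "success" ]; then
        echo "notify"      # the service failed: start the mailer unit
    elif [ "$code" = "exited" ] && [ "$status" = "0" ]; then
        echo "relaunch"    # clean exit: schedule the delayed start
    else
        echo "nothing"     # e.g. ended by a signal without counting as failed
    fi
}

# Canned examples; inside a real hook the variables are already set.
SERVICE_RESULT=exit-code EXIT_CODE=exited EXIT_STATUS=1    decide_action
SERVICE_RESULT=success   EXIT_CODE=exited EXIT_STATUS=0    decide_action
SERVICE_RESULT=success   EXIT_CODE=killed EXIT_STATUS=TERM decide_action
```

Comparing "$EXIT_STATUS" as a string avoids the "integer expression expected" error when it contains a signal name.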


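Notes:

  1. The output of 'echo The Service %n has exited with values: $$EXIT_STATUS,$$SERVICE_RESULT,$$EXIT_CODE' goes to the log, which you can follow in real time with:
tail -f /tmp/python-out-test.log
  2. If you use solution 2 (with relaunch.service), then when you want to stop the main service you should run: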
sudo systemctl stop relaunch.service
#Might not be necessary but you stop python service too:
# sudo systemctl stop python-test.service
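If you want a script to tell apart the two states shown in the question (waiting for an automatic restart versus failed after the start limit), one option is to parse the Key=Value lines printed by systemctl show. This is a sketch under that assumption about the output format; classify_unit and its labels are hypothetical names.

```shell
#!/bin/bash
# Hypothetical helper: read "Key=Value" lines (as printed by `systemctl show`)
# on stdin and print a coarse label for the unit state.
classify_unit() {
    local active="" sub="" key value
    while IFS='=' read -r key value; do
        case "$key" in
            ActiveState) active=$value ;;
            SubState)    sub=$value ;;
        esac
    done
    if [ "$active" = "failed" ]; then
        echo "failed"
    elif [ "$active" = "activating" ] && [ "$sub" = "auto-restart" ]; then
        echo "waiting-for-restart"
    else
        echo "other"
    fi
}

# Canned input for illustration; on a real host you would pipe in:
#   systemctl show -p ActiveState -p SubState python-test.service
printf 'ActiveState=activating\nSubState=auto-restart\n' | classify_unit
```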
