Why won't Slurm start via systemd when it runs fine if started manually?

I have just set up Slurm; a single physical machine will be the only system in the cluster (so far). This is on Ubuntu 18.04.

slurmdbd is running, but when I try to start slurmd and slurmctld, I get a timeout. Why?

I am issuing the following commands:

systemctl start slurmctld
systemctl start slurmd

I have also tried:

systemctl start slurmctld slurmd

and:

systemctl start slurmd slurmctld

For slurmctld, this fails with the following messages:

systemd[1]: slurmd.service: Can't open PID file /var/run/slurm-llnl/slurm-llnl/slurmd.pid (yet?) after start: No such file or directory
systemd[1]: slurmctld.service: Start operation timed out. Terminating.
systemd[1]: slurmctld.service: Failed with result 'timeout'.
systemd[1]: Failed to start Slurm controller daemon.

For slurmd:

systemd[1]: slurmd.service: Start operation timed out. Terminating.
systemd[1]: slurmd.service: Failed with result 'timeout'.
systemd[1]: Failed to start Slurm node daemon.

However, when I start them manually (using two terminals) by issuing the following commands:

slurmctld -Dvvv
slurmd -Dvvv

everything seems to work.

Why is that? How can I start Slurm?

These are the service files (they should be standard; I did not touch them except to add verbosity arguments, which I later removed again):

# cat /lib/systemd/system/slurmd.service 
[Unit]
Description=Slurm node daemon
After=network.target munge.service
ConditionPathExists=/etc/slurm-llnl/slurm.conf
Documentation=man:slurmd(8)

[Service]
Type=forking
EnvironmentFile=-/etc/default/slurmd
ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
PIDFile=/var/run/slurm-llnl/slurmd.pid
KillMode=process
LimitNOFILE=51200
LimitMEMLOCK=infinity
LimitSTACK=infinity

[Install]
WantedBy=multi-user.target

# cat /lib/systemd/system/slurmctld.service
[Unit]
Description=Slurm controller daemon
After=network.target munge.service
ConditionPathExists=/etc/slurm-llnl/slurm.conf
Documentation=man:slurmctld(8)

[Service]
Type=forking
EnvironmentFile=-/etc/default/slurmctld
ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
PIDFile=/var/run/slurm-llnl/slurmctld.pid

[Install]
WantedBy=multi-user.target

Answer 1

Look carefully at your log:

Can't open PID file /var/run/slurm-llnl/slurm-llnl/slurmd.pid

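This path does not match the one declared in /lib/systemd/system/slurmd.service. Because the unit uses Type=forking, systemd waits for the PID file named in PIDFile= to appear; slurmd writes its PID file to the path configured in slurm.conf instead, so the start operation times out. To fix this, correct the SlurmdPidFile field in /etc/slurm-llnl/slurm.conf. The same applies to SlurmctldPidFile.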
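One way to spot such a mismatch is to print both paths side by side. The following is only a minimal sketch: the helper names unit_pidfile, conf_pidfile and check_pidfile are made up for illustration, and the script assumes that slurm.conf spells the keys exactly as SlurmdPidFile and SlurmctldPidFile, with no spaces around "=" and no trailing comment on the line.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Slurm or systemd) that compare the
# PID file path a systemd unit expects with the one set in slurm.conf.

# Print the value of the PIDFile= line in a unit file.
unit_pidfile() {
    sed -n 's/^PIDFile=//p' "$1"
}

# Print the value of a given key (e.g. SlurmdPidFile) in slurm.conf.
# Assumption: the key is written with exactly this capitalisation.
conf_pidfile() {
    sed -n "s/^$1=//p" "$2"
}

# Report whether the two paths agree; returns non-zero on a mismatch
# (an unset key in slurm.conf also counts as a mismatch here).
check_pidfile() {
    key=$1
    unit=$2
    conf=$3
    want=$(unit_pidfile "$unit")
    have=$(conf_pidfile "$key" "$conf")
    if [ "$want" = "$have" ]; then
        echo "$key OK: $have"
    else
        echo "$key MISMATCH: unit expects '$want', slurm.conf has '$have'"
        return 1
    fi
}

# Usage with the paths from the question:
# check_pidfile SlurmdPidFile    /lib/systemd/system/slurmd.service    /etc/slurm-llnl/slurm.conf
# check_pidfile SlurmctldPidFile /lib/systemd/system/slurmctld.service /etc/slurm-llnl/slurm.conf
```

With the unit files shown in the question, both checks should pass once slurm.conf sets SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid and SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid.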

Also note that the default /var/run/slurmd.pid suggested by the easy configurator at /usr/share/doc/slurm-wlm-doc/html/configurator.easy.html will fail too.
