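I have just set up Slurm on a single physical machine, which will be the only system in the cluster (for now). It runs Ubuntu 18.04.
slurmdbd is running, but when I try to start slurmd and slurmctld, both time out. Why?
I am issuing the following commands: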
systemctl start slurmctld
systemctl start slurmd
I have also tried:
systemctl start slurmctld slurmd
and:
systemctl start slurmd slurmctld
For slurmctld, this fails with the following messages:
systemd[1]: slurmd.service: Can't open PID file /var/run/slurm-llnl/slurm-llnl/slurmd.pid (yet?) after start: No such file or directory
systemd[1]: slurmctld.service: Start operation timed out. Terminating.
systemd[1]: slurmctld.service: Failed with result 'timeout'.
systemd[1]: Failed to start Slurm controller daemon.
For slurmd:
systemd[1]: slurmd.service: Start operation timed out. Terminating.
systemd[1]: slurmd.service: Failed with result 'timeout'.
systemd[1]: Failed to start Slurm node daemon.
However, when I start them manually (in two separate terminals) with:
slurmctld -Dvvv
slurmd -Dvvv
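everything seems to work fine.
Why is that? How should I start Slurm?
These are the service files (they should be the stock ones; the only change I made was adding verbosity arguments, which I have since removed):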
# cat /lib/systemd/system/slurmd.service
[Unit]
Description=Slurm node daemon
After=network.target munge.service
ConditionPathExists=/etc/slurm-llnl/slurm.conf
Documentation=man:slurmd(8)
[Service]
Type=forking
EnvironmentFile=-/etc/default/slurmd
ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
PIDFile=/var/run/slurm-llnl/slurmd.pid
KillMode=process
LimitNOFILE=51200
LimitMEMLOCK=infinity
LimitSTACK=infinity
[Install]
WantedBy=multi-user.target
# cat /lib/systemd/system/slurmctld.service
[Unit]
Description=Slurm controller daemon
After=network.target munge.service
ConditionPathExists=/etc/slurm-llnl/slurm.conf
Documentation=man:slurmctld(8)
[Service]
Type=forking
EnvironmentFile=-/etc/default/slurmctld
ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
PIDFile=/var/run/slurm-llnl/slurmctld.pid
[Install]
WantedBy=multi-user.target
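Answer 1
Look closely at your log:
Can't open PID file /var/run/slurm-llnl/slurm-llnl/slurmd.pid
This path does not match the PIDFile declared in /lib/systemd/system/slurmd.service. Because the unit uses Type=forking, systemd waits for the PID file to appear at the declared path and gives up with a timeout when it never does; running the daemons in the foreground with -D bypasses this check, which is why the manual start works. To fix it, correct the SlurmdPidFile field in /etc/slurm-llnl/slurm.conf so that it matches the unit file. The same applies to SlurmctldPidFile.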
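As an illustration, the two slurm.conf entries would look roughly like this once they agree with the PIDFile= lines in the unit files quoted in the question. The slurm.conf itself is not shown in the question, so treat the exact values as an assumption and adjust them to whatever your own unit files declare:

```
# /etc/slurm-llnl/slurm.conf (fragment)
# Must match PIDFile= in /lib/systemd/system/slurmctld.service
SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
# Must match PIDFile= in /lib/systemd/system/slurmd.service
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
```

After editing the file, start the services again with systemctl start slurmctld slurmd.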
Also note that /var/run/slurmd.pid, the default offered by the easy configurator (/usr/share/doc/slurm-wlm-doc/html/configurator.easy.html), will fail as well.
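To check for this kind of mismatch before starting the services, a small shell sketch along these lines may help. The helper names get_setting and check_pidfile are hypothetical (they are not part of Slurm or systemd), and the parsing is deliberately simple: it assumes one KEY=value per line, exact key capitalisation, and no trailing comments.

```shell
#!/bin/sh
# Print the value of a KEY=value setting from a file (last match wins).
# Hypothetical helper; assumes the key is spelled exactly as given.
get_setting() {
    # $1 = key name, $2 = file to search
    sed -n "s/^[[:space:]]*$1[[:space:]]*=[[:space:]]*//p" "$2" | tail -n 1
}

# Compare the PID file systemd waits for with the one Slurm is told to write.
check_pidfile() {
    # $1 = systemd unit file, $2 = slurm.conf, $3 = slurm.conf key
    unit_pid=$(get_setting PIDFile "$1")
    conf_pid=$(get_setting "$3" "$2")
    if [ -n "$unit_pid" ] && [ "$unit_pid" = "$conf_pid" ]; then
        echo "OK: $3 matches $unit_pid"
    else
        echo "MISMATCH: unit expects '$unit_pid', slurm.conf has '$conf_pid'"
        return 1
    fi
}

# Example calls, using the paths from the question:
#   check_pidfile /lib/systemd/system/slurmd.service \
#       /etc/slurm-llnl/slurm.conf SlurmdPidFile
#   check_pidfile /lib/systemd/system/slurmctld.service \
#       /etc/slurm-llnl/slurm.conf SlurmctldPidFile
```

A MISMATCH line for either daemon points at the slurm.conf entry that needs correcting.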