I am trying to install Slurm on an Ubuntu PC, so I followed the instructions given here.
I did the following:
sudo apt update -y
sudo apt install slurmd slurmctld -y
sudo mkdir /etc/slurm-llnl
FYI, I came up with step 3 on my own.
sudo chmod 777 /etc/slurm-llnl
sudo cat << EOF > /etc/slurm-llnl/slurm.conf
ClusterName=localcluster
SlurmctldHost=localhost
MpiDefault=none
ProctrackType=proctrack/linuxproc
ReturnToService=2
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
SlurmUser=slurm
StateSaveLocation=/var/lib/slurm-llnl/slurmctld
SwitchType=switch/none
TaskPlugin=task/none
#
# TIMERS
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
# SCHEDULING
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
#
#AccountingStoragePort=
AccountingStorageType=accounting_storage/none
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
#
# COMPUTE NODES
NodeName=localhost CPUs=12 RealMemory=8000 State=UNKNOWN
PartitionName=LocalQ Nodes=ALL Default=YES MaxTime=INFINITE State=UP
EOF
sudo systemctl start slurmctld
sudo systemctl start slurmd
Now, when I run:
sudo scontrol update nodename=localhost state=idle
I get this error:
scontrol: error: resolve_ctls_from_dns_srv: res_nsearch error: Unknown host
scontrol: error: fetch_config: DNS SRV lookup failed
scontrol: error: _establish_config_source: failed to fetch config
scontrol: fatal: Could not establish a configuration source
Edit 1:
I followed Paul's instructions. Now I get the following output:
(base) thoma@thoma-Lenovo-Legion-5-15IMH05H:/$ systemctl status slurmctld
● slurmctld.service - Slurm controller daemon
Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled; vendor preset: enabled)
Active: active (running) since Tue 2024-03-05 05:57:17 CST; 2h 42min ago
Docs: man:slurmctld(8)
Main PID: 6509 (slurmctld)
Tasks: 10
Memory: 4.3M
CPU: 2.378s
CGroup: /system.slice/slurmctld.service
├─6509 /usr/sbin/slurmctld -D -s
└─6517 "slurmctld: slurmscriptd" "" ""
Mar 05 05:58:27 thoma-Lenovo-Legion-5-15IMH05H slurmctld[6509]: slurmctld: Invalid node state transition requested for node localhost from=INVAL to=IDLE
Mar 05 05:58:27 thoma-Lenovo-Legion-5-15IMH05H slurmctld[6509]: slurmctld: _slurm_rpc_update_node for localhost: Invalid node state specified
Mar 05 06:00:07 thoma-Lenovo-Legion-5-15IMH05H slurmctld[6509]: slurmctld: Invalid node state transition requested for node localhost from=INVAL to=IDLE
Mar 05 06:00:07 thoma-Lenovo-Legion-5-15IMH05H slurmctld[6509]: slurmctld: _slurm_rpc_update_node for localhost: Invalid node state specified
Mar 05 06:01:30 thoma-Lenovo-Legion-5-15IMH05H slurmctld[6509]: slurmctld: Invalid node state transition requested for node localhost from=INVAL to=RESUME
Mar 05 06:01:30 thoma-Lenovo-Legion-5-15IMH05H slurmctld[6509]: slurmctld: _slurm_rpc_update_node for localhost: Invalid node state specified
Mar 05 06:02:13 thoma-Lenovo-Legion-5-15IMH05H slurmctld[6509]: slurmctld: Invalid node state transition requested for node localhost from=INVAL to=RESUME
Mar 05 06:02:13 thoma-Lenovo-Legion-5-15IMH05H slurmctld[6509]: slurmctld: _slurm_rpc_update_node for localhost: Invalid node state specified
Mar 05 06:02:20 thoma-Lenovo-Legion-5-15IMH05H slurmctld[6509]: slurmctld: Invalid node state transition requested for node localhost from=INVAL to=IDLE
Mar 05 06:02:20 thoma-Lenovo-Legion-5-15IMH05H slurmctld[6509]: slurmctld: _slurm_rpc_update_node for localhost: Invalid node state specified
(base) thoma@thoma-Lenovo-Legion-5-15IMH05H:/$ systemctl status slurmd
● slurmd.service - Slurm node daemon
Loaded: loaded (/lib/systemd/system/slurmd.service; enabled; vendor preset: enabled)
Active: active (running) since Tue 2024-03-05 05:57:17 CST; 2h 42min ago
Docs: man:slurmd(8)
Main PID: 6514 (slurmd)
Tasks: 1
Memory: 316.0K
CPU: 22ms
CGroup: /system.slice/slurmd.service
└─6514 /usr/sbin/slurmd -D -s
Mar 05 05:57:17 thoma-Lenovo-Legion-5-15IMH05H systemd[1]: Started Slurm node daemon.
Mar 05 05:57:17 thoma-Lenovo-Legion-5-15IMH05H slurmd[6514]: slurmd: error: Node configuration differs from hardware: CPUs=12:12(hw) Boards=1:1(hw) SocketsPerBoard=12:1(hw) CoresPerSocket=1:6(hw) ThreadsPerCore>
Mar 05 05:57:17 thoma-Lenovo-Legion-5-15IMH05H slurmd[6514]: slurmd: slurmd version 21.08.5 started
Mar 05 05:57:17 thoma-Lenovo-Legion-5-15IMH05H slurmd[6514]: slurmd: slurmd started on Tue, 05 Mar 2024 05:57:17 -0600
Mar 05 05:57:17 thoma-Lenovo-Legion-5-15IMH05H slurmd[6514]: slurmd: CPUs=12 Boards=1 Sockets=12 Cores=1 Threads=1 Memory=7838 TmpDisk=1252975 Uptime=372 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(>
Answer 1
Did you also start the munge service? Make sure to start it with systemctl as well, as follows.
sudo systemctl start munge
sudo systemctl status munge
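To confirm in one go that munge and both Slurm daemons are actually running, a sketch like the following may help. It relies only on `systemctl is-active`; the `check_units` name and the `SYSTEMCTL` override are illustrative choices of mine, not part of Slurm or systemd.

```shell
# Print whether each named systemd unit is active; succeed only if all are.
# SYSTEMCTL is a hypothetical override (not a Slurm or systemd setting) that
# lets the function be exercised on a machine without systemd.
check_units() (
    ctl="${SYSTEMCTL:-systemctl}"
    rc=0
    for unit in "$@"; do
        if "$ctl" is-active --quiet "$unit"; then
            echo "$unit: active"
        else
            echo "$unit: NOT active"
            rc=1
        fi
    done
    [ "$rc" -eq 0 ]
)
```

On the machine in question it would be called as `check_units munge slurmctld slurmd`; a non-zero exit status means at least one of the services still needs to be started.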
I suggest you follow this guide I wrote on how to install Slurm in a single-node environment on Ubuntu 22.04.
Cheers.
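Answer 2
Looking at the systemctl output you provided, I can tell you the following:
1- As for slurmd, the hardware configuration you defined in slurm.conf is incorrect. What are the hardware specs of the node this configuration will run on?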
(Mar 05 05:57:17 thoma-Lenovo-Legion-5-15IMH05H slurmd[6514]: slurmd: error: Node configuration differs from hardware: CPUs=12:12(hw) Boards=1:1(hw) SocketsPerBoard=12:1(hw) CoresPerSocket=1:6(hw) ThreadsPerCore>)
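According to this output, your values for SocketsPerBoard and CoresPerSocket should be 1 and 6, respectively.
2- As for slurmctld, the initial node state should be UNKNOWN, like this: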
NodeName=localhost CPUs=12 RealMemory=30517 State=UNKNOWN
PartitionName=localhost Nodes=ALL Default=YES MaxTime=INFINITE State=UP
Note: I see you set "8000" as your RealMemory value. Try using "8192" instead, since Slurm uses MiB values :)
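One caveat worth checking before settling on a RealMemory value: the slurmd status output in the question reports Memory=7838, and as far as I know Slurm flags a node as invalid when it registers with less memory than slurm.conf claims. The sketch below compares a wanted value against the detected one. The `node_field` helper and the sample line are illustrative assumptions (the sample is assembled from values visible in the question's logs, not real `slurmd -C` output).

```shell
# Print the value of one key (for example RealMemory) from a NodeName=... line on stdin.
node_field() {
    tr ' ' '\n' | sed -n "s/^$1=//p"
}

# Illustrative line assembled from values visible in the question's logs;
# on the real machine, pipe in the output of `slurmd -C` instead.
sample='NodeName=localhost CPUs=12 Boards=1 SocketsPerBoard=1 CoresPerSocket=6 RealMemory=7838'

detected=$(printf '%s\n' "$sample" | node_field RealMemory)
wanted=8192   # the value suggested above
if [ "$wanted" -gt "$detected" ]; then
    echo "RealMemory=$wanted exceeds detected $detected MiB; consider RealMemory=$detected"
fi
```

On the actual node, `slurmd -C | node_field RealMemory` should print the amount slurmd detects, and using that number (or slightly less) is probably the safer choice.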
Then, after changing these, try restarting both slurmd and slurmctld, and let me know if this helps.
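After the restart, the slurmd message from point 1 can be checked again. In that message each field is printed as configured:detected(hw), so a small filter can list only the fields that still differ. This is a rough sketch; `show_mismatches` is a name I made up, and it assumes `grep -o` is available (GNU, BSD, and BusyBox grep all provide it).

```shell
# Read a slurmd "Node configuration differs from hardware" message on stdin
# and print only the fields whose configured value differs from the detected one.
show_mismatches() {
    grep -o '[A-Za-z]*=[0-9]*:[0-9]*(hw)' |
    awk -F'[(=:]' '$2 != $3 { printf "%s: configured=%s detected=%s\n", $1, $2, $3 }'
}

# The readable part of the message quoted in point 1.
msg='CPUs=12:12(hw) Boards=1:1(hw) SocketsPerBoard=12:1(hw) CoresPerSocket=1:6(hw)'
printf '%s\n' "$msg" | show_mismatches
# SocketsPerBoard: configured=12 detected=1
# CoresPerSocket: configured=1 detected=6
```

On the real system the input could come from `journalctl -u slurmd | show_mismatches`; no output means slurmd no longer reports a mismatch in the fields it logs.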
Cheers!