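I am setting up slurm 22.05.6 and slowly building a cluster. So far I have set up a server, vogon, and one node, ceres; this seems to work fine - I can start jobs with srun. The server runs Debian 11, the node runs Ubuntu 22.04, and its CPU is an AMD: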
root@ceres:~# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 24
On-line CPU(s) list: 0-23
Vendor ID: AuthenticAMD
Model name: AMD Ryzen 9 5900X 12-Core Processor
CPU family: 25
Model: 33
Thread(s) per core: 2
Core(s) per socket: 12
Socket(s): 1
Stepping: 2
Frequency boost: enabled
CPU max MHz: 4950.1948
CPU min MHz: 2200.0000
BogoMIPS: 7399.57
...
I have now set up another node, hathor, with an Intel CPU:
root@hathor:~/slurm-22.05.6/etc# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 46 bits physical, 48 bits virtual
CPU(s): 24
On-line CPU(s) list: 0-23
Thread(s) per core: 1
Core(s) per socket: 16
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 151
Model name: 12th Gen Intel(R) Core(TM) i9-12900KS
Stepping: 2
CPU MHz: 3400.000
CPU max MHz: 5500.0000
CPU min MHz: 800.0000
BogoMIPS: 6835.20
...
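As you can see, the number of CPUs is not a multiple of the number of cores; according to man slurm.conf, this should be possible with SlurmdParameters=config_overrides - besides, shouldn't the only parameter that matters be the number of CPUs? When I start slurmd, the status looks like this: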
root@hathor:~/slurm-22.05.6/etc# systemctl status slurmd
● slurmd.service - Slurm node daemon
Loaded: loaded (/lib/systemd/system/slurmd.service; enabled; vendor preset: enabled)
Active: active (running) since Thu 2022-11-24 13:49:01 GMT; 32min ago
Main PID: 124749 (slurmd)
Tasks: 1
Memory: 1.3M
CGroup: /system.slice/slurmd.service
└─124749 /usr/local/sbin/slurmd -D -s
Nov 24 13:49:01 hathor systemd[1]: Started Slurm node daemon.
Nov 24 13:49:01 hathor slurmd[124749]: slurmd: error: Thread count (24) not multiple of core count (16)
Nov 24 13:49:01 hathor slurmd[124749]: slurmd: Node configuration differs from hardware: CPUs=24:24(hw) Boards=1:1(hw) SocketsPerBoard=24:1(hw) CoresPerSocke>
Nov 24 13:49:01 hathor slurmd[124749]: slurmd: error: Thread count (24) not multiple of core count (16)
Nov 24 13:49:01 hathor slurmd[124749]: slurmd: slurmd version 22.05.6 started
Nov 24 13:49:01 hathor slurmd[124749]: slurmd: CPUs=24 Boards=1 Sockets=24 Cores=1 Threads=1 Memory=128530 TmpDisk=943 Uptime=8938 CPUSpecList=(null) Feature>
And sinfo lists only ceres:
root@hathor:~/slurm-22.05.6/etc# sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
compute* up infinite 1 idle ceres
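If the problem really is the mismatch between CPUs and cores, I could probably disable threads in the BIOS, but I would rather not. Is there a workaround? Or should I look for another problem in my setup?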
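To make the mismatch concrete, here is a minimal sketch of the arithmetic behind the slurmd error message. It assumes the check is simply whether the logical CPU count divides evenly by the core count; I have not verified this against the Slurm source, and the function name threads_divide_evenly is made up for illustration. The numbers are taken from the two lscpu outputs above.

```python
def threads_divide_evenly(logical_cpus: int, cores: int) -> bool:
    """Return True when the logical CPU count is a whole multiple of the core count."""
    return logical_cpus % cores == 0


# (logical CPUs, cores) as reported by lscpu on each node.
nodes = {
    "ceres": (24, 12),   # 12 cores with 2 threads each
    "hathor": (24, 16),  # hybrid CPU: 8 cores with 2 threads plus 8 cores with 1 thread
}

for name, (cpus, cores) in nodes.items():
    status = "ok" if threads_divide_evenly(cpus, cores) else "not a multiple"
    print(f"{name}: {cpus} CPUs / {cores} cores -> {status}")
```

On hathor no single ThreadsPerCore value describes every core, which seems to be why a uniform Sockets x Cores x Threads layout cannot match the hardware.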
Edit
My slurm.conf:
root@hathor:/var/log# cat /usr/local/etc/slurm.conf
ClusterName=comind
SlurmctldHost=vogon
MpiDefault=none
ProctrackType=proctrack/cgroup
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
TaskPlugin=task/affinity
#
# TIMERS
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
#
# SCHEDULING
SchedulerType=sched/backfill
SelectType=select/cons_tres
#
# LOGGING AND ACCOUNTING
AccountingStorageHost=localhost
AccountingStoragePass="/var/run/munge/munge.socket.2"
AccountingStoragePort=3307
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageUser=slurm
AccountingStoreFlags=job_comment,job_script,job_env
JobCompHost=localhost
JobCompLoc=slurm_job_db
JobCompPass=Atauseq01
JobCompPort=3306
JobCompType=jobcomp/mysql
JobCompUser=slurm
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurmd.log
#
# COMPUTE NODES
NodeName=ceres CPUs=24 RealMemory=100000 Sockets=1 CoresPerSocket=12 ThreadsPerCore=2 State=UNKNOWN
NodeName=hathor CPUs=24 RealMemory=120000 State=UNKNOWN
PartitionName=compute Nodes=ALL Default=YES MaxTime=INFINITE State=UP
Answer 1
Just a very short answer to show my solution - maybe someone else will write this up as a more detailed answer? I would accept that as the best reply.
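So, it turned out to be very simple - just change the PartitionName line by replacing ALL with an explicit list of nodes. Intuitively this doesn't make much sense - ALL should mean "all nodes" - but it worked for me: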
PartitionName=compute Nodes=ceres,hathor Default=YES MaxTime=INFINITE State=UP
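One downside of an explicit list is that every new node has to be added to it by hand. Below is a rough sketch of a sanity check for that, assuming a simplified slurm.conf with one NodeName per line and a plain comma-separated Nodes= value (no hostlist ranges such as node[1-4]). The helper names parse_line and nodes_missing_from_partition are hypothetical and not part of Slurm.

```python
def parse_line(line: str) -> dict:
    """Split a 'Key=Value Key=Value' slurm.conf line into a dict."""
    return dict(token.split("=", 1) for token in line.split() if "=" in token)


def nodes_missing_from_partition(conf_text: str, partition: str) -> list:
    """List the NodeName entries that the given partition does not name."""
    node_names = []
    partition_nodes = None
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if line.startswith("NodeName="):
            node_names.append(parse_line(line)["NodeName"])
        elif line.startswith("PartitionName="):
            fields = parse_line(line)
            if fields["PartitionName"] == partition:
                partition_nodes = fields.get("Nodes", "")
    if partition_nodes is None:
        raise ValueError(f"partition {partition!r} not found")
    if partition_nodes.upper() == "ALL":
        return []  # ALL is meant to cover every defined node
    listed = set(partition_nodes.split(","))
    return [name for name in node_names if name not in listed]


# Node and partition lines in the style of the slurm.conf above.
example = """
NodeName=ceres CPUs=24 RealMemory=100000 State=UNKNOWN
NodeName=hathor CPUs=24 RealMemory=120000 State=UNKNOWN
PartitionName=compute Nodes=ceres,hathor Default=YES MaxTime=INFINITE State=UP
"""
print(nodes_missing_from_partition(example, "compute"))
```

This only reads the file; it says nothing about what slurmctld has actually loaded, so checking sinfo after a restart or reconfigure is still needed.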