I am new to SLURM and am trying to configure it on a new cluster.
I have 4 nodes, each with 14 cores. I want to share the nodes so that each core can run independently (i.e., node01 can run 14 separate serial jobs at once), but no core should ever run more than one job. From the documentation, I gathered that I need to set
SelectType = select/cons_res
SelectTypeParameters = CR_CORE
So I set those in slurm.conf and restarted slurmctld.
But now when I submit a job, I either get "Requested node configuration is not available" or the job ends up stuck in the CG state.
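For reference, a minimal slurm.conf excerpt for this kind of setup would look something like the following. The node names beyond node001 and the 2x7 socket/core split are assumptions; the hardware line has to match what slurmd -C reports on each node.

SelectType=select/cons_res
SelectTypeParameters=CR_Core
# 14 cores per node; the Sockets/CoresPerSocket split below is a placeholder
NodeName=node[001-004] Sockets=2 CoresPerSocket=7 ThreadsPerCore=1 State=UNKNOWN
PartitionName=defq Nodes=node[001-004] Default=YES MaxTime=INFINITE State=UP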
Example 1:
[sr@clstr mpitests]$ cat newHello.slrm
#!/bin/sh
#SBATCH --time=00:01:00
#SBATCH -N 1
#SBATCH --ntasks=4
#SBATCH --ntasks-per-node=4
module add shared openmpi/gcc/64 slurm
module load somesh/scripts/1.0
mpirun helloMPIf90
Which results in:
[sr@clstr mpitests]$ sbatch -v newHello.slrm
sbatch: defined options for program `sbatch'
sbatch: ----------------- ---------------------
sbatch: user : `sr'
sbatch: uid : 1003
sbatch: gid : 1003
sbatch: cwd : /home/sr/clusterTests/mpitests
sbatch: ntasks : 4 (set)
sbatch: nodes : 1-1
sbatch: jobid : 4294967294 (default)
sbatch: partition : default
sbatch: profile : `NotSet'
sbatch: job name : `newHello.slrm'
sbatch: reservation : `(null)'
sbatch: wckey : `(null)'
sbatch: distribution : unknown
sbatch: verbose : 1
sbatch: immediate : false
sbatch: overcommit : false
sbatch: time_limit : 1
sbatch: nice : -2
sbatch: account : (null)
sbatch: comment : (null)
sbatch: dependency : (null)
sbatch: qos : (null)
sbatch: constraints :
sbatch: geometry : (null)
sbatch: reboot : yes
sbatch: rotate : no
sbatch: network : (null)
sbatch: array : N/A
sbatch: cpu_freq_min : 4294967294
sbatch: cpu_freq_max : 4294967294
sbatch: cpu_freq_gov : 4294967294
sbatch: mail_type : NONE
sbatch: mail_user : (null)
sbatch: sockets-per-node : -2
sbatch: cores-per-socket : -2
sbatch: threads-per-core : -2
sbatch: ntasks-per-node : 4
sbatch: ntasks-per-socket : -2
sbatch: ntasks-per-core : -2
sbatch: mem_bind : default
sbatch: plane_size : 4294967294
sbatch: propagate : NONE
sbatch: switches : -1
sbatch: wait-for-switches : -1
sbatch: core-spec : NA
sbatch: burst_buffer : `(null)'
sbatch: remote command : `/home/sr/clusterTests/mpitests/newHello.slrm'
sbatch: power :
sbatch: wait : yes
sbatch: Consumable Resources (CR) Node Selection plugin loaded with argument 4
sbatch: Cray node selection plugin loaded
sbatch: Linear node selection plugin loaded with argument 4
sbatch: Serial Job Resource Selection plugin loaded with argument 4
sbatch: error: Batch job submission failed: Requested node configuration is not available
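The "Requested node configuration is not available" error means no node in the partition satisfies the request as registered with the controller, so it can help to compare what slurmctld thinks a node has against the actual hardware (standard commands, noted here as a diagnostic aid):

scontrol show node node001    # the configuration slurmctld has registered
slurmd -C                     # run on node001: the hardware slurmd detects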
Example 2:
[sr@clstr mpitests]$ cat newHello.slrm
#!/bin/sh
#SBATCH --time=00:01:00
#SBATCH -N 1
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
module add shared openmpi/gcc/64 slurm
module load somesh/scripts/1.0
helloMPIf90
Which results in:
[sr@clstr mpitests]$ sbatch -v newHello.slrm
sbatch: defined options for program `sbatch'
sbatch: ----------------- ---------------------
sbatch: user : `sr'
sbatch: uid : 1003
sbatch: gid : 1003
sbatch: cwd : /home/sr/clusterTests/mpitests
sbatch: ntasks : 1 (set)
sbatch: nodes : 1-1
sbatch: jobid : 4294967294 (default)
sbatch: partition : default
sbatch: profile : `NotSet'
sbatch: job name : `newHello.slrm'
sbatch: reservation : `(null)'
sbatch: wckey : `(null)'
sbatch: distribution : unknown
sbatch: verbose : 1
sbatch: immediate : false
sbatch: overcommit : false
sbatch: time_limit : 1
sbatch: nice : -2
sbatch: account : (null)
sbatch: comment : (null)
sbatch: dependency : (null)
sbatch: qos : (null)
sbatch: constraints :
sbatch: geometry : (null)
sbatch: reboot : yes
sbatch: rotate : no
sbatch: network : (null)
sbatch: array : N/A
sbatch: cpu_freq_min : 4294967294
sbatch: cpu_freq_max : 4294967294
sbatch: cpu_freq_gov : 4294967294
sbatch: mail_type : NONE
sbatch: mail_user : (null)
sbatch: sockets-per-node : -2
sbatch: cores-per-socket : -2
sbatch: threads-per-core : -2
sbatch: ntasks-per-node : 1
sbatch: ntasks-per-socket : -2
sbatch: ntasks-per-core : -2
sbatch: mem_bind : default
sbatch: plane_size : 4294967294
sbatch: propagate : NONE
sbatch: switches : -1
sbatch: wait-for-switches : -1
sbatch: core-spec : NA
sbatch: burst_buffer : `(null)'
sbatch: remote command : `/home/sr/clusterTests/mpitests/newHello.slrm'
sbatch: power :
sbatch: wait : yes
sbatch: Consumable Resources (CR) Node Selection plugin loaded with argument 4
sbatch: Cray node selection plugin loaded
sbatch: Linear node selection plugin loaded with argument 4
sbatch: Serial Job Resource Selection plugin loaded with argument 4
Submitted batch job 108
[sr@clstr mpitests]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
108 defq newHello sr CG 0:01 1 node001
[sr@clstr mpitests]$ scontrol show job=108
JobId=108 JobName=newHello.slrm
UserId=sr(1003) GroupId=sr(1003) MCS_label=N/A
Priority=4294901756 Nice=0 Account=(null) QOS=normal
JobState=COMPLETING Reason=NonZeroExitCode Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=1:0
RunTime=00:00:01 TimeLimit=00:01:00 TimeMin=N/A
SubmitTime=2017-03-03T18:25:51 EligibleTime=2017-03-03T18:25:51
StartTime=2017-03-03T18:26:01 EndTime=2017-03-03T18:26:02 Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=defq AllocNode:Sid=clstr:20260
ReqNodeList=(null) ExcNodeList=(null)
NodeList=node001
BatchHost=node001
NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=1,node=1
Socks/Node=* NtasksPerN:B:S:C=1:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) Gres=(null) Reservation=(null)
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/home/sr/clusterTests/mpitests/newHello.slrm
WorkDir=/home/sr/clusterTests/mpitests
StdErr=/home/sr/clusterTests/mpitests/slurm-108.out
StdIn=/dev/null
StdOut=/home/sr/clusterTests/mpitests/slurm-108.out
Power=
In the second example, the job stays in the CG state until I reboot the node.
If I set slurm.conf back to SelectType=select/linear, everything runs fine.
I am completely at a loss as to where I went wrong. Is it my Slurm configuration, my job submission script, or something else entirely?
If anyone can point me in the right direction, it would be very helpful.
[Note: I originally posted this on Stack Overflow, but realized Super User might be a better forum for it.]
Answer 1
It seems I just needed to restart the entire cluster! Jobs now run fine with cons_res.
It may have been related to a filesystem issue, as suggested in the Slurm documentation.
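In case it helps anyone else: on a typical systemd-based install, a full daemon restart looks roughly like this (the pdsh usage and node list are assumptions; adapt to your setup):

# As root on the head node: restart the controller
systemctl restart slurmctld
# On all compute nodes:
pdsh -w node[001-004] systemctl restart slurmd

Afterwards, core-level sharing can be sanity-checked by filling one node with single-core jobs; under CR_Core all 14 should be RUNNING on node001 at the same time:

for i in $(seq 1 14); do
    sbatch -N1 -n1 -w node001 --wrap='sleep 120'
done
squeue -w node001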