I have a Ryzen 9 5950X CPU (16 cores/32 threads) and a Xeon Phi 7120P card, with the following node and partition definitions in slurm.conf:
NodeName=mic0 RealMemory=15000 Sockets=1 CoresPerSocket=61 ThreadsPerCore=4 State=UNKNOWN
PartitionName=compute Nodes=mic0 Default=YES MaxTime=INFINITE State=UP TRESBillingWeights="CPU=1.0,Mem=4.0G"
NodeName=amd RealMemory=10000 Sockets=1 CoresPerSocket=16 ThreadsPerCore=2 State=UNKNOWN
PartitionName=fast Nodes=amd Default=No MaxTime=INFINITE State=UP TRESBillingWeights="CPU=4.0,Mem=4.0G"
I want to run one task per core or per thread on the Ryzen CPU, but every task in my job has access to all CPU threads. For example, after allocating a job with
salloc -p fast -n 8 --threads-per-core=1 --mem=256mb
the command
srun -l --cpu_bind=threads cat /proc/self/status | grep Cpus_allowed_list | sort -n
prints:
0: Cpus_allowed_list: 0-31
1: Cpus_allowed_list: 0-31
2: Cpus_allowed_list: 0-31
3: Cpus_allowed_list: 0-31
4: Cpus_allowed_list: 0-31
5: Cpus_allowed_list: 0-31
6: Cpus_allowed_list: 0-31
7: Cpus_allowed_list: 0-31
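To make the check less error-prone than reading the lists by eye, a small helper along these lines can count how many CPUs a given Cpus_allowed_list value covers. This is only a sketch: `count_allowed_cpus` is a hypothetical name I made up, not part of Slurm, and it assumes the kernel's usual list format of comma-separated CPU numbers and ranges.

```shell
# Hypothetical helper (not a Slurm tool): count the CPUs named in a
# Cpus_allowed_list value such as "0-31" or "0,16".
# The body runs in a subshell so IFS and the variables do not leak.
count_allowed_cpus() (
    total=0
    IFS=','
    for part in $1; do
        case "$part" in
            # A range like "0-31" contributes hi - lo + 1 CPUs.
            *-*) lo=${part%-*}; hi=${part#*-}; total=$(( total + hi - lo + 1 )) ;;
            # A single CPU number contributes one.
            *)   total=$(( total + 1 )) ;;
        esac
    done
    echo "$total"
)

count_allowed_cpus 0-31   # prints 32
```

A task bound to a single hardware thread should give 1. Every task in the output above gives 32, which is the whole Ryzen CPU.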
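I want each task to be restricted to a single thread or, failing that, a single core. The same problem occurs with
salloc -p fast -n 8 --ntasks-per-core=1 --mem=256mb
On the Xeon Phi, by contrast, everything works as expected.
How can I fix this? Is there an error in slurm.conf or in the job allocation line?
The Slurm version is 21.08.8-2. The operating system is CentOS 7.
The full slurm.conf (it is a very small "cluster", just a single workstation):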
ClusterName=cluster
SlurmctldHost=amd
ProctrackType=proctrack/linuxproc
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=root
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
TaskPlugin=task/affinity
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
PriorityType=priority/multifactor
PriorityWeightAge=1000
PriorityWeightFairshare=10000
PriorityWeightJobSize=1000
PriorityWeightPartition=1000
PriorityWeightQOS=1000 # don't use the qos factor
PriorityWeightTRES=CPU=1000,Mem=4000
PriorityFavorSmall=YES
AccountingStorageEnforce=associations,limits
AccountingStorageType=accounting_storage/slurmdbd
AccountingStoreFlags=job_comment
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurmd.log
NodeName=mic0 RealMemory=15000 Sockets=1 CoresPerSocket=61 ThreadsPerCore=4 State=UNKNOWN
PartitionName=compute Nodes=mic0 Default=YES MaxTime=INFINITE State=UP TRESBillingWeights="CPU=1.0,Mem=4.0G"
#
NodeName=amd RealMemory=10000 Sockets=1 CoresPerSocket=16 ThreadsPerCore=2 State=UNKNOWN
PartitionName=fast Nodes=amd Default=No MaxTime=INFINITE State=UP TRESBillingWeights="CPU=4.0,Mem=4.0G"