Slurm initialization fails on a Raspberry Pi cluster running Raspbian 9.4

I am trying to set up Slurm on a Raspberry Pi cluster running Raspbian 9.4.

I can start slurmctld, but when I try to start slurmd I get the following output:

pi@node1:~ $ slurmd -Dvvvc
slurmd: debug:  Log file re-opened
slurmd: error: Domain socket directory /SHARED/slurm/var/slurmd.node1: No such file or directory
slurmd: Message aggregation disabled
slurmd: topology NONE plugin loaded
slurmd: route default plugin loaded
slurmd: debug2: Gathering cpu frequency information for 4 cpus
slurmd: debug:  Resource spec: No specialized cores configured by default on this node
slurmd: debug:  Resource spec: Reserved system memory limit not configured for this node
slurmd: debug2: read_slurm_cgroup_conf: No cgroup.conf file (/SHARED/slurm/confdir/cgroup.conf)
slurmd: debug2: _file_read_content: unable to open '(null)/freezer//release_agent' for reading : No such file or directory
slurmd: debug2: xcgroup_get_param: unable to get parameter 'release_agent' for '(null)/freezer/'
slurmd: error: cgroup namespace 'freezer' not mounted. aborting
slurmd: error: unable to create freezer cgroup namespace
slurmd: error: Couldn't load specified plugin name for proctrack/cgroup: Plugin init() callback failed
slurmd: error: cannot create proctrack context for proctrack/cgroup
slurmd: error: slurmd initialization failed

My configuration file is:

ClusterName=Cluster
ControlMachine=node1
SlurmUser=pi
SlurmdUser=pi
AuthType=auth/none
CryptoType=crypto/openssl
JobCredentialPrivateKey = /SHARED/slurm/confdir/slurm.key
JobCredentialPublicCertificate = /SHARED/slurm/confdir/slurm.cert
SlurmctldDebug=3
SlurmdDebug=3

StateSaveLocation=/SHARED/slurm/var
SlurmdSpoolDir=/SHARED/slurm/var/slurmd.%n
SlurmctldPidFile=/SHARED/slurm/var/slurmctld.pid
SlurmdPidFile=/SHARED/slurm/var/slurmd.%n.pid

FastSchedule=2
SlurmctldLogFile=/SHARED/slurm/var/slurmctld.log
SlurmdLogFile=/SHARED/slurm/var/slurmd.%n.log

NodeName=node1 CPUs=4 SocketsPerBoard=4 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=976 TmpDisk=8212

PartitionName=main Nodes=node1 Default=YES MaxTime=INFINITE State=UP

What am I missing?

Answer 1

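The problem is that cgroups are not set up correctly: slurmd aborts because the 'freezer' cgroup namespace is not mounted, so the proctrack/cgroup plugin cannot load. You can either configure cgroups or change the process tracking type.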
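
To confirm the diagnosis before changing anything, you can look for the controller in /proc/mounts. This is only a minimal sketch, assuming cgroup v1, where each controller appears as a mount option on a mount of type cgroup. The name cgroup_mounted is a hypothetical helper, not a Slurm command.

```shell
#!/bin/sh
# Hypothetical helper (not part of Slurm): succeeds if the named cgroup v1
# controller appears in the options of a mount of type "cgroup".
# $1 = controller name, $2 = mounts file (defaults to /proc/mounts)
cgroup_mounted() {
    awk -v ctrl="$1" '
        $3 == "cgroup" {
            # Field 4 holds the comma-separated mount options.
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] == ctrl) found = 1
        }
        END { exit (found ? 0 : 1) }
    ' "${2:-/proc/mounts}"
}

if cgroup_mounted freezer; then
    echo "freezer cgroup is mounted"
else
    echo "freezer cgroup is not mounted"
fi
```

If this reports that the controller is not mounted, that matches the "cgroup namespace 'freezer' not mounted" error in the slurmd output.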

Your slurm.conf should contain a parameter named ProctrackType. If you want to switch to a mechanism that works without cgroups, set it to:

ProctrackType=proctrack/linuxproc

If the parameter is not present, simply add this line.
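
If you would rather script the change, the following sketch replaces an existing ProctrackType entry or appends one when none exists. The function name set_proctrack is hypothetical, and the slurm.conf path in the usage comment is an assumption based on the confdir shown in the log. Adjust it to wherever your slurm.conf actually lives.

```shell
#!/bin/sh
# Hypothetical helper: set ProctrackType in a slurm.conf file.
# $1 = path to slurm.conf, $2 = value (e.g. proctrack/linuxproc)
set_proctrack() {
    conf="$1"
    value="$2"
    if grep -q '^[[:space:]]*ProctrackType[[:space:]]*=' "$conf"; then
        # Rewrite the existing entry; "|" is the sed delimiter because
        # the value itself contains "/".
        sed "s|^[[:space:]]*ProctrackType[[:space:]]*=.*|ProctrackType=$value|" \
            "$conf" > "$conf.tmp" && mv "$conf.tmp" "$conf"
    else
        # No entry yet: append one at the end of the file.
        printf 'ProctrackType=%s\n' "$value" >> "$conf"
    fi
}

# Example usage (path is an assumption, not taken from the question):
#   set_proctrack /SHARED/slurm/confdir/slurm.conf proctrack/linuxproc
```

After changing the file, start slurmd again to check whether the proctrack error is gone.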