The Slurm service fails to start again, and I don't know why

I have one master node and two worker nodes.
One worker node connects successfully, but the other fails.

Every node runs Ubuntu 18.04 and Slurm 17.11.

When I run systemctl status slurmd.service, I get this error:

slurmd.service - Slurm node daemon
   Loaded: loaded (/lib/systemd/system/slurmd.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Tue 2019-10-15 15:28:22 KST; 22min ago
     Docs: man:slurmd(8)
  Process: 27335 ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=1/FAILURE)
 Main PID: 75036 (code=exited, status=0/SUCCESS)
    Tasks: 1 (limit: 19660)
   CGroup: /system.slice/slurmd.service
           └─97690 /usr/sbin/slurmd -d /usr/sbin/slurmstepd

Oct 15 15:28:22 seok-System systemd[1]: Starting Slurm node daemon...
Oct 15 15:28:22 seok-System systemd[1]: slurmd.service: Control process exited, code=exited status=1
Oct 15 15:28:22 seok-System systemd[1]: slurmd.service: Failed with result 'exit-code'.
Oct 15 15:28:22 seok-System systemd[1]: Failed to start Slurm node daemon.
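
To get more than the last few lines from systemd, the full unit log can also be pulled with journalctl (standard journalctl usage, nothing specific to my setup):

    # all slurmd messages since the current boot
    journalctl -u slurmd.service -b --no-pager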

When I run slurmd -Dvvv, I get the following output:

(null): log_init(): Unable to open logfile `/var/log/slurmd.log': Permission denied
slurmd: debug: Log file re-opened
slurmd: Message aggregation disabled
slurmd: debug: init: Gres GPU plugin loaded
slurmd: Gres Name=gpu Type=gtx1080ti Count=1
slurmd: Gres Name=gpu Type=gtx1080ti Count=1
slurmd: gpu device number 0(/dev/nvidia0):c 195:0 rwm
slurmd: gpu device number 1(/dev/nvidia1):c 195:1 rwm
slurmd: topology NONE plugin loaded
slurmd: route default plugin loaded
slurmd: debug2: Gathering cpu frequency information for 32 cpus
slurmd: debug: Resource spec: No specialized cores configured by default on this node
slurmd: debug: Resource spec: Reserved system memory limit not configured for this node
slurmd: debug: Reading cgroup.conf file /etc/slurm/cgroup.conf
slurmd: debug: Ignoring obsolete CgroupReleaseAgentDir option.
slurmd: debug: Reading cgroup.conf file /etc/slurm/cgroup.conf
slurmd: debug: Ignoring obsolete CgroupReleaseAgentDir option.
slurmd: debug2: _file_write_content: unable to open '/sys/fs/cgroup/memory/memory.use_hierarchy' for writing: Permission denied
slurmd: debug2: xcgroup_set_param: unable to set parameter 'memory.use_hierarchy' to '1' for '/sys/fs/cgroup/memory'
slurmd: debug: task/cgroup/memory: total:128846M allowed:100%(enforced), swap:0%(permissive), max:100%(128846M) max+swap:100%(257692M) min:30M kmem:100%(128846M enforced) min:30M swappiness:0(unset)
slurmd: debug: task/cgroup: now constraining jobs allocated memory
slurmd: debug: task/cgroup: loaded
slurmd: debug: Munge authentication plugin loaded
slurmd: debug: spank: opening plugin stack /etc/slurm/plugstack.conf
slurmd: Munge cryptographic signature plugin loaded
slurmd: error: chmod(/var/spool/slurmd, 0755): Operation not permitted
slurmd: error: Unable to initialize slurmd spooldir
slurmd: error: slurmd initialization failed

Both nodes show the same messages in this output, yet slurmd starts successfully on one node and fails on the other.

I have checked munge, the permissions, and so on, but I still don't know how to fix this.
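
For reference, these are roughly the checks I mean by "munge, permissions, and so on" (paths taken from my slurm.conf; the last line assumes passwordless ssh from the node to the master):

    # ownership of the spool dir and log file that slurmd complains about
    ls -ld /var/spool/slurmd
    ls -l  /var/log/slurmd.log

    # munge is running and the key matches the master
    systemctl status munge
    munge -n | ssh master unmunge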

Here is my slurm.conf:

# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=master
ControlAddr=ip.ip.ip.ip
#BackupController=
#BackupAddr=
#
AuthType=auth/munge
AuthInfo=/var/run/munge/munge.socket.2
#CheckpointType=checkpoint/none
CryptoType=crypto/munge
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=999999
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobCheckpointDir=/var/slurm/checkpoint
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=1
#KillOnBadExit=0
#LaunchType=launch/slurm
#Licenses=foo*4,bar
#MailProg=/bin/mail
#MaxJobCount=5000
#MaxStepCount=40000
#MaxTasksPerNode=128
MpiDefault=none
#MpiParams=ports=#-#
PluginDir=/usr/lib/slurm
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/cgroup
#Prolog=
#PrologFlags=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#RebootProgram=
ReturnToService=1
#SallocDefaultCommand=
SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/var/spool/slurm-llnl
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/cgroup
TaskPluginParam=Sched
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFS=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
#
# SCHEDULING
#DefMemPerCPU=0
FastSchedule=1
#MaxMemPerCPU=0
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core
#
#
# JOB PRIORITY
#PriorityFlags=
#PriorityType=priority/basic
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
#AccountingStorageHost=
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStoragePort=
AccountingStorageType=accounting_storage/none
#AccountingStorageUser=
AccountingStoreJobComment=YES
ClusterName=cluster
DebugFlags=NO_CONF_HASH
#JobCompHost=
#JobCompLoc=
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/none
#JobCompUser=
#JobContainerType=job_container/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurmd.log
#SlurmSchedLogFile=
#SlurmSchedLogLevel=
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
GresTypes=gpu
NodeName=node1 Gres=gpu:pascal:1  NodeAddr=ip.ip.ip.ip CPUs=32 State=UNKNOWN CoresPerSocket=8 ThreadsPerCore=2 RealMemory=48209
NodeName=node2 Gres=gpu:pascal:2  NodeAddr=ip.ip.ip.ip CPUs=32 State=UNKNOWN CoresPerSocket=16 ThreadsPerCore=2 RealMemory=128846
PartitionName=Test Nodes=node1 Default=YES MaxTime=INFINITE State=UP
PartitionName=Test Nodes=node2 Default=YES MaxTime=INFINITE State=UP
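
To rule out a mismatch between these NodeName lines and the real hardware, slurmd can print what it detects on each node, which I compare by hand against the CPUs/CoresPerSocket/ThreadsPerCore/RealMemory values above:

    # print this node's detected hardware in slurm.conf syntax
    slurmd -C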


EDIT

The permissions on /var/spool are: drwxr-xr-x 8 root root 4096 Oct 15 14:58 spool

The permissions on /var/spool/slurmd are: drwxr-xr-x 2 slurm slurm 4096 Oct 15 14:58 slurmd

I already tried sudo chmod 777 /var/spool /var/spool/slurmd to change the permissions, but I get the same error, so that did not work.
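
Since chmod 777 did not help, my next idea is to set the ownership explicitly and check whether something else blocks chmod, such as an immutable attribute or a read-only mount. This is only a guess on my part, and the chown target assumes SlurmUser=slurm from the config above:

    # hand the spool dir to the slurm user explicitly
    sudo chown -R slurm:slurm /var/spool/slurmd
    sudo chmod 755 /var/spool/slurmd

    # an immutable attribute or a read-only mount would also give
    # "Operation not permitted" on chmod
    lsattr -d /var/spool/slurmd
    findmnt -T /var/spool/slurmd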


EDIT

Here is my slurmd.log file:

 gpu device number 0(/dev/nvidia0):c 195:0 rwm
 gpu device number 1(/dev/nvidia1):c 195:1 rwm
 fatal: Unable to find slurmstepd file at /tmp/slurm-build/sbin/slurmstepd

I have never touched slurmstepd, so where is that path set?
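
That /tmp/slurm-build path looks like a compile-time default, so I assume the slurmstepd location is baked into the slurmd binary at build time rather than set in slurm.conf. This is how I plan to check which path my installed slurmd expects and whether a slurmstepd binary actually exists next to it (just a check, not a known fix):

    # path to slurmstepd that was compiled into slurmd
    strings /usr/sbin/slurmd | grep slurmstepd

    # is there a slurmstepd binary where slurmd is installed?
    ls -l /usr/sbin/slurmstepd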
