I currently have a cluster of 10 worker nodes managed by Slurm, plus 1 master node. I have successfully set this cluster up before; despite some initial problems, I eventually got it running. All of my scripts and instructions are in my GitHub repository here:
https://brettchapman.github.io/Nimbus_Cluster
I recently needed to start over to increase the hard drive space, but now, no matter what I try, I cannot seem to get it installed and configured correctly.
Slurmctld and slurmdbd install and are configured correctly (both are active and running according to systemctl status), but slurmd remains in a failed/inactive state.
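For reference, these are roughly the commands I use to check the daemon states (assuming the standard systemd unit names; output omitted here):

sudo systemctl status slurmctld        # on the head node (node-0)
sudo systemctl status slurmdbd         # on the head node (node-0)
pdsh -a sudo systemctl status slurmd   # across the compute nodes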
The following is my slurm.conf file:
# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
SlurmctldHost=node-0
#SlurmctldHost=
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=999999
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=1
#KillOnBadExit=0
#LaunchType=launch/slurm
#Licenses=foo*4,bar
#MailProg=/bin/mail
#MaxJobCount=5000
#MaxStepCount=40000
#MaxTasksPerNode=128
MpiDefault=none
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/cgroup
#Prolog=
#PrologFlags=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#RebootProgram=
ReturnToService=1
#SallocDefaultCommand=
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/var/spool/slurm-llnl
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/cgroup
#TaskPluginParam=
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFS=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=600
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
# SCHEDULING
#DefMemPerCPU=0
#MaxMemPerCPU=0
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core
# JOB PRIORITY
#PriorityFlags=
#PriorityType=priority/basic
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
#AccountingStorageHost=
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStoragePort=
AccountingStorageType=accounting_storage/filetxt
#AccountingStorageUser=
AccountingStoreJobComment=YES
ClusterName=cluster
#DebugFlags=
JobCompHost=localhost
JobCompLoc=slurm_acct_db
JobCompPass=password
#JobCompPort=
JobCompType=jobcomp/mysql
JobCompUser=slurm
#JobContainerType=job_container/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
#SlurmSchedLogFile=
#SlurmSchedLogLevel=
# POWER SAVE SUPPORT FOR IDLE NODES (OPTIONAL)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
# COMPUTE NODES
NodeName=node-[1-10] NodeAddr=node-[1-10] CPUs=16 RealMemory=64323 Sockets=1 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN
PartitionName=debug Nodes=node-[1-10] Default=YES MaxTime=INFINITE State=UP
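In case it helps, I believe the node definition above can be cross-checked against what slurmd itself detects on each compute node; this is the check I have in mind (run on node-1 through node-10):

sudo slurmd -C   # prints the CPUs/Sockets/CoresPerSocket/ThreadsPerCore/RealMemory slurmd detects, to compare with the NodeName line above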
And the following is my slurmdbd.conf file:
AuthType=auth/munge
AuthInfo=/run/munge/munge.socket.2
DbdHost=localhost
DebugLevel=info
StorageHost=localhost
StorageLoc=slurm_acct_db
StoragePass=password
StorageType=accounting_storage/mysql
StorageUser=slurm
LogFile=/var/log/slurm-llnl/slurmdbd.log
PidFile=/var/run/slurmdbd.pid
SlurmUser=slurm
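If more detail on the accounting side is useful, I can also query slurmdbd and MySQL directly from the head node; these are the quick checks I would run (assuming the slurm_acct_db database and slurm user from the config above):

sacctmgr show cluster                     # lists clusters registered in slurmdbd
mysql -u slurm -p -e 'SHOW DATABASES;'    # confirm the slurm_acct_db database exists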
Running pdsh -a sudo systemctl status slurmd on my compute nodes gives me the following error:
pdsh@node-0: node-5: ssh exited with exit code 3
node-6: ● slurmd.service - Slurm node daemon
node-6: Loaded: loaded (/lib/systemd/system/slurmd.service; enabled; vendor preset: enabled)
node-6: Active: inactive (dead) since Tue 2020-08-11 03:52:58 UTC; 2min 45s ago
node-6: Docs: man:slurmd(8)
node-6: Process: 9068 ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=0/SUCCESS)
node-6: Main PID: 8983
node-6:
node-6: Aug 11 03:34:09 node-6 systemd[1]: Starting Slurm node daemon...
node-6: Aug 11 03:34:09 node-6 systemd[1]: slurmd.service: Supervising process 8983 which is not our child. We'll most likely not notice when it exits.
node-6: Aug 11 03:34:09 node-6 systemd[1]: Started Slurm node daemon.
node-6: Aug 11 03:52:58 node-6 systemd[1]: slurmd.service: Killing process 8983 (n/a) with signal SIGKILL.
node-6: Aug 11 03:52:58 node-6 systemd[1]: slurmd.service: Killing process 8983 (n/a) with signal SIGKILL.
node-6: Aug 11 03:52:58 node-6 systemd[1]: slurmd.service: Succeeded.
pdsh@node-0: node-6: ssh exited with exit code 3
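If more verbose output is needed, I can run slurmd in the foreground on one of the failing nodes and re-check munge between the head node and a compute node; these are the commands I would use (output not captured yet):

sudo slurmd -D -vvv                               # run slurmd in the foreground with extra verbosity
sudo tail -n 50 /var/log/slurm-llnl/slurmd.log    # the SlurmdLogFile path from slurm.conf above
munge -n | ssh node-6 unmunge                     # quick munge round-trip from node-0 to a compute node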
I didn't get this type of error when I previously had the cluster up and running, so I'm not sure what I have or haven't done differently between then and now. My guess is that it has something to do with file/folder permissions, as I have found those can be quite critical during setup. I may have forgotten to document something I did the first time. This is my second attempt at setting up a Slurm-managed cluster.
My entire workflow and all my scripts can be followed from my GitHub repository. If you need any other error output, please ask.
Thank you for any help.
Brett