SLURM can dispatch CPU jobs, but not GPU jobs

I have built a SLURM "cluster" (currently just one job/controller server and a single compute server) and am trying to run jobs on it. CPU jobs work fine: they are dispatched to the machine and run. However, when I submit a GPU job, it is never dispatched. According to the slurmctld log, the controller finds the node and considers it available; it simply never sends the job to the machine.
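
For reference, the submissions that show this behaviour are roughly of the following form (any GPU request, e.g. --gpus=1 or --gres=gpu:1, pends the same way; the job in the logs below was a simple neofetch run):

srun neofetch                  # CPU-only job: allocated and runs immediately
srun --gpus=1 neofetch         # GPU job: stays PENDING and is never dispatched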

Here are the configuration files and logs:

slurm.conf

# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ClusterName=cluster1
SlurmctldHost=jobserver
#SlurmctldHost=
#
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=67043328
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=lua
#KillOnBadExit=0
#LaunchType=launch/slurm
#Licenses=foo*4,bar
#MailProg=/bin/mail
#MaxJobCount=10000
#MaxStepCount=40000
#MaxTasksPerNode=512
MpiDefault=none
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/cgroup
#Prolog=
#PrologFlags=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#RebootProgram=
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/affinity,task/cgroup
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFS=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
#
# SCHEDULING
#DefMemPerCPU=0
#MaxMemPerCPU=0
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory
#
#
# JOB PRIORITY
#PriorityFlags=
#PriorityType=priority/basic
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
#AccountingStorageHost=
#AccountingStoragePass=
#AccountingStoragePort=
AccountingStorageTRES=gres/gpu
AccountingStorageType=accounting_storage/slurmdbd
#AccountingStorageUser=
#AccountingStoreFlags=
#JobCompHost=
#JobCompLoc=
#JobCompParams=
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/none
#JobCompUser=
#JobContainerType=job_container/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/cgroup
SlurmctldDebug=debug5
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=debug5
SlurmdLogFile=/var/log/slurmd.log
#SlurmSchedLogFile=
#SlurmSchedLogLevel=
#DebugFlags=
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
GresTypes=gpu
NodeName=cpu02 Gres=gpu:rtx_4000_sff:1 CPUs=112 Sockets=2 CoresPerSocket=28 ThreadsPerCore=2 RealMemory=772637 State=UNKNOWN

# Partitions
PartitionName=1gpu Nodes=cpu02 Default=YES MaxTime=INFINITE State=UP

(In production I would change debug5 to info.)

cgroup.conf

###
#
# Slurm cgroup support configuration file
#
# See man slurm.conf and man cgroup.conf for further
# information on cgroup configuration parameters
#--
CgroupPlugin=cgroup/v1
CgroupAutomount=yes

ConstrainCores=yes 
ConstrainDevices=yes
ConstrainRAMSpace=yes

gres.conf (on the compute node)

# GPU list
NodeName=cpu02 Name=gpu Type=rtx_4000_sff File=/dev/nvidia0

sinfo -N -o "%N %G"

NODELIST GRES
cpu02 gpu:1(S:1)

slurmctld.log

[2023-10-20T17:40:56.004] debug3: Writing job id 69 to header record of job_state file
[2023-10-20T17:40:56.054] debug2: Processing RPC: REQUEST_RESOURCE_ALLOCATION from UID=0
[2023-10-20T17:40:56.054] debug3: sched: Processing RPC: REQUEST_RESOURCE_ALLOCATION from uid=0
[2023-10-20T17:40:56.054] debug3: _set_hostname: Using auth hostname for alloc_node: jobserver
[2023-10-20T17:40:56.054] debug3: JobDesc: user_id=0 JobId=N/A partition=(null) name=neofetch
[2023-10-20T17:40:56.054] debug3:    cpus=1-4294967294 pn_min_cpus=-1 core_spec=-1
[2023-10-20T17:40:56.054] debug3:    Nodes=1-[1] Sock/Node=65534 Core/Sock=65534 Thread/Core=65534
[2023-10-20T17:40:56.054] debug3:    pn_min_memory_job=18446744073709551615 pn_min_tmp_disk=-1
[2023-10-20T17:40:56.054] debug3:    immediate=0 reservation=(null)
[2023-10-20T17:40:56.054] debug3:    features=(null) batch_features=(null) cluster_features=(null) prefer=(null)
[2023-10-20T17:40:56.054] debug3:    req_nodes=(null) exc_nodes=(null)
[2023-10-20T17:40:56.054] debug3:    time_limit=-1--1 priority=-1 contiguous=0 shared=-1
[2023-10-20T17:40:56.054] debug3:    kill_on_node_fail=-1 script=(null)
[2023-10-20T17:40:56.054] debug3:    argv="neofetch"
[2023-10-20T17:40:56.054] debug3:    stdin=(null) stdout=(null) stderr=(null)
[2023-10-20T17:40:56.054] debug3:    work_dir=/var/log alloc_node:sid=jobserver:2638
[2023-10-20T17:40:56.054] debug3:    power_flags=
[2023-10-20T17:40:56.054] debug3:    resp_host=127.0.0.1 alloc_resp_port=44165 other_port=35589
[2023-10-20T17:40:56.054] debug3:    dependency=(null) account=(null) qos=(null) comment=(null)
[2023-10-20T17:40:56.054] debug3:    mail_type=0 mail_user=(null) nice=0 num_tasks=-1 open_mode=0 overcommit=-1 acctg_freq=(null)
[2023-10-20T17:40:56.054] debug3:    network=(null) begin=Unknown cpus_per_task=-1 requeue=-1 licenses=(null)
[2023-10-20T17:40:56.054] debug3:    end_time= signal=0@0 wait_all_nodes=1 cpu_freq=
[2023-10-20T17:40:56.054] debug3:    ntasks_per_node=-1 ntasks_per_socket=-1 ntasks_per_core=-1 ntasks_per_tres=-1
[2023-10-20T17:40:56.054] debug3:    mem_bind=0:(null) plane_size:65534
[2023-10-20T17:40:56.054] debug3:    array_inx=(null)
[2023-10-20T17:40:56.054] debug3:    burst_buffer=(null)
[2023-10-20T17:40:56.054] debug3:    mcs_label=(null)
[2023-10-20T17:40:56.054] debug3:    deadline=Unknown
[2023-10-20T17:40:56.054] debug3:    bitflags=0x1e000000 delay_boot=4294967294
[2023-10-20T17:40:56.054] debug3:    TRES_per_job=gres:gpu:1
[2023-10-20T17:40:56.054] debug3: assoc_mgr_fill_in_user: found correct user: root(0)
[2023-10-20T17:40:56.054] debug5: assoc_mgr_fill_in_assoc: looking for assoc of user=root(0), acct=root, cluster=cluster1, partition=1gpu
[2023-10-20T17:40:56.054] debug3: assoc_mgr_fill_in_assoc: found correct association of user=root(0), acct=root, cluster=cluster1, partition=1gpu to assoc=2 acct=root
[2023-10-20T17:40:56.054] debug3: found correct qos
[2023-10-20T17:40:56.054] debug2: found 1 usable nodes from config containing cpu02
[2023-10-20T17:40:56.054] debug2: NodeSet for JobId=69
[2023-10-20T17:40:56.054] debug2: NodeSet[0] Nodes:cpu02 NodeWeight:1 Flags:0 FeatureBits:0 SchedWeight:511
[2023-10-20T17:40:56.054] debug3: _pick_best_nodes: JobId=69 idle_nodes 1 share_nodes 1
[2023-10-20T17:40:56.054] debug2: select/cons_tres: select_p_job_test: evaluating JobId=69
[2023-10-20T17:40:56.054] debug2: select/cons_tres: select_p_job_test: evaluating JobId=69
[2023-10-20T17:40:56.054] sched: _slurm_rpc_allocate_resources JobId=69 NodeList=(null) usec=253
[2023-10-20T17:40:58.518] debug:  sched: Running job scheduler for default depth.
[2023-10-20T17:40:58.518] debug2: found 1 usable nodes from config containing cpu02
[2023-10-20T17:40:58.518] debug2: NodeSet for JobId=69
[2023-10-20T17:40:58.518] debug2: NodeSet[0] Nodes:cpu02 NodeWeight:1 Flags:0 FeatureBits:0 SchedWeight:511
[2023-10-20T17:40:58.518] debug3: _pick_best_nodes: JobId=69 idle_nodes 1 share_nodes 1
[2023-10-20T17:40:58.518] debug2: select/cons_tres: select_p_job_test: evaluating JobId=69
[2023-10-20T17:40:58.518] debug2: select/cons_tres: select_p_job_test: evaluating JobId=69
[2023-10-20T17:40:58.518] debug3: sched: JobId=69. State=PENDING. Reason=Resources. Priority=4294901743. Partition=1gpu.
[2023-10-20T17:40:59.518] debug:  Spawning ping agent for cpu02
[2023-10-20T17:40:59.518] debug2: Spawning RPC agent for msg_type REQUEST_PING
[2023-10-20T17:40:59.518] debug2: Tree head got back 0 looking for 1
[2023-10-20T17:40:59.518] debug3: Tree sending to cpu02
[2023-10-20T17:40:59.520] debug2: Tree head got back 1
[2023-10-20T17:40:59.523] debug2: node_did_resp cpu02
[2023-10-20T17:41:01.004] debug3: create_mmap_buf: loaded file `/var/spool/slurmctld/job_state` as buf_t
[2023-10-20T17:41:01.004] debug3: Writing job id 70 to header record of job_state file
[2023-10-20T17:41:02.496] debug:  sched/backfill: _attempt_backfill: beginning
[2023-10-20T17:41:02.496] debug:  sched/backfill: _attempt_backfill: 1 jobs to backfill
[2023-10-20T17:41:02.496] debug2: sched/backfill: _attempt_backfill: entering _try_sched for JobId=69.
[2023-10-20T17:41:02.496] debug2: select/cons_tres: select_p_job_test: evaluating JobId=69
[2023-10-20T17:41:02.496] debug2: select/cons_tres: select_p_job_test: evaluating JobId=69
[2023-10-20T17:41:24.316] debug2: Processing RPC: REQUEST_JOB_INFO from UID=0
[2023-10-20T17:41:24.316] debug3: assoc_mgr_fill_in_user: found correct user: root(0)
[2023-10-20T17:41:24.316] debug2: Processing RPC: REQUEST_PARTITION_INFO from UID=0
[2023-10-20T17:41:24.316] debug2: _slurm_rpc_dump_partitions, size=221 usec=3
[2023-10-20T17:41:25.532] debug2: Testing job time limits and checkpoints
[2023-10-20T17:41:50.271] debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from UID=0
[2023-10-20T17:41:50.271] debug3: validate_node_specs: validating nodes cpu02 in state: IDLE
[2023-10-20T17:41:50.271] debug2: _slurm_rpc_node_registration complete for cpu02 usec=16
[2023-10-20T17:41:50.497] debug:  sched/backfill: _attempt_backfill: beginning
[2023-10-20T17:41:50.497] debug:  sched/backfill: _attempt_backfill: 1 jobs to backfill
[2023-10-20T17:41:50.497] debug2: sched/backfill: _attempt_backfill: entering _try_sched for JobId=69.
[2023-10-20T17:41:50.497] debug2: select/cons_tres: select_p_job_test: evaluating JobId=69
[2023-10-20T17:41:50.497] debug2: select/cons_tres: select_p_job_test: evaluating JobId=69
[2023-10-20T17:41:55.547] debug2: Testing job time limits and checkpoints
[2023-10-20T17:41:55.547] debug2: Performing purge of old job records
[2023-10-20T17:41:55.547] debug:  sched: Running job scheduler for full queue.
[2023-10-20T17:41:55.547] debug2: found 1 usable nodes from config containing cpu02
[2023-10-20T17:41:55.547] debug2: NodeSet for JobId=69
[2023-10-20T17:41:55.547] debug2: NodeSet[0] Nodes:cpu02 NodeWeight:1 Flags:0 FeatureBits:0 SchedWeight:511
[2023-10-20T17:41:55.547] debug3: _pick_best_nodes: JobId=69 idle_nodes 1 share_nodes 1
[2023-10-20T17:41:55.547] debug2: select/cons_tres: select_p_job_test: evaluating JobId=69
[2023-10-20T17:41:55.547] debug2: select/cons_tres: select_p_job_test: evaluating JobId=69
[2023-10-20T17:41:55.547] debug3: sched: JobId=69. State=PENDING. Reason=Resources. Priority=4294901743. Partition=1gpu.
[2023-10-20T17:41:56.570] debug2: Processing RPC: REQUEST_JOB_INFO from UID=0
[2023-10-20T17:41:56.570] debug3: assoc_mgr_fill_in_user: found correct user: root(0)
[2023-10-20T17:41:56.571] debug2: Processing RPC: REQUEST_PARTITION_INFO from UID=0
[2023-10-20T17:41:56.571] debug2: _slurm_rpc_dump_partitions, size=221 usec=4
[2023-10-20T17:41:57.522] debug2: Processing RPC: REQUEST_JOB_INFO from UID=0
[2023-10-20T17:41:57.522] debug3: assoc_mgr_fill_in_user: found correct user: root(0)
[2023-10-20T17:41:57.522] debug2: Processing RPC: REQUEST_PARTITION_INFO from UID=0
[2023-10-20T17:41:57.522] debug2: _slurm_rpc_dump_partitions, size=221 usec=4
[2023-10-20T17:42:15.822] debug2: Processing RPC: REQUEST_JOB_INFO from UID=0
[2023-10-20T17:42:15.822] debug3: assoc_mgr_fill_in_user: found correct user: root(0)
[2023-10-20T17:42:15.823] debug2: Processing RPC: REQUEST_PARTITION_INFO from UID=0
[2023-10-20T17:42:15.823] debug2: _slurm_rpc_dump_partitions, size=221 usec=4
[2023-10-20T17:42:20.497] debug:  sched/backfill: _attempt_backfill: beginning
[2023-10-20T17:42:20.497] debug:  sched/backfill: _attempt_backfill: 1 jobs to backfill
[2023-10-20T17:42:20.497] debug2: sched/backfill: _attempt_backfill: entering _try_sched for JobId=69.
[2023-10-20T17:42:20.497] debug2: select/cons_tres: select_p_job_test: evaluating JobId=69
[2023-10-20T17:42:20.497] debug2: select/cons_tres: select_p_job_test: evaluating JobId=69
[2023-10-20T17:42:25.563] debug2: Testing job time limits and checkpoints
[2023-10-20T17:42:55.579] debug2: Testing job time limits and checkpoints
[2023-10-20T17:42:55.579] debug2: Performing purge of old job records
[2023-10-20T17:42:55.579] debug:  sched: Running job scheduler for full queue.
[2023-10-20T17:42:55.579] debug2: found 1 usable nodes from config containing cpu02
[2023-10-20T17:42:55.579] debug2: NodeSet for JobId=69
[2023-10-20T17:42:55.579] debug2: NodeSet[0] Nodes:cpu02 NodeWeight:1 Flags:0 FeatureBits:0 SchedWeight:511
[2023-10-20T17:42:55.579] debug3: _pick_best_nodes: JobId=69 idle_nodes 1 share_nodes 1
[2023-10-20T17:42:55.579] debug2: select/cons_tres: select_p_job_test: evaluating JobId=69
[2023-10-20T17:42:55.579] debug2: select/cons_tres: select_p_job_test: evaluating JobId=69
[2023-10-20T17:42:55.579] debug3: sched: JobId=69. State=PENDING. Reason=Resources. Priority=4294901743. Partition=1gpu.
[2023-10-20T17:43:25.594] debug2: Testing job time limits and checkpoints

slurmd.log

[2023-10-20T17:41:50.258] debug3: Trying to load plugin /usr/local/lib/slurm/gres_gpu.so
[2023-10-20T17:41:50.259] debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:Gres GPU plugin type:gres/gpu version:0x170206
[2023-10-20T17:41:50.259] debug:  gres/gpu: init: loaded
[2023-10-20T17:41:50.259] debug3: Success.
[2023-10-20T17:41:50.259] debug3: _merge_gres2: From gres.conf, using gpu:rtx_4000_sff:1:/dev/nvidia0
[2023-10-20T17:41:50.259] debug3: Trying to load plugin /usr/local/lib/slurm/gpu_generic.so
[2023-10-20T17:41:50.259] debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:GPU Generic plugin type:gpu/generic version:0x170206
[2023-10-20T17:41:50.259] debug:  gpu/generic: init: init: GPU Generic plugin loaded
[2023-10-20T17:41:50.259] debug3: Success.
[2023-10-20T17:41:50.259] Gres Name=gpu Type=rtx_4000_sff Count=1 Flags=HAS_FILE,HAS_TYPE,ENV_NVML,ENV_RSMI,ENV_ONEAPI,ENV_OPENCL,ENV_DEFAULT
[2023-10-20T17:41:50.259] debug3: Trying to load plugin /usr/local/lib/slurm/topology_none.so
[2023-10-20T17:41:50.259] debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:topology NONE plugin type:topology/none version:0x170206
[2023-10-20T17:41:50.259] topology/none: init: topology NONE plugin loaded
[2023-10-20T17:41:50.259] debug3: Success.
[2023-10-20T17:41:50.259] debug3: Trying to load plugin /usr/local/lib/slurm/route_default.so
[2023-10-20T17:41:50.259] debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:route default plugin type:route/default version:0x170206
[2023-10-20T17:41:50.259] route/default: init: route default plugin loaded
[2023-10-20T17:41:50.259] debug3: Success.
[2023-10-20T17:41:50.259] debug2: Gathering cpu frequency information for 112 cpus
[2023-10-20T17:41:50.262] debug:  Resource spec: No specialized cores configured by default on this node
[2023-10-20T17:41:50.262] debug:  Resource spec: Reserved system memory limit not configured for this node
[2023-10-20T17:41:50.262] debug3: Trying to load plugin /usr/local/lib/slurm/proctrack_cgroup.so
[2023-10-20T17:41:50.262] debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:Process tracking via linux cgroup freezer subsystem type:proctrack/cgroup version:0x170206
[2023-10-20T17:41:50.263] debug3: cgroup/v1: xcgroup_create_slurm_cg: slurm cgroup /slurm successfully created for ns freezer
[2023-10-20T17:41:50.263] debug3: Success.
[2023-10-20T17:41:50.263] debug3: Trying to load plugin /usr/local/lib/slurm/task_cgroup.so
[2023-10-20T17:41:50.263] debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:Tasks containment cgroup plugin type:task/cgroup version:0x170206
[2023-10-20T17:41:50.263] debug:  task/cgroup: init: Tasks containment cgroup plugin loaded
[2023-10-20T17:41:50.263] debug3: Success.
[2023-10-20T17:41:50.263] debug3: Trying to load plugin /usr/local/lib/slurm/task_affinity.so
[2023-10-20T17:41:50.263] debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:task affinity plugin type:task/affinity version:0x170206
[2023-10-20T17:41:50.263] debug3: task/affinity: slurm_getaffinity: sched_getaffinity(0) = 0xffffffffffffffffffffffffffff
[2023-10-20T17:41:50.263] task/affinity: init: task affinity plugin loaded with CPU mask 0xffffffffffffffffffffffffffff
[2023-10-20T17:41:50.263] debug3: Success.
[2023-10-20T17:41:50.263] debug:  spank: opening plugin stack /etc/slurm/plugstack.conf
[2023-10-20T17:41:50.263] debug3: Trying to load plugin /usr/local/lib/slurm/cred_munge.so
[2023-10-20T17:41:50.263] debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:Munge credential signature plugin type:cred/munge version:0x170206
[2023-10-20T17:41:50.263] cred/munge: init: Munge credential signature plugin loaded
[2023-10-20T17:41:50.263] debug3: Success.
[2023-10-20T17:41:50.263] debug3: slurmd initialization successful
[2023-10-20T17:41:50.265] slurmd version 23.02.6 started
[2023-10-20T17:41:50.265] debug3: finished daemonize
[2023-10-20T17:41:50.265] debug3: cred_unpack: job 66 ctime:1697822916 revoked:1697822916 expires:1697823036
[2023-10-20T17:41:50.265] debug3: not appending expired job 66 state
[2023-10-20T17:41:50.265] debug3: destroying job 66 state
[2023-10-20T17:41:50.265] debug3: Trying to load plugin /usr/local/lib/slurm/acct_gather_energy_none.so
[2023-10-20T17:41:50.265] debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:AcctGatherEnergy NONE plugin type:acct_gather_energy/none version:0x170206
[2023-10-20T17:41:50.265] debug:  acct_gather_energy/none: init: AcctGatherEnergy NONE plugin loaded
[2023-10-20T17:41:50.265] debug3: Success.
[2023-10-20T17:41:50.265] debug3: Trying to load plugin /usr/local/lib/slurm/acct_gather_profile_none.so
[2023-10-20T17:41:50.265] debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:AcctGatherProfile NONE plugin type:acct_gather_profile/none version:0x170206
[2023-10-20T17:41:50.265] debug:  acct_gather_profile/none: init: AcctGatherProfile NONE plugin loaded
[2023-10-20T17:41:50.265] debug3: Success.
[2023-10-20T17:41:50.265] debug3: Trying to load plugin /usr/local/lib/slurm/acct_gather_interconnect_none.so
[2023-10-20T17:41:50.266] debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:AcctGatherInterconnect NONE plugin type:acct_gather_interconnect/none version:0x170206
[2023-10-20T17:41:50.266] debug:  acct_gather_interconnect/none: init: AcctGatherInterconnect NONE plugin loaded
[2023-10-20T17:41:50.266] debug3: Success.
[2023-10-20T17:41:50.266] debug3: Trying to load plugin /usr/local/lib/slurm/acct_gather_filesystem_none.so
[2023-10-20T17:41:50.266] debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:AcctGatherFilesystem NONE plugin type:acct_gather_filesystem/none version:0x170206
[2023-10-20T17:41:50.266] debug:  acct_gather_filesystem/none: init: AcctGatherFilesystem NONE plugin loaded
[2023-10-20T17:41:50.266] debug3: Success.
[2023-10-20T17:41:50.266] debug2: No acct_gather.conf file (/etc/slurm/acct_gather.conf)
[2023-10-20T17:41:50.266] debug3: Trying to load plugin /usr/local/lib/slurm/jobacct_gather_cgroup.so
[2023-10-20T17:41:50.266] debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:Job accounting gather cgroup plugin type:jobacct_gather/cgroup version:0x170206
[2023-10-20T17:41:50.266] debug3: cgroup/v1: xcgroup_create_slurm_cg: slurm cgroup /slurm successfully created for ns memory
[2023-10-20T17:41:50.266] debug3: cgroup/v1: common_cgroup_set_param: common_cgroup_set_param: parameter 'memory.use_hierarchy' set to '1' for '/sys/fs/cgroup/memory'
[2023-10-20T17:41:50.267] debug3: cgroup/v1: xcgroup_create_slurm_cg: slurm cgroup /slurm successfully created for ns cpuacct
[2023-10-20T17:41:50.267] debug:  jobacct_gather/cgroup: init: Job accounting gather cgroup plugin loaded
[2023-10-20T17:41:50.267] debug3: Success.
[2023-10-20T17:41:50.267] debug3: Trying to load plugin /usr/local/lib/slurm/job_container_none.so
[2023-10-20T17:41:50.267] debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:job_container none plugin type:job_container/none version:0x170206
[2023-10-20T17:41:50.267] debug:  job_container/none: init: job_container none plugin loaded
[2023-10-20T17:41:50.267] debug3: Success.
[2023-10-20T17:41:50.267] debug3: Trying to load plugin /usr/local/lib/slurm/prep_script.so
[2023-10-20T17:41:50.267] debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:Script PrEp plugin type:prep/script version:0x170206
[2023-10-20T17:41:50.267] debug3: Success.
[2023-10-20T17:41:50.267] debug3: Trying to load plugin /usr/local/lib/slurm/core_spec_none.so
[2023-10-20T17:41:50.267] debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:Null core specialization plugin type:core_spec/none version:0x170206
[2023-10-20T17:41:50.267] debug3: Success.
[2023-10-20T17:41:50.267] debug3: Trying to load plugin /usr/local/lib/slurm/switch_none.so
[2023-10-20T17:41:50.268] debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:switch NONE plugin type:switch/none version:0x170206
[2023-10-20T17:41:50.268] debug:  switch/none: init: switch NONE plugin loaded
[2023-10-20T17:41:50.268] debug3: Success.
[2023-10-20T17:41:50.268] debug3: Trying to load plugin /usr/local/lib/slurm/switch_cray_aries.so
[2023-10-20T17:41:50.268] debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:switch Cray/Aries plugin type:switch/cray_aries version:0x170206
[2023-10-20T17:41:50.268] debug:  switch Cray/Aries plugin loaded.
[2023-10-20T17:41:50.268] debug3: Success.
[2023-10-20T17:41:50.268] debug:  MPI: Loading all types
[2023-10-20T17:41:50.268] debug3: Trying to load plugin /usr/local/lib/slurm/mpi_cray_shasta.so
[2023-10-20T17:41:50.268] debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:mpi Cray Shasta plugin type:mpi/cray_shasta version:0x170206
[2023-10-20T17:41:50.268] debug3: Success.
[2023-10-20T17:41:50.268] debug3: Trying to load plugin /usr/local/lib/slurm/mpi_pmi2.so
[2023-10-20T17:41:50.268] debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:mpi PMI2 plugin type:mpi/pmi2 version:0x170206
[2023-10-20T17:41:50.268] debug3: Success.
[2023-10-20T17:41:50.268] debug3: Trying to load plugin /usr/local/lib/slurm/mpi_none.so
[2023-10-20T17:41:50.268] debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:mpi none plugin type:mpi/none version:0x170206
[2023-10-20T17:41:50.268] debug3: Success.
[2023-10-20T17:41:50.268] debug2: No mpi.conf file (/etc/slurm/mpi.conf)
[2023-10-20T17:41:50.268] debug3: Successfully opened slurm listen port 6818
[2023-10-20T17:41:50.268] slurmd started on Fri, 20 Oct 2023 17:41:50 +0000
[2023-10-20T17:41:50.269] CPUs=112 Boards=1 Sockets=2 Cores=28 Threads=2 Memory=772637 TmpDisk=1855467 Uptime=77781 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
[2023-10-20T17:41:50.270] debug:  _handle_node_reg_resp: slurmctld sent back 11 TRES.
[2023-10-20T17:41:50.270] debug3: _registration_engine complete
[2023-10-20T17:44:19.623] debug3: in the service_connection
[2023-10-20T17:44:19.623] debug2: Start processing RPC: REQUEST_PING
[2023-10-20T17:44:19.623] debug2: Processing RPC: REQUEST_PING
[2023-10-20T17:44:19.624] debug2: Finish processing RPC: REQUEST_PING 

There are no jobs running on cpu02, so its resources are available. When a GPU is requested, the controller never seems to contact the compute node at all.
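
While the GPU job sits in the queue, the pending reason and the GRES the controller has registered can be checked with standard commands such as these (output omitted; JobId 69 is the job from the log above):

squeue -j 69 -o "%i %T %r"     # shows JobId 69 PENDING with Reason=Resources
scontrol show node cpu02       # inspect the Gres= and CfgTRES= lines the controller registered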

Any help in solving this would be greatly appreciated.

Answer 1

The fix was to remove the type designation for the GPU (Type=rtx_4000_sff).
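
As a minimal sketch of what that change could look like, assuming the type is dropped consistently from both files and the daemons are restarted afterwards:

# gres.conf on cpu02
NodeName=cpu02 Name=gpu File=/dev/nvidia0

# slurm.conf on all nodes
NodeName=cpu02 Gres=gpu:1 CPUs=112 Sockets=2 CoresPerSocket=28 ThreadsPerCore=2 RealMemory=772637 State=UNKNOWN

After editing, restarting slurmctld and slurmd is the safest way to pick up the GRES change; the job can then request the GPU with --gres=gpu:1 or --gpus=1.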
