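My supercomputing center recently migrated from SGE to PBS/Torque. Now, when I submit an array job, only half of the jobs in the array get scheduled. Once those finish, the other half get scheduled. This happens even though the jobs themselves show high utilization.
For example, I just submitted an array of 10 jobs. Here is the qstat output 10 minutes later: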
[myuserna@sub ~]$ qstat -t
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
3100[1].systemm2          ...-to-work.sh-1 myuserna        00:07:40 R short
3100[2].systemm2          ...-to-work.sh-2 myuserna        00:07:32 R short
3100[3].systemm2          ...-to-work.sh-3 myuserna        00:09:55 R short
3100[4].systemm2          ...-to-work.sh-4 myuserna        00:09:44 R short
3100[5].systemm2          ...-to-work.sh-5 myuserna        00:09:07 R short
3100[6].systemm2          ...-to-work.sh-6 myuserna               0 Q short
3100[7].systemm2          ...-to-work.sh-7 myuserna               0 Q short
3100[8].systemm2          ...-to-work.sh-8 myuserna               0 Q short
3100[9].systemm2          ...-to-work.sh-9 myuserna               0 Q short
3100[10].systemm2         ...to-work.sh-10 myuserna               0 Q short
[myuserna@sub ~]$
Any clues on how to fix the scheduler?
Here are the relevant parts of the scheduler configuration:
create queue short
set queue short queue_type = Execution
set queue short Priority = 10000
set queue short max_user_queuable = 500
set queue short max_running = 200
set queue short resources_max.walltime = 24:00:00
set queue short resources_default.nodes = 1
set queue short max_user_run = 50
set queue short enabled = True
set queue short started = True
#
#
# Set server attributes.
#
set server scheduling = True
set server acl_hosts = systemm2
set server acl_roots = root@*
set server managers = [email protected]
set server operators = [email protected]
set server default_queue = route
set server log_events = 511
set server mail_from = adm
set server resources_default.walltime = 01:00:00
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 6
set server mom_job_sync = True
set server keep_completed = 300
set server submit_hosts = submit-1
set server submit_hosts += submit-0
set server auto_node_np = True
set server next_job_number = 6217
set server max_job_array_size = 512
set server max_slot_limit = 5
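Answer 1
Check with your administrator. The number of slots each user may use per queue can be limited.
Update: OK, now that you have updated the question to show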
set server max_slot_limit = 5
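I'm confident that answers the question: max_slot_limit caps how many sub-jobs of a single array may run at the same time, which matches the five running and five queued jobs in your qstat output.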
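
As a rough sketch of how this could be checked and changed, assuming you (or your administrator) have manager rights on the Torque server: the value 10 and the script name my_job.sh below are only illustrative and do not come from the question.

```shell
# Show the current server-wide cap on concurrently running array sub-jobs.
qmgr -c 'print server' | grep max_slot_limit

# Raise the cap (needs manager privileges); 10 is an illustrative value.
qmgr -c 'set server max_slot_limit = 10'

# Or remove the server-wide cap entirely.
qmgr -c 'unset server max_slot_limit'

# A user can also request a per-array slot limit at submission time with
# the %N suffix; my_job.sh is a hypothetical script name. A request above
# the server's max_slot_limit is expected to be rejected.
qsub -t 1-10%5 my_job.sh
```

Arrays submitted before the change may keep the slot limit they were given at submission, so resubmitting the array is the safest way to confirm the new value took effect.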