Torque PBS 4.0.1: jobs stay queued ("Q"); the scheduler never seems to be notified

I am running Torque 4.0.1 on openSUSE 12.1 in a cluster environment. When I qsub a job (as simple as "echo hello"), it stays in the "Q" state and is never scheduled. I can force the job to run with qrun, and it executes on the first node without errors.
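For reference, this is the exact reproduction; the job id (16.head in my case) is whatever qsub prints back:

```shell
echo hello | qsub    # prints a job id such as 16.head
qstat                # the job sits in state "Q" and never leaves it
qrun 16.head         # forcing it as a manager runs it fine on the first node
```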

I have been hunting for a solution for days without success. I have read the manuals, the logs, and even the source code, but I still cannot find the problem. Naturally, I have also googled extensively and tried various suggested fixes, none of which worked.

Here is some information that might be useful:

  • pbs_sched is running, but its log suggests it never receives notification of queued jobs:

    05/13/2012 18:55:08;0002; pbs_sched;Svr;Log;Log opened
    05/13/2012 18:55:08;0002; pbs_sched;Svr;TokenAct;Account file /var/spool/torque/sched_priv/accounting/20120513 opened
    05/13/2012 18:55:08;0002; pbs_sched;Svr;main;pbs_sched startup pid 32604
  • The pbs_server log only shows the job being enqueued into the default queue, batch:

    05/13/2012 19:33:08;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 4.0.1, loglevel = 0
    05/13/2012 19:33:56;0100;PBS_Server;Job;16.head;enqueuing into batch, state 1 hop 1
    05/13/2012 19:33:56;0008;PBS_Server;Job;16.head;Job Queued at request of pubuser@head, owner = pubuser@head, job name = STDIN, queue = batch
  • qstat -f 16 shows nothing obviously wrong:

    Job Id: 16.head
    Job_Name = STDIN
    Job_Owner = pubuser@head
    job_state = Q
    queue = batch
    server = head
    Checkpoint = u
    ctime = Sun May 13 19:33:56 2012
    Error_Path = head:/fserver/home/pubuser/STDIN.e16
    Hold_Types = n
    Join_Path = n
    Keep_Files = n
    Mail_Points = a
    mtime = Sun May 13 19:33:56 2012
    Output_Path = head:/fserver/home/pubuser/STDIN.o16
    Priority = 0
    qtime = Sun May 13 19:33:56 2012
    Rerunable = True
    Resource_List.walltime = 01:00:00
    substate = 10
    Variable_List = PBS_O_QUEUE=batch,PBS_O_HOME=/,
        PBS_O_WORKDIR=/fserver/home/pubuser,PBS_O_HOST=head,PBS_O_SERVER=head,
        PBS_O_WORKDIR=/fserver/home/pubuser
    euser = pubuser
    egroup = users
    queue_rank = 4
    queue_type = E
    etime = Sun May 13 19:33:56 2012
    fault_tolerant = False
    job_radix = 0
    submit_host = head
    init_work_dir = /fserver/home/pubuser
  • All nodes are free:

    sun1
         state = free
         np = 2
         ntype = cluster
         status = rectime=1336910403,varattr=,jobs=,state=free,netload=44492032184,gres=,loadave=0.00,ncpus=2,physmem=888188kb,availmem=1697420kb,totmem=1802616kb,idletime=241085,nusers=0,nsessions=0,uname=Linux sun1 3.1.0-1.2-desktop #1 SMP PREEMPT Thu Nov 3 14:45:45 UTC 2011 (187dde0) x86_64,opsys=linux
         mom_service_port = 15002
         mom_manager_port = 15003
         gpus = 0

    sun2
         state = free
         np = 2
         ntype = cluster
         status = rectime=1336910408,varattr=,jobs=,state=free,netload=39762812881,gres=,loadave=0.00,ncpus=2,physmem=888188kb,availmem=1701012kb,totmem=1802616kb,idletime=239982,nusers=0,nsessions=0,uname=Linux sun2 3.1.0-1.2-desktop #1 SMP PREEMPT Thu Nov 3 14:45:45 UTC 2011 (187dde0) x86_64,opsys=linux
         mom_service_port = 15002
         mom_manager_port = 15003
         gpus = 0

    sun3
         state = free
         np = 2
         ntype = cluster
         status = rectime=1336910400,varattr=,jobs=,state=free,netload=45984311925,gres=,loadave=0.00,ncpus=2,physmem=888188kb,availmem=1699772kb,totmem=1802616kb,idletime=212303,nusers=0,nsessions=0,uname=Linux sun3 3.1.0-1.2-desktop #1 SMP PREEMPT Thu Nov 3 14:45:45 UTC 2011 (187dde0) x86_64,opsys=linux
         mom_service_port = 15002
         mom_manager_port = 15003
         gpus = 0

    sun4
         state = free
         np = 2
         ntype = cluster
         status = rectime=1336910407,varattr=,jobs=,state=free,netload=37538584401,gres=,loadave=0.00,ncpus=2,physmem=888188kb,availmem=1805480kb,totmem=1908308kb,idletime=211197,nusers=0,nsessions=0,uname=Linux sun4 3.1.0-1.2-desktop #1 SMP PREEMPT Thu Nov 3 14:45:45 UTC 2011 (187dde0) x86_64,opsys=linux
         mom_service_port = 15002
         mom_manager_port = 15003
         gpus = 0

    sun5
         state = free
         np = 2
         ntype = cluster
         status = rectime=1336910411,varattr=,jobs=,state=free,netload=173547166,gres=,loadave=0.00,ncpus=2,physmem=888188kb,availmem=1803816kb,totmem=1908308kb,idletime=211199,nusers=0,nsessions=0,uname=Linux sun5 3.1.0-1.2-desktop #1 SMP PREEMPT Thu Nov 3 14:45:45 UTC 2011 (187dde0) x86_64,opsys=linux
         mom_service_port = 15002
         mom_manager_port = 15003
         gpus = 0

    sun6
         state = free
         np = 2
         ntype = cluster
         status = rectime=1336910411,varattr=,jobs=,state=free,netload=24641446,gres=,loadave=0.00,ncpus=2,physmem=888188kb,availmem=1805704kb,totmem=1908308kb,idletime=212999,nusers=0,nsessions=0,uname=Linux sun6 3.1.0-1.2-desktop #1 SMP PREEMPT Thu Nov 3 14:45:45 UTC 2011 (187dde0) x86_64,opsys=linux
         mom_service_port = 15002
         mom_manager_port = 15003
         gpus = 0

    sun7
         state = free
         np = 2
         ntype = cluster
         status = rectime=1336910412,varattr=,jobs=,state=free,netload=1548383055,gres=,loadave=0.00,ncpus=2,physmem=888188kb,availmem=1805432kb,totmem=1908308kb,idletime=215630,nusers=0,nsessions=0,uname=Linux sun7 3.1.0-1.2-desktop #1 SMP PREEMPT Thu Nov 3 14:45:45 UTC 2011 (187dde0) x86_64,opsys=linux
         mom_service_port = 15002
         mom_manager_port = 15003
         gpus = 0

    sun8
         state = free
         np = 2
         ntype = cluster
         status = rectime=1336910400,varattr=,jobs=,state=free,netload=128755968,gres=,loadave=0.00,ncpus=2,physmem=888188kb,availmem=1803448kb,totmem=1908308kb,idletime=211866,nusers=0,nsessions=0,uname=Linux sun8 3.1.0-1.2-desktop #1 SMP PREEMPT Thu Nov 3 14:45:45 UTC 2011 (187dde0) x86_64,opsys=linux
         mom_service_port = 15002
         mom_manager_port = 15003
         gpus = 0

    sun9
         state = free
         np = 2
         ntype = cluster
         status = rectime=1336910374,varattr=,jobs=,state=free,netload=1371896399,gres=,loadave=0.00,ncpus=2,physmem=888188kb,availmem=1805664kb,totmem=1908308kb,idletime=211161,nusers=0,nsessions=0,uname=Linux sun9 3.1.0-1.2-desktop #1 SMP PREEMPT Thu Nov 3 14:45:45 UTC 2011 (187dde0) x86_64,opsys=linux
         mom_service_port = 15002
         mom_manager_port = 15003
         gpus = 0
  • Output of qmgr -c 'p s':

    #
    # Create queues and set their attributes.
    #
    #
    # Create and define queue batch
    #
    create queue batch
    set queue batch queue_type = Execution
    set queue batch resources_default.walltime = 01:00:00
    set queue batch enabled = True
    set queue batch started = True
    #
    # Set server attributes.
    #
    set server scheduling = True
    set server acl_hosts = head
    set server managers = pubuser@head
    set server managers += root@head
    set server operators = pubuser@head
    set server operators += root@head
    set server default_queue = batch
    set server log_events = 511
    set server mail_from = adm
    set server scheduler_iteration = 600
    set server node_check_rate = 150
    set server tcp_timeout = 300
    set server job_stat_rate = 45
    set server poll_jobs = True
    set server mom_job_sync = True
    set server keep_completed = 0
    set server submit_hosts = head
    set server next_job_number = 17
    set server moab_array_compatible = True
  • momctl -d 13 on the first node:

    Host: sun1/sun1   Version: 4.0.1   PID: 5362
    Server[0]: head (192.168.0.1:15001)
      Last Msg From Server:   1584 seconds (DeleteJob)
      Last Msg To Server:     7 seconds
    HomeDirectory:          /var/spool/torque/mom_priv
    stdout/stderr spool directory: '/var/spool/torque/spool/' (4457492 blocks available)
    MOM active:             229485 seconds
    Check Poll Time:        45 seconds
    Server Update Interval: 45 seconds
    LogLevel:               0 (use SIGUSR1/SIGUSR2 to adjust)
    Communication Model:    TCP
    MemLocked:              TRUE  (mlock)
    TCP Timeout:            0 seconds
    Trusted Client List:  127.0.0.1:0,192.168.0.1:0,192.168.0.101:0,192.168.0.101:15003,192.168.0.102:15003,192.168.0.103:15003,192.168.0.104:15003,192.168.0.105:15003,192.168.0.106:15003,192.168.0.107:15003,192.168.0.108:15003,192.168.0.109:15003:  0
    Copy Command:           /usr/bin/scp -rpB
    NOTE:  no local jobs detected

    diagnostics complete

The odd part is that TCP Timeout is 0 seconds, which does not look right. While diagnosing I also found the following entry in mom_logs:

    05/13/2012 20:30:10;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::Resource temporarily unavailable (11) in tcp_read_proto_version, no protocol version number End of File (errno 2)

I googled it but found nothing.
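As I understand it, that error means something connected to the MOM and closed the connection before sending a protocol header, so one thing I tried to rule out is basic server/MOM connectivity. These are hedged sanity checks rather than a known fix; the host names and port numbers are simply the ones from the momctl output above:

```shell
# Can the head node reach the MOM service port on sun1?
echo | nc -w 2 sun1 15002
# Can a compute node reach pbs_server on the head node?
echo | nc -w 2 head 15001
# Query the MOM's diagnostics remotely from the head node:
momctl -d 3 -h sun1
```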

  • I compiled OpenMPI against this Torque 4.0.1 (for tm support), and I can run test programs without problems.
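One more thing I tried, in case it helps anyone reproduce this: if my reading of the Torque admin guide is right, re-setting scheduling to true should make pbs_server start a scheduling iteration immediately, and pbs_sched should be listening on port 15004 by default. Neither check turned anything up for me:

```shell
# Nudge pbs_server into an immediate scheduling cycle:
qmgr -c 'set server scheduling = true'
# Verify pbs_sched is actually listening on its default port (15004):
netstat -lntp | grep 15004
```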

I hope someone can shed light on this. Thanks!