End users report Slurm jobs being killed abruptly, with almost nothing in the logs to explain why

I'm getting a bunch of messages like the following in the Slurm logs:

[2023-11-16T10:03:53.952] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=50461580 uid 1900007651
[2023-11-16T10:03:53.958] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=50521673_[2-183] uid 1900007651
[2023-11-16T10:03:53.958] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=50523246_[1-198] uid 1900007651
[2023-11-16T10:03:53.958] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=50522320_[278-377] uid 1900007651
[2023-11-16T10:03:53.958] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=50522600_[1-377] uid 1900007651
[2023-11-16T10:03:53.958] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=50527650_[13-71] uid 1900007651
[2023-11-16T10:03:53.958] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=50521231_[1-377] uid 1900007651
[2023-11-16T10:03:53.958] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=50463285_45214 uid 1900007651
[2023-11-16T10:03:53.958] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=50523630_[1] uid 1900007651
[2023-11-16T10:03:53.958] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=50519949_[156-377] uid 1900007651
[2023-11-16T10:03:53.962] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=50463285_45215 uid 1900007651
[2023-11-16T10:03:53.962] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=50463285_45216 uid 1900007651
[2023-11-16T10:03:53.963] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=50463285_45217 uid 1900007651
[2023-11-16T10:03:53.963] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=50463285_45219 uid 1900007651
[2023-11-16T10:03:53.963] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=50463285_45220 uid 1900007651
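
To get an overview of which jobs these requests touched and which uid appears on them, the lines can be tallied with a small script. This is only a rough sketch: the function name `summarize_kill_requests` and the regular expression are my own, written against the line format shown above, and I am assuming (not certain) that the uid printed on each line is the account that issued the kill request, which could then be resolved with something like `getent passwd <uid>`.

```python
import re
from collections import defaultdict

# Pattern written against the log lines shown above (an assumption about the
# format, not an official Slurm specification), e.g.:
# [2023-11-16T10:03:53.958] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=50521673_[2-183] uid 1900007651
KILL_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+_slurm_rpc_kill_job:\s+REQUEST_KILL_JOB\s+"
    r"JobId=(?P<job>\S+)\s+uid\s+(?P<uid>\d+)\s*$"
)


def summarize_kill_requests(lines):
    """Group (timestamp, job ID) pairs from REQUEST_KILL_JOB lines by uid.

    Lines that do not match the pattern (for example the accrue_cnt
    errors) are skipped.
    """
    by_uid = defaultdict(list)
    for line in lines:
        m = KILL_RE.match(line)
        if m:
            by_uid[m.group("uid")].append((m.group("ts"), m.group("job")))
    return dict(by_uid)


# A few lines copied from the log excerpt in this post.
sample = [
    "[2023-11-16T10:03:53.952] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=50461580 uid 1900007651",
    "[2023-11-16T10:03:53.958] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=50521673_[2-183] uid 1900007651",
    "[2023-11-16T10:03:53.963] error: _remove_accrue_time_internal: QOS memlim user 1900008578 accrue_cnt underflow",
]

for uid, requests in summarize_kill_requests(sample).items():
    print(uid, len(requests), [job for _, job in requests])
```

In practice the `sample` list would be replaced by the lines of the controller log file, read with `open(...)`.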

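The end user claims these jobs were killed after sitting for two days. Beyond that, I can't find much information about why they were killed. The only other thing I can find in the logs around that time is something about accrue time, but that isn't part of our QoS, and as far as I know it can be ignored if it isn't used as part of the QoS.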

[2023-11-16T10:03:53.963] error: _remove_accrue_time_internal: QOS memlim acct qsg accrue_cnt underflow
[2023-11-16T10:03:53.963] error: _remove_accrue_time_internal: QOS memlim user 1900008578 accrue_cnt underflow

Any ideas as to why these jobs seem to have been killed so abruptly?
