Ubuntu 18.04: process hangs before terminating

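When using ABAQUS 6.14 (and also ABAQUS 2018) on Ubuntu 18.04, everything works except the termination of the process standard (which performs the implicit analysis; it doesn't matter if you are not familiar with this).

The analysis itself does work: the message THE ANALYSIS HAS COMPLETED SUCCESSFULLY appears in the log file (the .sta file, for those familiar with Abaqus), and the output database contains the analysis results. However, after the analysis has completed, the standard process stays in a sleeping state, using 0% CPU and holding the same amount of RAM as it did while running.

With strace I get: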

[pid 23191] close(8)                    = 0
[pid 23185] <... select resumed> )      = 0 (Timeout)
[pid 23185] select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=50000} <unfinished ...>
[pid 23193] <... select resumed> )      = 0 (Timeout)
[pid 23193] futex(0x7f3acd917db0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 23191] futex(0x7f3acd917db0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 23193] <... futex resumed> )       = 0
[pid 23191] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
[pid 23191] futex(0x7f3acd917db0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 23193] select(7, [4 5 6], NULL, NULL, {tv_sec=0, tv_usec=20000} <unfinished ...>
[pid 23191] munmap(0x7f3ab130b000, 327680) = 0
[pid 23191] munmap(0x7f3ab136b000, 1114112) = 0
[pid 23191] munmap(0x7f3ab16db000, 1114112) = 0
[pid 23191] munmap(0x7f3ab0fbb000, 1114112) = 0
[pid 23191] munmap(0x7f3ab0ddb000, 1114112) = 0
[pid 23191] munmap(0x7f3ab0a0b000, 1114112) = 0
[pid 23191] munmap(0x7f3ab03fb000, 1114112) = 0
[pid 23191] munmap(0x7f3ab050b000, 1114112) = 0
[pid 23191] munmap(0x7f3ab00cb000, 1114112) = 0
[pid 23191] munmap(0x7f3ab02eb000, 1114112) = 0
[pid 23191] munmap(0x7f3ab14eb000, 1114112) = 0
[pid 23191] futex(0x7f3ab8a5dd44, FUTEX_WAIT_PRIVATE, 8, NULL) = -1 EAGAIN (Resource temporarily unavailable)
[pid 23191] futex(0x7f3ab8a5dd44, FUTEX_WAIT_PRIVATE, 12, NULL <unfinished ...>
[pid 23193] <... select resumed> )      = 0 (Timeout)
[pid 23193] select(7, [4 5 6], NULL, NULL, {tv_sec=0, tv_usec=20000}) = 0 (Timeout)
[pid 23193] select(7, [4 5 6], NULL, NULL, {tv_sec=0, tv_usec=20000} <unfinished ...>
[pid 23185] <... select resumed> )      = 0 (Timeout)
[pid 23185] select(10, [5 6 8 9], NULL, NULL, {tv_sec=0, tv_usec=20000} <unfinished ...>
[pid 23193] <... select resumed> )      = 0 (Timeout)
[pid 23193] select(7, [4 5 6], NULL, NULL, {tv_sec=0, tv_usec=20000} <unfinished ...>
[pid 23185] <... select resumed> )      = 0 (Timeout)
[pid 23185] select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=50000} <unfinished ...>
[pid 23193] <... select resumed> )      = 0 (Timeout)
[pid 23193] select(7, [4 5 6], NULL, NULL, {tv_sec=0, tv_usec=20000}) = 0 (Timeout)
[pid 23193] select(7, [4 5 6], NULL, NULL, {tv_sec=0, tv_usec=20000} <unfinished ...>
[pid 23185] <... select resumed> )      = 0 (Timeout)
[pid 23185] select(10, [5 6 8 9], NULL, NULL, {tv_sec=0, tv_usec=20000} <unfinished ...>
[pid 23193] <... select resumed> )      = 0 (Timeout)
[pid 23193] select(7, [4 5 6], NULL, NULL, {tv_sec=0, tv_usec=20000} <unfinished ...>
[pid 23185] <... select resumed> )      = 0 (Timeout)
[pid 23185] select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=50000} <unfinished ...>
[pid 23193] <... select resumed> )      = 0 (Timeout)
[pid 23193] select(7, [4 5 6], NULL, NULL, {tv_sec=0, tv_usec=20000}) = 0 (Timeout)
[pid 23193] select(7, [4 5 6], NULL, NULL, {tv_sec=0, tv_usec=20000}) = 0 (Timeout)
[pid 23193] select(7, [4 5 6], NULL, NULL, {tv_sec=0, tv_usec=20000} <unfinished ...>

It looks as if two processes are deadlocked. In addition, the commands

pid -p 7002

pid -p 7010

give empty output. The directories /proc/7002 and /proc/7010 do not exist.

The only Abaqus-related processes still running are
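The same check for whether those PIDs still exist can be scripted. The following is only a minimal sketch: `pid_is_alive` is a hypothetical helper name, and it assumes a Linux system with procfs mounted at /proc.

```python
import os


def pid_is_alive(pid, proc_root="/proc"):
    """Return True if a directory for `pid` exists under proc_root.

    Assumption: Linux with procfs mounted at proc_root.
    """
    return os.path.isdir(os.path.join(proc_root, str(pid)))


if __name__ == "__main__":
    # PIDs taken from the question; they will differ on another machine.
    for pid in (7002, 7010):
        print(pid, "alive" if pid_is_alive(pid) else "gone")
```

The `proc_root` parameter exists only so the helper can be pointed at a different directory for testing.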

david  6995  0.0  0.1 295428 51388 pts/0    S    17:00   0:00 /opt/abaqus/6.14-1/code/bin/python /opt/abaqus/6.14-1
david  6998  0.0  0.2 368744 97948 pts/0    S    17:00   0:00 /opt/abaqus/6.14-1/code/bin/python std_inst.com
david  7001  0.1  0.0 122076 20096 pts/0    Sl   17:00   0:03 /opt/abaqus/6.14-1/code/bin/eliT_DriverLM -job std_in
david  7008  0.4  0.5 735812 185364 pts/0   Sl   17:00   0:07 /opt/abaqus/6.14-1/code/bin/standard -standard -acade

On Ubuntu 16.04, the exact same version runs fine. Here is the strace output on Ubuntu 16.04 (same kernel version as on my 18.04 system, namely 4.15.0-29):

3890  close(8)                          = 0
3892  <... select resumed> )            = 0 (Timeout)
3892  futex(0x7f29e43e1db0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
3890  futex(0x7f29e43e1db0, FUTEX_WAKE_PRIVATE, 1) = 0
3892  <... futex resumed> )             = -1 EAGAIN (Resource temporarily unavailable)
3892  futex(0x7f29e43e1db0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
3890  futex(0x7f29e43e1db0, FUTEX_WAKE_PRIVATE, 1) = 0
3892  <... futex resumed> )             = -1 EAGAIN (Resource temporarily unavailable)
3892  futex(0x7f29e43e1db0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
3890  futex(0x7f29e43e1db0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
3892  <... futex resumed> )             = 0
3890  <... futex resumed> )             = -1 EAGAIN (Resource temporarily unavailable)
3890  futex(0x7f29e43e1db0, FUTEX_WAKE_PRIVATE, 1) = 0
3892  select(7, [4 5 6], NULL, NULL, {0, 20000} <unfinished ...>
3890  munmap(0x7f29c7adb000, 327680)    = 0
3890  munmap(0x7f29c7b3b000, 1114112)   = 0
3890  munmap(0x7f29c7eab000, 1114112)   = 0
3890  munmap(0x7f29c778b000, 1114112)   = 0
3890  munmap(0x7f29c75ab000, 1114112)   = 0
3890  munmap(0x7f29c71db000, 1114112)   = 0
3890  munmap(0x7f29c6bcb000, 1114112)   = 0
3890  munmap(0x7f29c6cdb000, 1114112)   = 0
3890  munmap(0x7f29c689b000, 1114112)   = 0
3890  munmap(0x7f29c6abb000, 1114112)   = 0
3890  munmap(0x7f29c7cbb000, 1114112)   = 0
3890  exit_group(0)                     = ?
3891  +++ exited with 0 +++
3893  +++ exited with 0 +++
3892  +++ exited with 0 +++
3890  +++ exited with 0 +++
3880  <... wait4 resumed> [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 3890
3880  --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=3890, si_uid=1000, si_status=0, si_utime=107, si_stime=7} ---
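To compare two traces like these without reading them line by line, a small script can count syscalls per PID and list which PIDs ever logged an exit. This is a rough sketch, not a full strace parser: it assumes only the two line prefixes seen in the traces here (`[pid N] ...` and `N  ...`), and `summarize_strace` is a hypothetical name.

```python
import re
from collections import Counter, defaultdict

# Matches "[pid 23191] rest" as well as "3890  rest".
LINE_RE = re.compile(r"^(?:\[pid\s+(\d+)\]|(\d+))\s+(.*)$")
# Matches the start of a syscall, e.g. "futex(" or "select(".
CALL_RE = re.compile(r"^(\w+)\(")


def summarize_strace(lines):
    """Return (calls, exited): per-PID syscall counts and the set of exited PIDs."""
    calls = defaultdict(Counter)
    exited = set()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        pid = int(m.group(1) or m.group(2))
        rest = m.group(3)
        if rest.startswith("+++ exited"):
            exited.add(pid)
            continue
        call = CALL_RE.match(rest)
        if call:  # "<... resumed>" lines are skipped to avoid double counting
            calls[pid][call.group(1)] += 1
    return calls, exited


# A few lines copied from the two traces above.
sample = [
    "[pid 23191] futex(0x7f3ab8a5dd44, FUTEX_WAIT_PRIVATE, 12, NULL <unfinished ...>",
    "[pid 23193] select(7, [4 5 6], NULL, NULL, {tv_sec=0, tv_usec=20000}) = 0 (Timeout)",
    "[pid 23193] <... select resumed> )      = 0 (Timeout)",
    "3890  exit_group(0)                     = ?",
    "3890  +++ exited with 0 +++",
]
calls, exited = summarize_strace(sample)
for pid in sorted(calls):
    state = "exited" if pid in exited else "no exit seen"
    print(pid, state, dict(calls[pid]))
```

On the full traces, the PIDs that never reach `+++ exited` and keep looping on select or futex are the ones worth a closer look.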

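Does anyone have a good idea how to solve this, or a suggestion for which direction I should investigate further?

Answer 1

I found a solution that avoids the deadlock by using a Singularity container, as proposed by Will Furnass here: http://learningpatterns.me/posts-output/2018-01-30-abaqus-singularity/

It is a bit complicated at first, but once set up correctly it works like a charm. On the host system (Manjaro/Arch Linux) I modified the abaqus aliases so that they point to the installation inside the Singularity container and run the commands in the container environment. However, since I need the Intel Fortran compiler, I generated a basic CentOS 7 container and then modified it to install the compiler and Abaqus (v2019 in this case), rather than using the .def script suggested by Will Furnass.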
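As an alternative to shell aliases, the redirection into the container could also be done with a small Python wrapper around `singularity exec`. This is only a sketch under assumptions: the image path and job name are made up, `singularity_command` is a hypothetical helper, and it assumes the Abaqus launcher inside the container is called `abaqus`.

```python
import shlex


def singularity_command(image, abaqus_args, abaqus_exe="abaqus"):
    """Build the argv list for running Abaqus inside a Singularity image."""
    return ["singularity", "exec", image, abaqus_exe] + list(abaqus_args)


if __name__ == "__main__":
    # Hypothetical image path and job name, for illustration only.
    argv = singularity_command("/opt/containers/abaqus-2019.simg",
                               ["job=Job-1", "interactive"])
    print(" ".join(shlex.quote(a) for a in argv))
    # To actually run it, pass argv to subprocess.call(argv).
```

Building the command as a list rather than a single string avoids shell quoting problems when job names or paths contain spaces.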

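The setup takes some time, but now I have a container image that I can use on any system that runs Singularity, which is very nice :)

Edit: I also tested copying a working installation to a newer Linux system (to avoid a fresh Abaqus install), and I can confirm that this does not work in my case (CentOS 7 installation copied to a Manjaro system).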

Answer 2

Dassault Systèmes released a bug fix this month:

You need to update Abaqus 2018 to Abaqus 2018-HF16 via https://software.3ds.com/ (more details at https://github.com/willfurnass/abaqus-2017-centos-7-singularity/issues/5#issue-713025844).

I tried updating Abaqus 2020 to Abaqus 2020-HF5, and it works on Ubuntu 20.04 and Fedora 32.

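Answer 3

I would like to share my workaround for this problem. I made a Python wrapper for the abq2018 solver that checks the .sta file for completion. Once the .sta file is complete, any process named standard is killed. I found that when standard is killed after the analysis has finished, the solver exits normally.

This workaround is not a perfect solution. Its current issues:

  1. It is not a drop-in replacement for the abq2018 solver call
  2. It cannot be run from the GUI; it must be run from a shell
  3. Only the job= argument is parsed
  4. Only one analysis can run at a time, because all standard processes are killed
  5. If the .sta file is never created or modified, abq hangs forever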
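The last issue in the list could be softened by giving the polling loop a deadline. The following is a sketch of that idea, not part of the original wrapper: `wait_for_sta` and its default timeout are assumptions, and the two termination strings are the ones the wrapper script checks for.

```python
import time

TERMINATION_LINES = (
    "THE ANALYSIS HAS COMPLETED SUCCESSFULLY",
    "THE ANALYSIS HAS NOT BEEN COMPLETED",
)


def wait_for_sta(sta_path, timeout=3600.0, poll=5.0):
    """Poll sta_path until a termination line appears or the timeout expires.

    Returns the matching line, or None if the timeout was reached first.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            with open(sta_path) as f:
                # Ignore blank lines so a trailing newline does not hide the status.
                lines = [ln.strip() for ln in f if ln.strip()]
        except OSError:
            # The .sta file does not exist (yet) or cannot be read.
            lines = []
        if lines and lines[-1] in TERMINATION_LINES:
            return lines[-1]
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll)
```

A caller would kill the solver only when a termination line is returned, and report an error when the result is None instead of waiting indefinitely.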

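How to use this workaround

  1. Create a Python file named abq. The code for abq is given below. If you use a solver other than abq2018, replace the line cmd = 'abq20xx.. with the solver you use.
  2. Make abq executable and available on your path. I placed abq in the Abaqus Commands folder and then ran chmod +x abq
  3. Run an Abaqus standard job by executing abq job=Job-1. This runs Job-1.inp and then kills the standard solver once Job-1.sta is complete.

The code for abq is as follows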

#!/usr/bin/python
import subprocess
import sys
import time

# Only the job= argument is parsed, e.g. "abq job=Job-1"
jobname = sys.argv[1].split('job=')[-1]

# Submit the job in the background
cmd = 'abq2018 cpus=4 ask_delete=OFF background job=' + jobname
subprocess.call(cmd, shell=True)

complete = False
termination_criteria = [' THE ANALYSIS HAS COMPLETED SUCCESSFULLY\n',
                        ' THE ANALYSIS HAS NOT BEEN COMPLETED\n']

while not complete:
    # poll the .sta file every 5 seconds
    time.sleep(5)
    try:
        with open(jobname + '.sta', 'r') as f:
            lines = f.readlines()
        # an empty .sta file has no last line yet
        if lines and lines[-1] in termination_criteria:
            # this will kill any process named standard
            subprocess.call('pgrep standard | xargs kill', shell=True)
            complete = True
    except IOError:
        # the .sta file has been deleted or doesn't exist yet
        # try again in 5 seconds
        time.sleep(5)

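Answer 4

I ran into this problem on Linux Mint 19 as well, with Abaqus 6.14-5 installed. The process does not terminate on its own, but the .sta file shows that the analysis has completed. I think this problem is related to the kernel. By the way, have you found a solution yet?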
