Computers with the same number of vCPUs: one gets an MPI error and the other doesn't

TL;DR

Two machines with the same number of vCPUs allow different numbers of MPI processes. Why?


I am running two Ubuntu instances:

  1. Deep Learning Base AMI (Ubuntu 18.04) Version 20.2 (ami-0c8466c376c0d21e1)
  2. Deep Learning AMI (Ubuntu 18.04) Version 25.3 (ami-0cfb96b24266ec1ce)

Both have 32 vCPUs: 16 cores per socket and 2 threads per core.

AMI 2 is able to run mpirun -np 19 python3, but AMI 1 gives this error:

--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 19
slots that were requested by the application:

  python3

Either request fewer slots for your application, or make more slots
available for use.

Why can I run only 16 MPI processes on AMI 1, but 19 or more on AMI 2?


I ran lscpu on both:

AMI 1:

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              32
On-line CPU(s) list: 0-31
Thread(s) per core:  2
Core(s) per socket:  16
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               79
Model name:          Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
Stepping:            1
CPU MHz:             1683.161
CPU max MHz:         3000.0000
CPU min MHz:         1200.0000
BogoMIPS:            4600.14
Hypervisor vendor:   Xen
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            46080K
NUMA node0 CPU(s):   0-31

AMI 2:

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              32
On-line CPU(s) list: 0-31
Thread(s) per core:  2
Core(s) per socket:  16
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               79
Model name:          Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
Stepping:            1
CPU MHz:             2237.235
CPU max MHz:         3000.0000
CPU min MHz:         1200.0000
BogoMIPS:            4600.14
Hypervisor vendor:   Xen
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            46080K
NUMA node0 CPU(s):   0-31
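
One thing the two listings make easy to check: the limit of 16 on AMI 1 equals Socket(s) x Core(s) per socket, not CPU(s). Open MPI, when no hostfile is given, counts one slot per processor core by default rather than one per hardware thread. Whether AMI 1 actually uses Open MPI, and what AMI 2 ships, is an assumption here, since the question does not say. The sketch below only parses lscpu-style text and compares the request of 19 against both counts; the helper names parse_lscpu and slot_counts are made up for illustration.

```python
def parse_lscpu(text):
    """Turn `lscpu` output into a dict of field name -> value string."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only, so values containing ':' survive.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def slot_counts(fields):
    """Return (physical cores, hardware threads) from parsed lscpu fields."""
    sockets = int(fields["Socket(s)"])
    cores_per_socket = int(fields["Core(s) per socket"])
    threads_per_core = int(fields["Thread(s) per core"])
    cores = sockets * cores_per_socket
    return cores, cores * threads_per_core


# The relevant lines from the lscpu output above (identical on both AMIs).
LSCPU_SAMPLE = """\
CPU(s):              32
Thread(s) per core:  2
Core(s) per socket:  16
Socket(s):           1
"""

cores, hw_threads = slot_counts(parse_lscpu(LSCPU_SAMPLE))
requested = 19  # the -np value from the question

print(f"physical cores: {cores}, hardware threads: {hw_threads}")
# physical cores: 16, hardware threads: 32
print(f"-np {requested} fits in cores: {requested <= cores}")
# -np 19 fits in cores: False
print(f"-np {requested} fits in hardware threads: {requested <= hw_threads}")
# -np 19 fits in hardware threads: True
```

If AMI 1 does use Open MPI, mpirun --use-hwthread-cpus -np 19 python3 would count hardware threads as slots, and mpirun --oversubscribe -np 19 python3 would lift the limit altogether. Comparing the output of mpirun --version on the two AMIs would show whether they ship the same MPI implementation.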

Answer 1

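AMI 1 is not running at its maximum speed (see https://en.wikichip.org/wiki/intel/xeon_e5/e5-2686_v4), whereas AMI 2 is. I would swap the CPU and see whether the problem persists; if it does, your CPU is faulty and, if it is under warranty, should be sent back to Intel.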
