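Two machines with the same number of vCPUs allow different numbers of MPI processes. Why?
I am running two Ubuntu instances:
- Deep Learning Base AMI (Ubuntu 18.04) Version 20.2 (ami-0c8466c376c0d21e1)
- Deep Learning AMI (Ubuntu 18.04) Version 25.3 (ami-0cfb96b24266ec1ce)
Both have 32 vCPUs: 16 cores per socket and 2 threads per core.
The AMI 2 instance can run mpirun -np 19 python3,
but the AMI 1 instance fails with this error: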
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 19
slots that were requested by the application:
python3
Either request fewer slots for your application, or make more slots
available for use.
Why can I run only 16 MPI processes on AMI 1, but 19 or more MPI processes on AMI 2?
I ran lscpu on both:
AMI 1:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
Stepping: 1
CPU MHz: 1683.161
CPU max MHz: 3000.0000
CPU min MHz: 1200.0000
BogoMIPS: 4600.14
Hypervisor vendor: Xen
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 46080K
NUMA node0 CPU(s): 0-31
AMI 2:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
Stepping: 1
CPU MHz: 2237.235
CPU max MHz: 3000.0000
CPU min MHz: 1200.0000
BogoMIPS: 4600.14
Hypervisor vendor: Xen
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 46080K
NUMA node0 CPU(s): 0-31
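To make the arithmetic behind those two listings explicit, here is a minimal sketch that derives the physical core count and the hardware thread count from the lscpu fields. The values are copied from the output above (they are identical on both AMIs); the variable names and the `field` helper are made up for illustration. The point is that 19 processes is more than the 16 physical cores but still fewer than the 32 hardware threads, and the limit of 16 seen on AMI 1 coincides with the core count.

```shell
#!/bin/sh
# Subset of the lscpu output shown above (same values on AMI 1 and AMI 2).
lscpu_text='CPU(s): 32
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 1'

# Hypothetical helper: print the value after the colon for one lscpu field.
field() {
    printf '%s\n' "$lscpu_text" | awk -F': *' -v key="$1" '$1 == key { print $2 }'
}

threads_per_core=$(field 'Thread(s) per core')
cores_per_socket=$(field 'Core(s) per socket')
sockets=$(field 'Socket(s)')

# Physical cores = cores per socket x sockets; hardware threads add SMT on top.
physical_cores=$((cores_per_socket * sockets))
hardware_threads=$((physical_cores * threads_per_core))
requested=19   # the -np value from the failing command

echo "physical cores:   $physical_cores"
echo "hardware threads: $hardware_threads"
if [ "$requested" -gt "$physical_cores" ] && [ "$requested" -le "$hardware_threads" ]; then
    echo "$requested processes exceed the cores but fit within the hardware threads"
fi
```

On a live instance the same numbers could be read by piping the real lscpu output into the helper instead of using the pasted text.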
Answer 1
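AMI 1 is not running at its top speed (see https://en.wikichip.org/wiki/intel/xeon_e5/e5-2686_v4), whereas AMI 2 is running closer to it: lscpu reports 1683 MHz on AMI 1 and 2237 MHz on AMI 2, against a maximum of 3000 MHz. I would swap the CPU and see whether the problem persists; if it does, the CPU is faulty and should be returned to Intel under warranty.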
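Separately from clock speed, it may be worth checking how mpirun counts slots on each image. The "not enough slots" text is Open MPI's message, and when no hostfile is given Open MPI by default counts one slot per processor core rather than per hardware thread. That would match the limit of 16 on a machine with 16 cores. Whether AMI 2 ships a different MPI launcher or version is an assumption; the question does not say. The sketch below only inspects the launcher on PATH and prints two candidate invocations using Open MPI's `--use-hwthread-cpus` and `--oversubscribe` options; the variable names are illustrative.

```shell
#!/bin/sh
np=19   # process count from the question

# Show which launcher is first on PATH and what it reports about itself.
launcher=$(command -v mpirun || true)
if [ -n "$launcher" ]; then
    echo "mpirun resolves to: $launcher"
    mpirun --version || true
else
    echo "mpirun is not on PATH in this shell"
fi

# Candidate invocations to try on AMI 1 (printed here, not executed).
cmd_hwthreads="mpirun --use-hwthread-cpus -np $np python3"
cmd_oversubscribe="mpirun --oversubscribe -np $np python3"
echo "count hardware threads as slots:   $cmd_hwthreads"
echo "allow more processes than slots:   $cmd_oversubscribe"
```

Running this on both instances and comparing the reported launcher path and version would show whether the two AMIs are using the same MPI installation.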