Why does the speed of non-interacting, memory-heavy processes depend on how many of them are running (and how can it be fixed)?

Question 1

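Now that I understand what went wrong, I know this is a hardware limitation rather than a UNIX limitation, so it was not really a good fit for this site. Still, I thought I should add some closing remarks.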

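My memory-bound, independent processes were indeed running into a memory bandwidth problem. I repeated the experiment on a Knights Landing processor and learned how to allocate NumPy arrays in its local MCDRAM. With local memory there is no contention on the memory bus, and the processes keep scaling well beyond the limit I observed on ordinary hardware.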
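
To make "memory-bound" concrete, here is a minimal sketch (not part of the measurements above) that estimates how fast a single process can stream data through main memory. The function name `measure_copy_bandwidth` and the buffer size are illustrative assumptions. The buffer should be much larger than the CPU caches for the figure to mean anything.

```python
import time
import numpy

def measure_copy_bandwidth(n_bytes, repeats=3):
    """Return the best observed copy bandwidth, in GB/s, for an n_bytes buffer."""
    src = numpy.ones(n_bytes // 8, dtype=numpy.float64)
    dst = numpy.empty_like(src)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        numpy.copyto(dst, src)  # streams the buffer: one read and one write
        # The first pass also pays for page faults on dst, so keep the fastest pass.
        best = min(best, time.perf_counter() - start)
    # Bytes moved per pass: src is read once and dst is written once.
    return 2 * src.nbytes / best / 1e9
```

Running several copies of something like `measure_copy_bandwidth(256 * 1024 * 1024)` at the same time and watching the per-process figure fall would be one way to see the shared-bandwidth ceiling described above.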

Here is how to allocate a NumPy array in MCDRAM (rather than ordinary RAM).

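import ctypes
import numpy

def malloc_mcdram(size):
    """Allocate `size` bytes on NUMA node 1 and return a uint8 pointer."""
    libnuma = ctypes.cdll.LoadLibrary("libnuma.so")
    assert libnuma.numa_available() == 0   # NUMA not available is -1

    # Declare the C signature so the size and the returned pointer are not truncated.
    libnuma.numa_alloc_onnode.argtypes = [ctypes.c_size_t, ctypes.c_int]
    libnuma.numa_alloc_onnode.restype = ctypes.POINTER(ctypes.c_uint8)
    # Node 1 is assumed to be the MCDRAM node; verify with `numactl -H`.
    ptr = libnuma.numa_alloc_onnode(size, 1)
    assert ptr, "numa_alloc_onnode returned NULL"
    return ptr

def custom_allocator_array(allocator, size, dtype):
    """Wrap `size` bytes from `allocator` in a NumPy array of `dtype`, without copying."""
    ptr = allocator(size)
    # as_array builds a uint8 view over the raw memory; .view reinterprets it as dtype.
    # The memory is not freed automatically (libnuma provides numa_free for that).
    return numpy.ctypeslib.as_array(ptr, shape=(size,)).view(dtype)

# sizeInBytes is the total allocation size; it must be a multiple of 8 for float64.
myarray = custom_allocator_array(malloc_mcdram, sizeInBytes, numpy.float64)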
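
The allocation code above hard-codes NUMA node 1. As a hedged sketch of how one might check which node is the MCDRAM one, the following lists each node's total and free memory through libnuma's `numa_max_node` and `numa_node_size64`. The helper name `list_numa_nodes` is an illustrative assumption, and the function simply returns an empty list when libnuma is missing or NUMA is unavailable.

```python
import ctypes
import ctypes.util

def list_numa_nodes():
    """Return [(node, total_bytes, free_bytes), ...], or [] if libnuma is unusable."""
    path = ctypes.util.find_library("numa")
    if path is None:
        return []
    try:
        libnuma = ctypes.CDLL(path)
    except OSError:
        return []
    if libnuma.numa_available() < 0:   # -1 means NUMA is not available
        return []

    libnuma.numa_max_node.restype = ctypes.c_int
    libnuma.numa_node_size64.restype = ctypes.c_longlong
    libnuma.numa_node_size64.argtypes = [ctypes.c_int,
                                         ctypes.POINTER(ctypes.c_longlong)]
    nodes = []
    for node in range(libnuma.numa_max_node() + 1):
        free = ctypes.c_longlong(0)
        # Returns the node's total size in bytes and stores the free bytes in `free`.
        total = libnuma.numa_node_size64(node, ctypes.byref(free))
        nodes.append((node, total, free.value))
    return nodes
```

On a Knights Landing machine one would expect the MCDRAM node to stand out by its size, but that depends on how the memory mode is configured, so treat the output as a hint and not as proof.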

Question 2

Your processes are memory-heavy, not CPU-heavy. Try this:

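#!/usr/bin/env python

import datetime
import hashlib

# bytes, not str: hashlib only accepts bytes on Python 3
data = b"\0" * 64

ts_start = datetime.datetime.now()
# Each digest feeds the next hash, so the working set stays tiny and the loop is CPU-bound.
for i in range(10000000):
    data = hashlib.sha512(data).digest()
ts_end = datetime.datetime.now()
print("Elapsed: %s" % (ts_end - ts_start))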

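On my 2-socket / 8-core / 16-thread machine, running up to 8 instances in parallel gives consistent results: each takes about 20 seconds to finish. Beyond that, performance degrades as the processes start competing for CPU resources.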

Single run:

~$ python cpuheavy.py 
Elapsed: 0:00:20.461652

8 in parallel (= 1 per core), still the same time:

~$ for i in $(seq 8); do python cpuheavy.py & done
Elapsed: 0:00:18.979012
Elapsed: 0:00:19.092770
Elapsed: 0:00:19.873763
Elapsed: 0:00:20.139105
Elapsed: 0:00:20.147066
Elapsed: 0:00:20.181319
Elapsed: 0:00:21.328754
Elapsed: 0:00:21.495310

Running 16 in parallel (= 1 per hyperthread), the time rises to about 31 seconds as the processes start competing for CPU time. That is an increase of roughly 50%.

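Running 32 in parallel, the processes have to share CPU threads, so performance drops further. The completion time of each process rises to more than 2 minutes (a 4x increase in time).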
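
The shell loop above can also be driven from Python. The following is a minimal sketch, assuming the same SHA-512 loop as the script above. The names `WORKER` and `run_parallel` are illustrative, not from the original script. Each worker is a separate interpreter process, just as with `python cpuheavy.py &`.

```python
import subprocess
import sys

# Source of one worker: the same hashing loop, with the iteration count in argv[1].
WORKER = r'''
import hashlib, sys, time
data = b"\0" * 64
start = time.perf_counter()
for _ in range(int(sys.argv[1])):
    data = hashlib.sha512(data).digest()
print(time.perf_counter() - start)
'''

def run_parallel(n_procs, iterations):
    """Start n_procs workers at once and return each worker's elapsed seconds."""
    procs = [subprocess.Popen([sys.executable, "-c", WORKER, str(iterations)],
                              stdout=subprocess.PIPE, universal_newlines=True)
             for _ in range(n_procs)]
    # communicate() waits for the worker to exit and returns its printed time.
    return [float(p.communicate()[0]) for p in procs]
```

Calling `run_parallel(n, 10000000)` for n = 1, 8, 16 and 32 should show the same kind of pattern as the timings above, although the absolute numbers will depend on the machine.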

Answer