Xeon Skylake SMP memory performance is unexpectedly and inexplicably slow (and anomalous)


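We have been testing a server with 2x Xeon Gold 6154 CPUs, a Supermicro X11DPH-I motherboard and 96GB of RAM, and found some very strange memory performance issues when comparing it against the same machine running with only 1 CPU (one socket empty), a similar dual-CPU Haswell Xeon E5-2687Wv3 (used for this series of tests, though other Broadwells perform similarly), Broadwell-E i7s and Skylake-X i9s (for comparison).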

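One would expect the Skylake Xeon processors, with their faster memory, to outperform the Haswell in the various memcpy functions and even in memory allocation (not covered in the tests below, as we found a workaround). Instead, with both CPUs installed, the Skylake Xeons perform at almost half the speed of the Haswell Xeons, and even worse when compared to an i7-6800k. Stranger still, when using Windows VirtualAllocExNuma to assign the NUMA node for memory allocation, plain memory copy functions perform worse on the remote node than on the local node, as expected, but memory copy functions that use the SSE, MMX and AVX registers run much faster on the remote NUMA node than on the local node (what?). And as noted above, if we pull 1 CPU from the Skylake Xeon system, it performs more or less as expected (still a bit slower than Haswell, but not dramatically so).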
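For readers unfamiliar with the call, here is a minimal sketch of allocating a buffer with a preferred NUMA node through VirtualAllocExNuma. The helper names `alloc_on_node` and `free_on_node` are hypothetical and are not taken from the test program. The non-Windows branch is only a fallback so the sketch compiles elsewhere; it ignores the node argument entirely.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: allocate `bytes` with a preferred NUMA node.
// Windows: VirtualAllocExNuma on the current process.
// Elsewhere: plain std::malloc, `node` is ignored (not NUMA-aware).
void* alloc_on_node(std::size_t bytes, unsigned node) {
    assert(bytes > 0);
#ifdef _WIN32
    return VirtualAllocExNuma(GetCurrentProcess(), nullptr, bytes,
                              MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE, node);
#else
    (void)node;
    return std::malloc(bytes);
#endif
}

// Hypothetical helper: release memory obtained from alloc_on_node.
void free_on_node(void* p) {
#ifdef _WIN32
    if (p) VirtualFree(p, 0, MEM_RELEASE);
#else
    std::free(p);
#endif
}
```

The node number is a preference for where the physical pages should reside, so it is worth touching the buffer once before timing anything against it.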

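I'm not sure whether this is a bug in the motherboard or the CPU, something to do with UPI vs QPI, or none of the above, but no combination of BIOS settings seems to resolve it. Disabling NUMA in the BIOS (not included in the test results) does improve the performance of all the copy functions that use the SSE, MMX and AVX registers, but all the other plain memory copy functions then take a large hit as well.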

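For our test program, we tested both inline assembly functions and _mm intrinsics. We used Windows 10 with Visual Studio 2017 for everything except the assembly functions: because msvc++ won't compile asm for x64, we used gcc from mingw/msys to compile them into an obj file with the -c -O2 flags, which we then passed to the msvc++ linker.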

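If the system uses NUMA nodes, we test both operator new and VirtualAllocExNuma for each NUMA node as the memory allocation method. For each memory copy function we take a cumulative average over copies of 100 memory buffers of 16MB each, and we rotate which memory allocation we are on between each set of tests.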
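The averaging described above can be sketched roughly as follows. This is an illustration, not the code from the test program: `CopyFn`, `plain_memcpy` and `average_copy_micros` are hypothetical names, and std::chrono::steady_clock is an assumed choice of timer.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstring>
#include <vector>

// Signature shared by every copy routine under test (hypothetical).
using CopyFn = void* (*)(void* dst, const void* src, std::size_t n);

// Baseline routine with the CopyFn signature.
void* plain_memcpy(void* dst, const void* src, std::size_t n) {
    return std::memcpy(dst, src, n);
}

// Average microseconds per copy. Each copy uses the next buffer pair in
// rotation, so consecutive copies do not touch the same memory.
double average_copy_micros(CopyFn copy,
                           const std::vector<unsigned char*>& dst,
                           const std::vector<const unsigned char*>& src,
                           std::size_t bytes, std::size_t copies) {
    assert(!dst.empty() && dst.size() == src.size() && copies > 0);
    using clock = std::chrono::steady_clock;
    std::chrono::duration<double, std::micro> total{0};
    for (std::size_t i = 0; i < copies; ++i) {
        const std::size_t k = i % dst.size();  // rotate through the pairs
        const auto t0 = clock::now();
        copy(dst[k], src[k], bytes);
        total += clock::now() - t0;            // accumulate only the copy time
    }
    return total.count() / static_cast<double>(copies);
}
```

Timing each copy separately keeps the rotation bookkeeping out of the measurement, at the cost of two clock reads per copy, which is negligible next to a copy that takes milliseconds.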

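All 100 source and 100 destination buffers are 64-byte aligned (for compatibility with the streaming functions up to AVX512) and are initialized once: incrementing data for the source buffers and 0xff for the destination buffers.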
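A minimal sketch of that buffer setup, assuming "incrementing data" means a byte counter that wraps at 256. The `AlignedBuffer` type and the fill helpers are hypothetical names. The alignment is done by over-allocating and rounding the pointer up, which is one portable option and not necessarily what the test program does.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

constexpr std::size_t kAlign = 64;  // enough for 512-bit aligned accesses

// Hypothetical buffer type: over-allocates by kAlign - 1 bytes and exposes
// the first 64-byte-aligned address inside the owned storage.
struct AlignedBuffer {
    std::vector<unsigned char> storage;
    unsigned char* data;
    std::size_t size;

    explicit AlignedBuffer(std::size_t bytes)
        : storage(bytes + kAlign - 1), data(nullptr), size(bytes) {
        const auto addr = reinterpret_cast<std::uintptr_t>(storage.data());
        data = storage.data() + (kAlign - addr % kAlign) % kAlign;
        assert(reinterpret_cast<std::uintptr_t>(data) % kAlign == 0);
    }
    // Copying would leave `data` pointing into the other object's storage.
    AlignedBuffer(const AlignedBuffer&) = delete;
    AlignedBuffer& operator=(const AlignedBuffer&) = delete;
};

// Source pattern: incrementing bytes, wrapping at 256 (assumption).
void fill_source(unsigned char* p, std::size_t bytes) {
    for (std::size_t i = 0; i < bytes; ++i) p[i] = static_cast<unsigned char>(i);
}

// Destination pattern: every byte 0xff, so a failed copy is easy to spot.
void fill_destination(unsigned char* p, std::size_t bytes) {
    std::memset(p, 0xff, bytes);
}
```

Writing to every byte during initialization also ensures the pages are physically backed before the first timed copy.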

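The number of copies averaged differs per machine and per configuration, because the test ran much faster on some machines and much slower on others.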

The results were as follows:

Haswell Xeon E5-2687Wv3, 1 CPU (1 empty socket), on a Supermicro X10DAi with 32GB DDR4-2400 (10c/20t, 25 MB of L3 cache). Keep in mind that the benchmark rotates through 100 pairs of 16MB buffers, so we are probably not getting L3 cache hits.

---------------------------------------------------------------------------
Averaging 7000 copies of 16MB of data per function for operator new
---------------------------------------------------------------------------
std::memcpy                      averaging 2264.48 microseconds
asm_memcpy (asm)                 averaging 2322.71 microseconds
sse_memcpy (intrinsic)           averaging 1569.67 microseconds
sse_memcpy (asm)                 averaging 1589.31 microseconds
sse2_memcpy (intrinsic)          averaging 1561.19 microseconds
sse2_memcpy (asm)                averaging 1664.18 microseconds
mmx_memcpy (asm)                 averaging 2497.73 microseconds
mmx2_memcpy (asm)                averaging 1626.68 microseconds
avx_memcpy (intrinsic)           averaging 1625.12 microseconds
avx_memcpy (asm)                 averaging 1592.58 microseconds
avx512_memcpy (intrinsic)        unsupported on this CPU
rep movsb (asm)                  averaging 2260.6 microseconds

Haswell Dual Xeon E5-2687Wv3, 2 CPUs, on a Supermicro X10DAi with 64GB RAM

---------------------------------------------------------------------------
Averaging 6900 copies of 16MB of data per function for VirtualAllocExNuma to NUMA node 0(local)
---------------------------------------------------------------------------
std::memcpy                      averaging 3179.8 microseconds
asm_memcpy (asm)                 averaging 3177.15 microseconds
sse_memcpy (intrinsic)           averaging 1633.87 microseconds
sse_memcpy (asm)                 averaging 1663.8 microseconds
sse2_memcpy (intrinsic)          averaging 1620.86 microseconds
sse2_memcpy (asm)                averaging 1727.36 microseconds
mmx_memcpy (asm)                 averaging 2623.07 microseconds
mmx2_memcpy (asm)                averaging 1691.1 microseconds
avx_memcpy (intrinsic)           averaging 1704.33 microseconds
avx_memcpy (asm)                 averaging 1692.69 microseconds
avx512_memcpy (intrinsic)        unsupported on this CPU
rep movsb (asm)                  averaging 3185.84 microseconds
---------------------------------------------------------------------------
Averaging 6900 copies of 16MB of data per function for VirtualAllocExNuma to NUMA node 1
---------------------------------------------------------------------------
std::memcpy                      averaging 3992.46 microseconds
asm_memcpy (asm)                 averaging 4039.11 microseconds
sse_memcpy (intrinsic)           averaging 3174.69 microseconds
sse_memcpy (asm)                 averaging 3129.18 microseconds
sse2_memcpy (intrinsic)          averaging 3161.9 microseconds
sse2_memcpy (asm)                averaging 3141.33 microseconds
mmx_memcpy (asm)                 averaging 4010.17 microseconds
mmx2_memcpy (asm)                averaging 3211.75 microseconds
avx_memcpy (intrinsic)           averaging 3003.14 microseconds
avx_memcpy (asm)                 averaging 2980.97 microseconds
avx512_memcpy (intrinsic)        unsupported on this CPU
rep movsb (asm)                  averaging 3987.91 microseconds
---------------------------------------------------------------------------
Averaging 6900 copies of 16MB of data per function for operator new
---------------------------------------------------------------------------
std::memcpy                      averaging 3172.95 microseconds
asm_memcpy (asm)                 averaging 3173.5 microseconds
sse_memcpy (intrinsic)           averaging 1623.84 microseconds
sse_memcpy (asm)                 averaging 1657.07 microseconds
sse2_memcpy (intrinsic)          averaging 1616.95 microseconds
sse2_memcpy (asm)                averaging 1739.05 microseconds
mmx_memcpy (asm)                 averaging 2623.71 microseconds
mmx2_memcpy (asm)                averaging 1699.33 microseconds
avx_memcpy (intrinsic)           averaging 1710.09 microseconds
avx_memcpy (asm)                 averaging 1688.34 microseconds
avx512_memcpy (intrinsic)        unsupported on this CPU
rep movsb (asm)                  averaging 3175.14 microseconds

Skylake Xeon Gold 6154, 1 CPU (1 empty socket), on a Supermicro X11DPH-I with 48GB DDR4-2666 (18c/36t, 24.75 MB of L3 cache)

---------------------------------------------------------------------------
Averaging 5000 copies of 16MB of data per function for operator new
---------------------------------------------------------------------------
std::memcpy                      averaging 1832.42 microseconds
asm_memcpy (asm)                 averaging 1837.62 microseconds
sse_memcpy (intrinsic)           averaging 1647.84 microseconds
sse_memcpy (asm)                 averaging 1710.53 microseconds
sse2_memcpy (intrinsic)          averaging 1645.54 microseconds
sse2_memcpy (asm)                averaging 1794.36 microseconds
mmx_memcpy (asm)                 averaging 2030.51 microseconds
mmx2_memcpy (asm)                averaging 1816.82 microseconds
avx_memcpy (intrinsic)           averaging 1686.49 microseconds
avx_memcpy (asm)                 averaging 1716.15 microseconds
avx512_memcpy (intrinsic)        averaging 1761.6 microseconds
rep movsb (asm)                  averaging 1977.6 microseconds

Skylake Xeon Gold 6154, 2 CPUs, on a Supermicro X11DPH-I with 96GB DDR4-2666

---------------------------------------------------------------------------
Averaging 4100 copies of 16MB of data per function for VirtualAllocExNuma to NUMA node 0(local)
---------------------------------------------------------------------------
std::memcpy                      averaging 3131.6 microseconds
asm_memcpy (asm)                 averaging 3070.57 microseconds
sse_memcpy (intrinsic)           averaging 3297.72 microseconds
sse_memcpy (asm)                 averaging 3423.38 microseconds
sse2_memcpy (intrinsic)          averaging 3274.31 microseconds
sse2_memcpy (asm)                averaging 3413.48 microseconds
mmx_memcpy (asm)                 averaging 2069.53 microseconds
mmx2_memcpy (asm)                averaging 3694.91 microseconds
avx_memcpy (intrinsic)           averaging 3118.75 microseconds
avx_memcpy (asm)                 averaging 3224.36 microseconds
avx512_memcpy (intrinsic)        averaging 3156.56 microseconds
rep movsb (asm)                  averaging 3155.36 microseconds
---------------------------------------------------------------------------
Averaging 4100 copies of 16MB of data per function for VirtualAllocExNuma to NUMA node 1
---------------------------------------------------------------------------
std::memcpy                      averaging 5309.77 microseconds
asm_memcpy (asm)                 averaging 5330.78 microseconds
sse_memcpy (intrinsic)           averaging 2350.61 microseconds
sse_memcpy (asm)                 averaging 2402.57 microseconds
sse2_memcpy (intrinsic)          averaging 2338.61 microseconds
sse2_memcpy (asm)                averaging 2475.51 microseconds
mmx_memcpy (asm)                 averaging 2883.97 microseconds
mmx2_memcpy (asm)                averaging 2517.69 microseconds
avx_memcpy (intrinsic)           averaging 2356.07 microseconds
avx_memcpy (asm)                 averaging 2415.22 microseconds
avx512_memcpy (intrinsic)        averaging 2487.01 microseconds
rep movsb (asm)                  averaging 5372.98 microseconds
---------------------------------------------------------------------------
Averaging 4100 copies of 16MB of data per function for operator new
---------------------------------------------------------------------------
std::memcpy                      averaging 3075.1 microseconds
asm_memcpy (asm)                 averaging 3061.97 microseconds
sse_memcpy (intrinsic)           averaging 3281.17 microseconds
sse_memcpy (asm)                 averaging 3421.38 microseconds
sse2_memcpy (intrinsic)          averaging 3268.79 microseconds
sse2_memcpy (asm)                averaging 3435.76 microseconds
mmx_memcpy (asm)                 averaging 2061.27 microseconds
mmx2_memcpy (asm)                averaging 3694.48 microseconds
avx_memcpy (intrinsic)           averaging 3111.16 microseconds
avx_memcpy (asm)                 averaging 3227.45 microseconds
avx512_memcpy (intrinsic)        averaging 3148.65 microseconds
rep movsb (asm)                  averaging 2967.45 microseconds

Skylake-X i9-7940X on an ASUS ROG Rampage VI Extreme with 32GB DDR4-4266 (14c/28t, 19.25 MB of L3 cache) (overclocked to 3.8GHz/4.4GHz turbo, DDR at 4040MHz, target AVX frequency 3737MHz, target AVX-512 frequency 3535MHz, target cache frequency 2424MHz)

---------------------------------------------------------------------------
Averaging 6500 copies of 16MB of data per function for operator new
---------------------------------------------------------------------------
std::memcpy                      averaging 1750.87 microseconds
asm_memcpy (asm)                 averaging 1748.22 microseconds
sse_memcpy (intrinsic)           averaging 1743.39 microseconds
sse_memcpy (asm)                 averaging 3120.18 microseconds
sse2_memcpy (intrinsic)          averaging 1743.37 microseconds
sse2_memcpy (asm)                averaging 2868.52 microseconds
mmx_memcpy (asm)                 averaging 2255.17 microseconds
mmx2_memcpy (asm)                averaging 3434.58 microseconds
avx_memcpy (intrinsic)           averaging 1698.49 microseconds
avx_memcpy (asm)                 averaging 2840.65 microseconds
avx512_memcpy (intrinsic)        averaging 1670.05 microseconds
rep movsb (asm)                  averaging 1718.77 microseconds

Broadwell i7-6800k on an ASUS X99 with 24GB DDR4-2400 (6c/12t, 15 MB of L3 cache)

---------------------------------------------------------------------------
Averaging 64900 copies of 16MB of data per function for operator new
---------------------------------------------------------------------------
std::memcpy                      averaging 2522.1 microseconds
asm_memcpy (asm)                 averaging 2615.92 microseconds
sse_memcpy (intrinsic)           averaging 1621.81 microseconds
sse_memcpy (asm)                 averaging 1669.39 microseconds
sse2_memcpy (intrinsic)          averaging 1617.04 microseconds
sse2_memcpy (asm)                averaging 1719.06 microseconds
mmx_memcpy (asm)                 averaging 3021.02 microseconds
mmx2_memcpy (asm)                averaging 1691.68 microseconds
avx_memcpy (intrinsic)           averaging 1654.41 microseconds
avx_memcpy (asm)                 averaging 1666.84 microseconds
avx512_memcpy (intrinsic)        unsupported on this CPU
rep movsb (asm)                  averaging 2520.13 microseconds
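To make the averages above easier to compare, they can be converted into copy throughput. The sketch below assumes that "16MB" means 16 * 1024 * 1024 bytes, and the function name `copy_gb_per_s` is hypothetical. It counts only the bytes copied; the actual memory traffic is at least twice that, since every byte is both read and written.

```cpp
#include <cassert>

// Assumption: one buffer is 16 * 1024 * 1024 bytes.
constexpr double kBufferBytes = 16.0 * 1024.0 * 1024.0;

// Copy throughput in GB/s (10^9 bytes per second) for one buffer copied
// in `micros` microseconds.
double copy_gb_per_s(double micros) {
    assert(micros > 0.0);
    const double seconds = micros * 1e-6;
    return kBufferBytes / seconds / 1e9;
}
```

Under that assumption, the 2264.48 microseconds measured for std::memcpy on the single-CPU Haswell works out to roughly 7.4 GB/s, and the 5309.77 microseconds for std::memcpy on the remote node of the dual Skylake works out to roughly 3.2 GB/s.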

The assembly functions are derived from fast_memcpy in xine-libs and are used mostly as a comparison against msvc++'s optimizer.

The source code for the test is available at https://github.com/marcmicalizzi/memcpy_test (it is a bit too long to include in the post).

Has anyone else run into this, or does anyone have any insight into why it might be happening?


UPDATE 2018-05-15 13:40EST

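As suggested by Peter Cordes, I've updated the test to compare prefetched vs not prefetched and NT stores vs regular stores, and I've tuned the prefetching done in each function. (I have no meaningful experience writing prefetching code, so if I've made any mistakes, please let me know and I'll adjust the tests accordingly. The prefetching does have an impact, so at the very least it is doing something.) These changes are reflected in the latest revision at the GitHub link given earlier, for anyone looking for the source code.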

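I also added an SSE4.1 memcpy, since I can't find any _mm_stream_load SSE function prior to SSE4.1 (I specifically used _mm_stream_load_si128). This means sse_memcpy and sse2_memcpy cannot use NT operations throughout. Likewise, the avx_memcpy function uses AVX2 functions for stream loading.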
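To illustrate the difference between the two store flavors being compared, here is a minimal SSE2 sketch with a regular load and a switchable regular or non-temporal store. It is not the code from the repository: the function name `sse2_copy`, the 256-byte prefetch distance and the choice of prefetchnta are illustrative assumptions. It uses only SSE and SSE2 intrinsics, so the loads are ordinary loads and not NT loads.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <emmintrin.h>  // SSE2: _mm_load_si128, _mm_store_si128, _mm_stream_si128
#include <xmmintrin.h>  // SSE: _mm_prefetch, _mm_sfence

// Copy `bytes` (a multiple of 16) between 16-byte-aligned buffers.
// use_nt selects non-temporal stores (_mm_stream_si128) over regular
// stores (_mm_store_si128).
void sse2_copy(void* dst, const void* src, std::size_t bytes, bool use_nt) {
    assert(bytes % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(dst) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(src) % 16 == 0);

    auto* d = static_cast<__m128i*>(dst);
    const auto* s = static_cast<const __m128i*>(src);
    const auto* sbytes = static_cast<const char*>(src);
    const std::size_t n = bytes / 16;
    const std::size_t ahead = 256;  // prefetch distance: a guess, not tuned

    for (std::size_t i = 0; i < n; ++i) {
        // Once per 64-byte line, hint a line further ahead in the source.
        if (i % 4 == 0 && i * 16 + ahead < bytes) {
            _mm_prefetch(sbytes + i * 16 + ahead, _MM_HINT_NTA);
        }
        const __m128i v = _mm_load_si128(s + i);
        if (use_nt) {
            _mm_stream_si128(d + i, v);  // store with a non-temporal hint
        } else {
            _mm_store_si128(d + i, v);   // regular aligned store
        }
    }
    if (use_nt) {
        _mm_sfence();  // order the NT stores before any later stores
    }
}
```

Calling it once with `use_nt` set to true and once with false on the same buffers gives the kind of NT vs regular store comparison described in the update.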

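I opted not to test store-only and load-only access patterns yet, as I'm not sure a store-only test would be meaningful: without a load into the registers it stores from, the data would be meaningless and unverifiable.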

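The interesting result of the new test is that on the dual-socket Xeon Skylake setup, and only on that setup, the regular store functions were actually significantly faster than the NT streaming functions for 16MB memory copies. Also only on that setup (and only with LLC prefetch enabled in the BIOS), prefetchnta outperforms both prefetcht0 and no prefetch in some tests (SSE, SSE4.1).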

The raw results of this new test are too long to add to the post, so they are posted in the same git repository as the source code, under results-2018-05-15.

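I still don't understand why, for streaming NT stores, the remote NUMA node is faster under the Skylake SMP setup, although regular stores on the local NUMA node are faster still.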

Answer 1

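Is your memory the wrong rank? Perhaps your board does something odd with memory ranking when you add that second CPU? I know that quad-CPU machines do all sorts of strange things to make the memory work properly, and if you have incorrectly ranked memory it will sometimes work but clock back to something like 1/4 or 1/2 of the speed. Perhaps SuperMicro did something on that board to turn DDR4 with dual CPUs into quad channel, and it's using similar math. Incorrect rank == 1/2 speed.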
