我需要一些帮助来确定我在服务器上的 Linux 下看到的内存带宽是否正常。以下是服务器规格:
HP ProLiant DL165 G7
2x AMD Opteron 6164 HE 12-Core
40 GB RAM (10 x 4GB DDR1333)
Debian 6.0
在此服务器上使用mbw
我得到以下数字:
foo1:~# mbw -n 3 1024
Long uses 8 bytes. Allocating 2*134217728 elements = 2147483648 bytes of memory.
Using 262144 bytes as blocks for memcpy block copy test.
Getting down to business... Doing 3 runs per test.
0 Method: MEMCPY Elapsed: 0.58047 MiB: 1024.00000 Copy: 1764.082 MiB/s
1 Method: MEMCPY Elapsed: 0.58012 MiB: 1024.00000 Copy: 1765.152 MiB/s
2 Method: MEMCPY Elapsed: 0.58010 MiB: 1024.00000 Copy: 1765.201 MiB/s
AVG Method: MEMCPY Elapsed: 0.58023 MiB: 1024.00000 Copy: 1764.811 MiB/s
0 Method: DUMB Elapsed: 0.36174 MiB: 1024.00000 Copy: 2830.778 MiB/s
1 Method: DUMB Elapsed: 0.35869 MiB: 1024.00000 Copy: 2854.817 MiB/s
2 Method: DUMB Elapsed: 0.35848 MiB: 1024.00000 Copy: 2856.481 MiB/s
AVG Method: DUMB Elapsed: 0.35964 MiB: 1024.00000 Copy: 2847.310 MiB/s
0 Method: MCBLOCK Elapsed: 0.23546 MiB: 1024.00000 Copy: 4348.860 MiB/s
1 Method: MCBLOCK Elapsed: 0.23544 MiB: 1024.00000 Copy: 4349.230 MiB/s
2 Method: MCBLOCK Elapsed: 0.23544 MiB: 1024.00000 Copy: 4349.359 MiB/s
AVG Method: MCBLOCK Elapsed: 0.23545 MiB: 1024.00000 Copy: 4349.149 MiB/s
在我的另一台服务器上(基于 Intel Xeon E3-1270):
foo2:~# mbw -n 3 1024
Long uses 8 bytes. Allocating 2*134217728 elements = 2147483648 bytes of memory.
Using 262144 bytes as blocks for memcpy block copy test.
Getting down to business... Doing 3 runs per test.
0 Method: MEMCPY Elapsed: 0.18960 MiB: 1024.00000 Copy: 5400.901 MiB/s
1 Method: MEMCPY Elapsed: 0.18922 MiB: 1024.00000 Copy: 5411.690 MiB/s
2 Method: MEMCPY Elapsed: 0.18944 MiB: 1024.00000 Copy: 5405.491 MiB/s
AVG Method: MEMCPY Elapsed: 0.18942 MiB: 1024.00000 Copy: 5406.024 MiB/s
0 Method: DUMB Elapsed: 0.14838 MiB: 1024.00000 Copy: 6901.200 MiB/s
1 Method: DUMB Elapsed: 0.14818 MiB: 1024.00000 Copy: 6910.561 MiB/s
2 Method: DUMB Elapsed: 0.14820 MiB: 1024.00000 Copy: 6909.628 MiB/s
AVG Method: DUMB Elapsed: 0.14825 MiB: 1024.00000 Copy: 6907.127 MiB/s
0 Method: MCBLOCK Elapsed: 0.04362 MiB: 1024.00000 Copy: 23477.623 MiB/s
1 Method: MCBLOCK Elapsed: 0.04262 MiB: 1024.00000 Copy: 24025.151 MiB/s
2 Method: MCBLOCK Elapsed: 0.04258 MiB: 1024.00000 Copy: 24048.849 MiB/s
AVG Method: MCBLOCK Elapsed: 0.04294 MiB: 1024.00000 Copy: 23847.599 MiB/s
作为参考,以下是我在基于英特尔的笔记本电脑上获得的结果:
laptop:~$ mbw -n 3 1024
Long uses 8 bytes. Allocating 2*134217728 elements = 2147483648 bytes of memory.
Using 262144 bytes as blocks for memcpy block copy test.
Getting down to business... Doing 3 runs per test.
0 Method: MEMCPY Elapsed: 0.40566 MiB: 1024.00000 Copy: 2524.269 MiB/s
1 Method: MEMCPY Elapsed: 0.38458 MiB: 1024.00000 Copy: 2662.638 MiB/s
2 Method: MEMCPY Elapsed: 0.38876 MiB: 1024.00000 Copy: 2634.043 MiB/s
AVG Method: MEMCPY Elapsed: 0.39300 MiB: 1024.00000 Copy: 2605.600 MiB/s
0 Method: DUMB Elapsed: 0.30707 MiB: 1024.00000 Copy: 3334.745 MiB/s
1 Method: DUMB Elapsed: 0.30425 MiB: 1024.00000 Copy: 3365.653 MiB/s
2 Method: DUMB Elapsed: 0.30342 MiB: 1024.00000 Copy: 3374.849 MiB/s
AVG Method: DUMB Elapsed: 0.30491 MiB: 1024.00000 Copy: 3358.328 MiB/s
0 Method: MCBLOCK Elapsed: 0.07875 MiB: 1024.00000 Copy: 13003.670 MiB/s
1 Method: MCBLOCK Elapsed: 0.08374 MiB: 1024.00000 Copy: 12228.034 MiB/s
2 Method: MCBLOCK Elapsed: 0.07635 MiB: 1024.00000 Copy: 13411.216 MiB/s
AVG Method: MCBLOCK Elapsed: 0.07961 MiB: 1024.00000 Copy: 12862.006 MiB/s
因此根据mbw
我的笔记本电脑比服务器快 3 倍!请帮我解释一下。我也尝试过安装一个 RAM 磁盘并使用 dd 对其进行基准测试,结果也出现了类似的差异,所以我认为这不是mbw
问题所在。
我检查了 BIOS 设置,内存似乎全速运行。根据托管公司的说法,模块都正常。
这可能与 NUMA 有关吗?似乎此服务器上已禁用节点交错。启用它(从而关闭 NUMA)会有什么不同吗?
foo1:~# numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5
node 0 size: 8190 MB
node 0 free: 7898 MB
node 1 cpus: 6 7 8 9 10 11
node 1 size: 12288 MB
node 1 free: 12073 MB
node 2 cpus: 18 19 20 21 22 23
node 2 size: 12288 MB
node 2 free: 12034 MB
node 3 cpus: 12 13 14 15 16 17
node 3 size: 8192 MB
node 3 free: 8032 MB
node distances:
node 0 1 2 3
0: 10 20 20 20
1: 20 10 20 20
2: 20 20 10 20
3: 20 20 20 10
更新:
已禁用 NUMA(Linux 启动时 numa=off)并在 BIOS 中禁用 ECC。无变化,仍为与上述相同的数字。
更新2:
以下是内存的布局dmidecode
:
PROC 1 DIMM 1
PROC 1 DIMM 4
PROC 1 DIMM 7
PROC 1 DIMM 10
PROC 1 DIMM 12
PROC 2 DIMM 1
PROC 2 DIMM 4
PROC 2 DIMM 7
PROC 2 DIMM 10
PROC 2 DIMM 12
这些都是4GB 三星模块(部件编号 M393B5270CH0-CH9)
我看过了HP 文档介绍了如何在该服务器中填充内存如果我理解正确的话,当前位于 DIMM 12 中的模块应该放置在 DIMM 3 插槽中。这样的错误配置可以解释我得到的结果吗?
更新 3:
我现在已移除 2 个模块,以便在每侧获得 4x4 GB (4-4),放置在 1-4-7-10 中。不幸的是,我在基准测试中没有看到任何差异。服务器现在不应该能够使用所有四个通道吗?我也尝试过使用stream
多线程进行基准测试,结果非常令人失望。我现在唯一能想到的就是要求托管公司更换整个服务器......
更新 4:
我昨天测试最后一个设置(32 GB)时肯定做错了什么,stream
因为今天我看到了非常好的结果:
foo1:~# ./stream
-------------------------------------------------------------
STREAM version $Revision: 5.9 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 2000000, Offset = 0
Total memory required = 45.8 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Number of Threads requested = 24
-------------------------------------------------------------
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 703 microseconds.
(= 703 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function Rate (MB/s) Avg time Min time Max time
Copy: 36873.0022 0.0009 0.0009 0.0010
Scale: 34699.5160 0.0009 0.0009 0.0010
Add: 30868.8427 0.0016 0.0016 0.0017
Triad: 25558.7904 0.0019 0.0019 0.0020
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------
(我已经放弃了,mbw
因为它只能在单线程模式下运行。它在这个服务器上仍然给出同样糟糕的结果)。
因此问题肯定出在最后两个 4GB 模块上,它们迫使服务器以单通道模式运行,正如 @chx 在下面指出的那样。现在唯一剩下的问题是,是否可以使用 40 GB 并仍然获得全部带宽?我可以使用 2 x 8GB + 6 x 4GB 吗?我将较大的模块放在哪个通道有关系吗?
答案1
您正在强制系统以单通道 (!) 模式运行,每个 CPU 使用 5-5 个模块,而不是 4-4 或 8-8。这就是原因。尝试删除 1 - 1 并报告结果。
6164 是一款 G34 插槽 CPU,如果内存模块设置正确,它可以进行四通道操作。您的设置是最糟糕的。