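We have an Intel Xeon Gold 6230 based server running Ubuntu 20.04.5 LTS with a particular memory configuration. It has 2 sockets with 6 memory channels per socket, and 8 memory slots per socket, each populated with a 32 GB DIMM module, so 2 of the 6 channels hold 2 memory modules and the rest hold only one, as shown at https://www.thomas-krenn.com/en/wiki/Optimize_memory_performance_of_Intel_Xeon_Scalable_systems#Dual_CPU_systems_with_16_DIMM_slots in the last column: 16 DIMMs (8 per CPU).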
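As a result, the physical address space of each NUMA node is split into 2 distinct regions: the lower 3/4 of the addresses is interleaved across all 6 channels, while the upper 1/4 is interleaved across only 2 channels.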
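The 3/4 and 1/4 split can be reproduced with a little arithmetic. The sketch below is a simplified model, not a description of what the firmware actually does: it assumes each region interleaves across every channel that still has unassigned capacity. The function name `interleave_regions` and the channel layout list are illustrative, with the layout taken from the Bank Locator output in this post (Channel0 of each memory controller holds 2 DIMMs, the other channels hold 1).

```python
def interleave_regions(dimms_per_channel, dimm_gib):
    """Split one socket's capacity into interleave regions.

    Simplified model (an assumption): each region interleaves across all
    channels that still have capacity left, and is as deep as the
    smallest remaining channel allows.
    Returns a list of (channel_count, region_size_gib) tuples.
    """
    remaining = [n * dimm_gib for n in dimms_per_channel]
    regions = []
    while any(remaining):
        active = [r for r in remaining if r > 0]
        depth = min(active)  # capacity each active channel contributes
        regions.append((len(active), depth * len(active)))
        remaining = [r - depth if r > 0 else 0 for r in remaining]
    return regions


# Per socket: two memory controllers, three channels each,
# Channel0 holds 2 DIMMs and Channel1/Channel2 hold 1 DIMM.
layout = [2, 1, 1, 2, 1, 1]
regions = interleave_regions(layout, dimm_gib=32)
total = sum(size for _, size in regions)

for channels, size in regions:
    print(f"{size} GiB across {channels} channels ({size / total:.2f} of the socket)")
```

Under this model the socket yields a 192 GiB region on 6 channels and a 64 GiB region on 2 channels, which matches the sizes of the address ranges reported by dmidecode.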
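We noticed this when we tried to use huge pages for our computations: with >= 12 threads we got a 2x slowdown instead of the expected speedup, because for some reason huge pages tend to be allocated in the deficient upper 1/4 of the physical addresses.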
I tried to exclude these regions by setting
GRUB_CMDLINE_LINUX_DEFAULT="memmap=0x1000000000\$0x3040000000 memmap=0x1000000000\$0x7040000000"
in /etc/default/grub, but the server simply fails to boot with these parameters.
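As a cross-check of the sizes and start addresses used above, the following sketch derives the memmap arguments from the start and end addresses that dmidecode reports for the two 2-channel regions. It only does the arithmetic for the kernel's `memmap=nn$ss` form (mark a region as reserved); the helper name `memmap_reserve` is hypothetical, and how `$` has to be escaped in /etc/default/grub is deliberately left out.

```python
def memmap_reserve(start, end):
    """Build a 'memmap=size$start' kernel argument for an inclusive range."""
    size = end - start + 1  # dmidecode ending addresses are inclusive
    return f"memmap={size:#x}${start:#x}"


# Upper (2-channel) regions of each socket, taken from the type 20 output.
upper_regions = [
    (0x3040000000, 0x403FFFFFFF),  # socket 0
    (0x7040000000, 0x803FFFFFFF),  # socket 1
]

args = [memmap_reserve(start, end) for start, end in upper_regions]
print(" ".join(args))
# memmap=0x1000000000$0x3040000000 memmap=0x1000000000$0x7040000000
```

Both regions come out at 0x1000000000 bytes (64 GiB), so the numbers in the attempted command line agree with the DMI data.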
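So the question is: is there a way to prevent the OS from using these deficient physical address ranges, for example by marking them as reserved or by creating custom NUMA nodes for them? Other than removing the 4 extra DIMM modules, that is, which would be a somewhat too simple solution :)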
Here are the outputs of dmidecode --type 17 | grep '^Handle\|Bank Locator' and dmidecode --type 20 | grep 'Handle\|ing Address':
Handle 0x0010, DMI type 17, 84 bytes
Bank Locator: P0_Node0_Channel0_Dimm0
Handle 0x0011, DMI type 17, 84 bytes
Bank Locator: P0_Node0_Channel0_Dimm1
Handle 0x0012, DMI type 17, 84 bytes
Bank Locator: P0_Node0_Channel1_Dimm0
Handle 0x0013, DMI type 17, 84 bytes
Bank Locator: P0_Node0_Channel2_Dimm0
Handle 0x0014, DMI type 17, 84 bytes
Bank Locator: P0_Node1_Channel0_Dimm0
Handle 0x0015, DMI type 17, 84 bytes
Bank Locator: P0_Node1_Channel0_Dimm1
Handle 0x0016, DMI type 17, 84 bytes
Bank Locator: P0_Node1_Channel1_Dimm0
Handle 0x0017, DMI type 17, 84 bytes
Bank Locator: P0_Node1_Channel2_Dimm0
Handle 0x0018, DMI type 17, 84 bytes
Bank Locator: P1_Node0_Channel0_Dimm0
Handle 0x0019, DMI type 17, 84 bytes
Bank Locator: P1_Node0_Channel0_Dimm1
Handle 0x001A, DMI type 17, 84 bytes
Bank Locator: P1_Node0_Channel1_Dimm0
Handle 0x001B, DMI type 17, 84 bytes
Bank Locator: P1_Node0_Channel2_Dimm0
Handle 0x001C, DMI type 17, 84 bytes
Bank Locator: P1_Node1_Channel0_Dimm0
Handle 0x001D, DMI type 17, 84 bytes
Bank Locator: P1_Node1_Channel0_Dimm1
Handle 0x001E, DMI type 17, 84 bytes
Bank Locator: P1_Node1_Channel1_Dimm0
Handle 0x001F, DMI type 17, 84 bytes
Bank Locator: P1_Node1_Channel2_Dimm0
Handle 0x0021, DMI type 20, 35 bytes
Starting Address: 0x00000000000
Ending Address: 0x0007FFFFFFF
Physical Device Handle: 0x0010
Memory Array Mapped Address Handle: 0x0020
Handle 0x0022, DMI type 20, 35 bytes
Starting Address: 0x00000000000
Ending Address: 0x0007FFFFFFF
Physical Device Handle: 0x0011
Memory Array Mapped Address Handle: 0x0020
Handle 0x0023, DMI type 20, 35 bytes
Starting Address: 0x00000000000
Ending Address: 0x0007FFFFFFF
Physical Device Handle: 0x0012
Memory Array Mapped Address Handle: 0x0020
Handle 0x0024, DMI type 20, 35 bytes
Starting Address: 0x00000000000
Ending Address: 0x0007FFFFFFF
Physical Device Handle: 0x0013
Memory Array Mapped Address Handle: 0x0020
Handle 0x0025, DMI type 20, 35 bytes
Starting Address: 0x00000000000
Ending Address: 0x0007FFFFFFF
Physical Device Handle: 0x0014
Memory Array Mapped Address Handle: 0x0020
Handle 0x0026, DMI type 20, 35 bytes
Starting Address: 0x00000000000
Ending Address: 0x0007FFFFFFF
Physical Device Handle: 0x0015
Memory Array Mapped Address Handle: 0x0020
Handle 0x0027, DMI type 20, 35 bytes
Starting Address: 0x00000000000
Ending Address: 0x0007FFFFFFF
Physical Device Handle: 0x0016
Memory Array Mapped Address Handle: 0x0020
Handle 0x0028, DMI type 20, 35 bytes
Starting Address: 0x00000000000
Ending Address: 0x0007FFFFFFF
Physical Device Handle: 0x0017
Memory Array Mapped Address Handle: 0x0020
Handle 0x002A, DMI type 20, 35 bytes
Starting Address: 0x00100000000
Ending Address: 0x0303FFFFFFF
Physical Device Handle: 0x0010
Memory Array Mapped Address Handle: 0x0029
Handle 0x002B, DMI type 20, 35 bytes
Starting Address: 0x00100000000
Ending Address: 0x0303FFFFFFF
Physical Device Handle: 0x0011
Memory Array Mapped Address Handle: 0x0029
Handle 0x002C, DMI type 20, 35 bytes
Starting Address: 0x00100000000
Ending Address: 0x0303FFFFFFF
Physical Device Handle: 0x0012
Memory Array Mapped Address Handle: 0x0029
Handle 0x002D, DMI type 20, 35 bytes
Starting Address: 0x00100000000
Ending Address: 0x0303FFFFFFF
Physical Device Handle: 0x0013
Memory Array Mapped Address Handle: 0x0029
Handle 0x002E, DMI type 20, 35 bytes
Starting Address: 0x00100000000
Ending Address: 0x0303FFFFFFF
Physical Device Handle: 0x0014
Memory Array Mapped Address Handle: 0x0029
Handle 0x002F, DMI type 20, 35 bytes
Starting Address: 0x00100000000
Ending Address: 0x0303FFFFFFF
Physical Device Handle: 0x0015
Memory Array Mapped Address Handle: 0x0029
Handle 0x0030, DMI type 20, 35 bytes
Starting Address: 0x00100000000
Ending Address: 0x0303FFFFFFF
Physical Device Handle: 0x0016
Memory Array Mapped Address Handle: 0x0029
Handle 0x0031, DMI type 20, 35 bytes
Starting Address: 0x00100000000
Ending Address: 0x0303FFFFFFF
Physical Device Handle: 0x0017
Memory Array Mapped Address Handle: 0x0029
Handle 0x0033, DMI type 20, 35 bytes
Starting Address: 0x03040000000
Ending Address: 0x0403FFFFFFF
Physical Device Handle: 0x0010
Memory Array Mapped Address Handle: 0x0032
Handle 0x0034, DMI type 20, 35 bytes
Starting Address: 0x03040000000
Ending Address: 0x0403FFFFFFF
Physical Device Handle: 0x0011
Memory Array Mapped Address Handle: 0x0032
Handle 0x0035, DMI type 20, 35 bytes
Starting Address: 0x03040000000
Ending Address: 0x0403FFFFFFF
Physical Device Handle: 0x0014
Memory Array Mapped Address Handle: 0x0032
Handle 0x0036, DMI type 20, 35 bytes
Starting Address: 0x03040000000
Ending Address: 0x0403FFFFFFF
Physical Device Handle: 0x0015
Memory Array Mapped Address Handle: 0x0032
Handle 0x0038, DMI type 20, 35 bytes
Starting Address: 0x04040000000
Ending Address: 0x0703FFFFFFF
Physical Device Handle: 0x0018
Memory Array Mapped Address Handle: 0x0037
Handle 0x0039, DMI type 20, 35 bytes
Starting Address: 0x04040000000
Ending Address: 0x0703FFFFFFF
Physical Device Handle: 0x0019
Memory Array Mapped Address Handle: 0x0037
Handle 0x003A, DMI type 20, 35 bytes
Starting Address: 0x04040000000
Ending Address: 0x0703FFFFFFF
Physical Device Handle: 0x001A
Memory Array Mapped Address Handle: 0x0037
Handle 0x003B, DMI type 20, 35 bytes
Starting Address: 0x04040000000
Ending Address: 0x0703FFFFFFF
Physical Device Handle: 0x001B
Memory Array Mapped Address Handle: 0x0037
Handle 0x003C, DMI type 20, 35 bytes
Starting Address: 0x04040000000
Ending Address: 0x0703FFFFFFF
Physical Device Handle: 0x001C
Memory Array Mapped Address Handle: 0x0037
Handle 0x003D, DMI type 20, 35 bytes
Starting Address: 0x04040000000
Ending Address: 0x0703FFFFFFF
Physical Device Handle: 0x001D
Memory Array Mapped Address Handle: 0x0037
Handle 0x003E, DMI type 20, 35 bytes
Starting Address: 0x04040000000
Ending Address: 0x0703FFFFFFF
Physical Device Handle: 0x001E
Memory Array Mapped Address Handle: 0x0037
Handle 0x003F, DMI type 20, 35 bytes
Starting Address: 0x04040000000
Ending Address: 0x0703FFFFFFF
Physical Device Handle: 0x001F
Memory Array Mapped Address Handle: 0x0037
Handle 0x0041, DMI type 20, 35 bytes
Starting Address: 0x07040000000
Ending Address: 0x0803FFFFFFF
Physical Device Handle: 0x0018
Memory Array Mapped Address Handle: 0x0040
Handle 0x0042, DMI type 20, 35 bytes
Starting Address: 0x07040000000
Ending Address: 0x0803FFFFFFF
Physical Device Handle: 0x0019
Memory Array Mapped Address Handle: 0x0040
Handle 0x0043, DMI type 20, 35 bytes
Starting Address: 0x07040000000
Ending Address: 0x0803FFFFFFF
Physical Device Handle: 0x001C
Memory Array Mapped Address Handle: 0x0040
Handle 0x0044, DMI type 20, 35 bytes
Starting Address: 0x07040000000
Ending Address: 0x0803FFFFFFF
Physical Device Handle: 0x001D
Memory Array Mapped Address Handle: 0x0040
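To read the listing above more easily, the type 20 entries can be grouped by address range, counting how many DIMMs back each range. This is a minimal sketch that parses text in the same shape as the filtered dmidecode output shown here; the function name `group_mapped_ranges` is illustrative and the sample is a shortened excerpt of the listing, not the full output.

```python
import re
from collections import defaultdict


def group_mapped_ranges(text):
    """Map (start, end) address ranges to the DIMM handles that back them."""
    ranges = defaultdict(list)
    start = end = None
    for raw in text.splitlines():
        line = raw.strip()
        if m := re.match(r"Starting Address: (0x[0-9A-Fa-f]+)", line):
            start = int(m.group(1), 16)
        elif m := re.match(r"Ending Address: (0x[0-9A-Fa-f]+)", line):
            end = int(m.group(1), 16)
        elif m := re.match(r"Physical Device Handle: (0x[0-9A-Fa-f]+)", line):
            # One device per type 20 entry; file it under its range.
            ranges[(start, end)].append(m.group(1))
    return dict(ranges)


# Shortened excerpt of the output above.
sample = """\
Handle 0x002A, DMI type 20, 35 bytes
Starting Address: 0x00100000000
Ending Address: 0x0303FFFFFFF
Physical Device Handle: 0x0010
Memory Array Mapped Address Handle: 0x0029
Handle 0x0033, DMI type 20, 35 bytes
Starting Address: 0x03040000000
Ending Address: 0x0403FFFFFFF
Physical Device Handle: 0x0010
Memory Array Mapped Address Handle: 0x0032
Handle 0x0034, DMI type 20, 35 bytes
Starting Address: 0x03040000000
Ending Address: 0x0403FFFFFFF
Physical Device Handle: 0x0011
Memory Array Mapped Address Handle: 0x0032
"""

grouped = group_mapped_ranges(sample)
for (start, end), handles in sorted(grouped.items()):
    size_gib = (end - start + 1) / 2**30
    print(f"{start:#x}-{end:#x}: {size_gib:g} GiB, {len(handles)} DIMM(s) listed")
```

Run against the full listing, the ranges 0x3040000000-0x403FFFFFFF and 0x7040000000-0x803FFFFFFF each map to only 4 DIMMs (the two doubly populated channels of a socket), while the other ranges map to all 8.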