EDAC memory errors after upgrading a SuperMicro server running CentOS 7. Are these errors caused by the motherboard, the OS, or a faulty RAM module?

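I have a server with a SuperMicro MBD-X9DRD-EF motherboard. It ran well on CentOS 7 with one CPU (Intel Original Xeon X6 E5-2620v2) and 128 GB (8x16 GB) of LVDDR memory (1600MHz Crucial ECC Reg RTL (PC3-12800)). Last month we upgraded this server, adding a second CPU and another 128 GB of memory identical to the existing modules. But after 3-4 days of intensive use, we started receiving errors like this, very frequently: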

[root@GBserver log]# dmesg
[614781.869098] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
[614781.869104] EDAC sbridge MC1: CPU 6: Machine Check Event: 0 Bank 7: 8c00004000010090
[614781.869106] EDAC sbridge MC1: TSC 0
[614781.869108] EDAC sbridge MC1: ADDR 38126a6c40
[614781.869110] EDAC sbridge MC1: MISC 14066ca86
[614781.869112] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1473082855 SOCKET 1 APIC 20
[614782.595676] EDAC MC1: 1 CE memory read error on CPU_SrcID#1_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x38126a6 offset:0xc40 grain:32 syndrome:0x0 -  area:DRAM err_code:0001:0090 socket:1 ha:0 channel_mask:1 rank:1)
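
The last line of that record already names a socket, channel and slot, and its page and offset fields can be cross-checked against the ADDR value in the MCE record. The following is a minimal sketch of that decoding, not an official tool: `parse_edac_ce` is a hypothetical helper, and the 4 KiB page size is an assumption.

```python
import re

# Sample line copied from the dmesg output above (timestamp dropped).
LINE = (
    "EDAC MC1: 1 CE memory read error on CPU_SrcID#1_Ha#0_Chan#0_DIMM#0 "
    "(channel:0 slot:0 page:0x38126a6 offset:0xc40 grain:32 syndrome:0x0 -  "
    "area:DRAM err_code:0001:0090 socket:1 ha:0 channel_mask:1 rank:1)"
)

PAGE_SHIFT = 12  # assumption: 4 KiB pages


def parse_edac_ce(line):
    """Hypothetical helper: extract the DIMM label, the key:value fields
    inside the parentheses, and the physical address they imply."""
    label = re.search(r" on (\S+) \(", line).group(1)
    inner = line[line.index("(") + 1 : line.rindex(")")]
    fields = dict(re.findall(r"(\w+):(\S+)", inner))
    # physical address = page number shifted by the page size, plus the offset
    addr = (int(fields["page"], 16) << PAGE_SHIFT) | int(fields["offset"], 16)
    return label, fields, addr


label, fields, addr = parse_edac_ce(LINE)
print(label, "socket", fields["socket"], "channel", fields["channel"], "slot", fields["slot"])
print(hex(addr))
```

For the line shown, the computed address is 0x38126a6c40, the same value as the ADDR field of the MCE record, so both messages describe the same event on socket 1, channel 0, slot 0.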

The output of edac-util is as follows:

[root@GBserver log]# edac-util -v
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow0: 0 Uncorrected Errors
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#0_DIMM#0: 0 Corrected Errors
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#1_DIMM#0: 0 Corrected Errors
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#2_DIMM#0: 0 Corrected Errors
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#3_DIMM#0: 0 Corrected Errors
mc0: csrow1: 0 Uncorrected Errors
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#0_DIMM#1: 0 Corrected Errors
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#1_DIMM#1: 0 Corrected Errors
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#2_DIMM#1: 0 Corrected Errors
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#3_DIMM#1: 0 Corrected Errors
mc1: 0 Uncorrected Errors with no DIMM info
mc1: 0 Corrected Errors with no DIMM info
mc1: csrow0: 0 Uncorrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#0_DIMM#0: 296182 Corrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#1_DIMM#0: 0 Corrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#2_DIMM#0: 0 Corrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#3_DIMM#0: 0 Corrected Errors
mc1: csrow1: 0 Uncorrected Errors
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#0_DIMM#1: 0 Corrected Errors
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#1_DIMM#1: 0 Corrected Errors
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#2_DIMM#1: 0 Corrected Errors
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#3_DIMM#1: 0 Corrected Errors

mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#0_DIMM#0: 296182 Corrected Errors
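
With 16 DIMMs installed, the non-zero counter is easy to miss by eye. As a rough sketch, the per-DIMM lines of `edac-util -v` can be filtered with a small script. The helper name `dimms_with_errors` is hypothetical, the regular expression assumes the exact line layout shown above, and the sample is a few lines copied from that output.

```python
import re

# A few lines copied from the edac-util -v output above.
SAMPLE = """\
mc0: csrow0: 0 Uncorrected Errors
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#0_DIMM#0: 0 Corrected Errors
mc1: 0 Corrected Errors with no DIMM info
mc1: csrow0: 0 Uncorrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#0_DIMM#0: 296182 Corrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#1_DIMM#0: 0 Corrected Errors
"""

# Matches only the per-DIMM lines: controller, csrow, label, count.
DIMM_LINE = re.compile(r"^(mc\d+): (csrow\d+): (\S+): (\d+) Corrected Errors$")


def dimms_with_errors(text):
    """Hypothetical helper: return (controller, csrow, label, count)
    for every DIMM whose corrected-error counter is non-zero."""
    hits = []
    for line in text.splitlines():
        m = DIMM_LINE.match(line.strip())
        if m and int(m.group(4)) > 0:
            hits.append((m.group(1), m.group(2), m.group(3), int(m.group(4))))
    return hits


for mc, csrow, label, count in dimms_with_errors(SAMPLE):
    print(f"{mc} {csrow} {label}: {count} corrected errors")
```

On a live system the same function could be fed the saved output of `edac-util -v` instead of the sample string.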

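Are these errors caused by a fault in the motherboard, the CPU, or the OS, or is a memory chip bad? What should we do? How do we find the faulty memory module?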

Answer 1

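After 3 weeks, about 11 million corrected errors had been logged. After checking the BIOS log, I found that a memory module was faulty. That answers my question.
Next, I will remove the faulty module and replace it with another one.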
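
To confirm after the swap that the counter stays at zero, the per-DIMM counters can also be read from the kernel's EDAC sysfs tree. This is a minimal sketch under stated assumptions: it assumes the running kernel exposes `dimm_label`, `dimm_ce_count` and `dimm_ue_count` under `/sys/devices/system/edac/mc/mc*/dimm*`, which may not hold on every kernel or driver, and `read_dimm_counters` is a hypothetical helper name.

```python
from pathlib import Path

# Assumption: the kernel exposes per-DIMM EDAC counters under this directory.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_dimm_counters(root=EDAC_ROOT):
    """Hypothetical helper: return [(label, ce_count, ue_count)],
    sorted so the DIMM with the most corrected errors comes first."""
    rows = []
    for dimm in sorted(Path(root).glob("mc*/dimm*")):
        if not dimm.is_dir():
            continue
        try:
            label = (dimm / "dimm_label").read_text().strip()
            ce = int((dimm / "dimm_ce_count").read_text())
            ue = int((dimm / "dimm_ue_count").read_text())
        except (OSError, ValueError):
            continue  # attribute missing or unreadable on this kernel
        rows.append((label, ce, ue))
    return sorted(rows, key=lambda r: r[1], reverse=True)


if __name__ == "__main__":
    for label, ce, ue in read_dimm_counters():
        print(f"{label}: {ce} corrected, {ue} uncorrected")
```

The label still has to be matched to a physical slot, for example against the BIOS log mentioned above.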
