我在运行 CentOS 6 且配备 6 * 8GB RAM 的 Dell PowerEdge R420 上遇到了问题。
我开始看到:
Nov 9 16:43:45 hostname kernel: [20343924.149151] sbridge: HANDLING MCE MEMORY ERROR
Nov 9 16:43:45 hostname kernel: [20343924.149156] CPU 0: Machine Check Exception: 0 Bank 9: cc00008c000800c1
Nov 9 16:43:45 hostname kernel: [20343924.149160] TSC 0 ADDR 421c11000 MISC 90024f4dae6988c PROCESSOR 0:206d7 TIME 1478727825 SOCKET 0 APIC 0
[...]
Nov 9 16:43:46 hostname kernel: [20343925.090225] EDAC sbridge: Lost 6 memory errors
Nov 9 16:43:46 hostname kernel: [20343925.090369] EDAC MC0: CE row 2, channel 0, label "CPU_SrcID#0_Channel#3_DIMM#0": 2 Unknown error(s): memory scrubbing on FATAL area OVERFLOW: cpu=0 Err=0008:00c1 (ch=1), addr = 0x421c0c000 => socket=0, Channel=3(mask=8), rank=0
Nov 9 16:43:46 hostname kernel: [20343925.090373]
[...] It repeats multiple time for row 0 and 1 as well.
这让我认为内存 DIMM 没有故障,而是控制器有故障。同一控制器上的所有 3 个 DIMM 都有故障,而另一个控制器上没有故障的可能性很小。
$ edac-util -v
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow0: 0 Uncorrected Errors
mc0: csrow0: CPU_SrcID#0_Channel#1_DIMM#0: 754182 Corrected Errors
mc0: csrow1: 0 Uncorrected Errors
mc0: csrow1: CPU_SrcID#0_Channel#2_DIMM#0: 7181 Corrected Errors
mc0: csrow2: 0 Uncorrected Errors
mc0: csrow2: CPU_SrcID#0_Channel#3_DIMM#0: 16891 Corrected Errors
mc1: 0 Uncorrected Errors with no DIMM info
mc1: 0 Corrected Errors with no DIMM info
mc1: csrow0: 0 Uncorrected Errors
mc1: csrow0: CPU_SrcID#1_Channel#1_DIMM#0: 0 Corrected Errors
mc1: csrow1: 0 Uncorrected Errors
mc1: csrow1: CPU_SrcID#1_Channel#2_DIMM#0: 0 Corrected Errors
mc1: csrow2: 0 Uncorrected Errors
mc1: csrow2: CPU_SrcID#1_Channel#3_DIMM#0: 0 Corrected Errors
有人能确认这个问题吗?我该怎么办?如果这真的是坏掉的 DIMM,我怎么知道是哪一个?
*-memory
description: System Memory
physical id: 1000
slot: System board or motherboard
size: 48GiB
*-bank:0
description: DIMM DDR3 Synchronous 1333 MHz (0.8 ns)
product: 36KSF1G72PZ-1G4M1
vendor: 002C04B3002C
physical id: 0
serial: 3067A18C
slot: DIMM_A1
size: 8GiB
width: 64 bits
clock: 1333MHz (0.8ns)
*-bank:1
description: DIMM DDR3 Synchronous 1333 MHz (0.8 ns)
product: 36KSF1G72PZ-1G4M1
vendor: 002C04B3002C
physical id: 1
serial: 30679E65
slot: DIMM_A2
size: 8GiB
width: 64 bits
clock: 1333MHz (0.8ns)
*-bank:2
description: DIMM DDR3 Synchronous 1333 MHz (0.8 ns)
product: 36KSF1G72PZ-1G4M1
vendor: 002C04B3002C
physical id: 2
serial: 30679E66
slot: DIMM_A3
size: 8GiB
width: 64 bits
clock: 1333MHz (0.8ns)
*-bank:6
description: DIMM DDR3 Synchronous 1333 MHz (0.8 ns)
product: 36KSF1G72PZ-1G4M1
vendor: 002C04B3002C
physical id: 6
serial: 30679E63
slot: DIMM_B1
size: 8GiB
width: 64 bits
clock: 1333MHz (0.8ns)
*-bank:7
description: DIMM DDR3 Synchronous 1333 MHz (0.8 ns)
product: 36KSF1G72PZ-1G4M1
vendor: 002C04B3002C
physical id: 7
serial: 30679CF1
slot: DIMM_B2
size: 8GiB
width: 64 bits
clock: 1333MHz (0.8ns)
*-bank:8
description: DIMM DDR3 Synchronous 1333 MHz (0.8 ns)
product: 36KSF1G72PZ-1G4M1
vendor: 002C04B3002C
physical id: 8
serial: 30679CEF
slot: DIMM_B3
size: 8GiB
width: 64 bits
clock: 1333MHz (0.8ns)
答案1
您可以将 edac 列入黑名单,然后 idrac/bmc 将在硬件级别记录插槽。edac 模块阻止硬件记录问题。 https://www.dell.com/support/article/en-us/sln283389/edac-errors-in-messages-log-in-redhat-enterprise-linux-rhel-and-poweredge?lang=en