Faulty memory or faulty controller?

I'm having a problem with a Dell PowerEdge R420 running CentOS 6 with 6 x 8 GB of RAM.

I started seeing:

Nov  9 16:43:45 hostname kernel: [20343924.149151] sbridge: HANDLING MCE MEMORY ERROR
Nov  9 16:43:45 hostname kernel: [20343924.149156] CPU 0: Machine Check Exception: 0 Bank 9: cc00008c000800c1
Nov  9 16:43:45 hostname kernel: [20343924.149160] TSC 0 ADDR 421c11000 MISC 90024f4dae6988c PROCESSOR 0:206d7 TIME 1478727825 SOCKET 0 APIC 0
[...]

Nov  9 16:43:46 hostname kernel: [20343925.090225] EDAC sbridge: Lost 6 memory errors
Nov  9 16:43:46 hostname kernel: [20343925.090369] EDAC MC0: CE row 2, channel 0, label "CPU_SrcID#0_Channel#3_DIMM#0": 2 Unknown error(s): memory scrubbing on FATAL area OVERFLOW: cpu=0 Err=0008:00c1 (ch=1), addr = 0x421c0c000 => socket=0, Channel=3(mask=8), rank=0
Nov  9 16:43:46 hostname kernel: [20343925.090373]
[...] This repeats multiple times for rows 0 and 1 as well.
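
For reference, the status word in the MCE line (`cc00008c000800c1`) can be unpacked by hand. This is a minimal Python sketch, assuming the architectural IA32_MCi_STATUS layout described in the Intel SDM; the corrected-error count in bits 52:38 is only meaningful on CPUs that support CMCI, and the function names are hypothetical, chosen just for this illustration.

```python
# Sketch: decode the architectural fields of an IA32_MCi_STATUS value.
# Assumes the bit layout from the Intel SDM; helper names are illustrative only.

MEM_ERROR_TYPES = {
    0: "generic undefined request",
    1: "memory read",
    2: "memory write",
    3: "address/command",
    4: "memory scrubbing",
}


def decode_mci_status(status):
    """Split a 64-bit MCi_STATUS value into its main flag and code fields."""
    return {
        "valid": bool((status >> 63) & 1),
        "overflow": bool((status >> 62) & 1),
        "uncorrected": bool((status >> 61) & 1),
        "misc_valid": bool((status >> 59) & 1),
        "addr_valid": bool((status >> 58) & 1),
        # Bits 52:38: corrected error count (assumes CMCI support).
        "corrected_count": (status >> 38) & 0x7FFF,
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_error_code": status & 0xFFFF,
    }


def decode_memory_error(mca_code):
    """Decode a memory-controller compound code (0000 0000 1MMM CCCC).

    Returns None when the code is not a memory-controller error.
    """
    if (mca_code & 0xFF80) != 0x0080:
        return None
    channel = mca_code & 0xF
    return {
        "type": MEM_ERROR_TYPES.get((mca_code >> 4) & 0x7, "reserved"),
        # Channel value 15 means "channel not specified".
        "channel": None if channel == 0xF else channel,
    }


if __name__ == "__main__":
    fields = decode_mci_status(0xCC00008C000800C1)
    print(fields)
    print(decode_memory_error(fields["mca_error_code"]))
```

For the value in the log, this decodes to a corrected (not uncorrected) error with the overflow bit set, a count of 2, and a memory scrubbing error on channel 1, which agrees with what the EDAC line reports (`Err=0008:00c1 (ch=1)`).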

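This makes me think that the DIMMs themselves are not faulty and that the controller is. It seems very unlikely that all three DIMMs on one controller would fail while none of those on the other controller do.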

$ edac-util -v
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow0: 0 Uncorrected Errors
mc0: csrow0: CPU_SrcID#0_Channel#1_DIMM#0: 754182 Corrected Errors
mc0: csrow1: 0 Uncorrected Errors
mc0: csrow1: CPU_SrcID#0_Channel#2_DIMM#0: 7181 Corrected Errors
mc0: csrow2: 0 Uncorrected Errors
mc0: csrow2: CPU_SrcID#0_Channel#3_DIMM#0: 16891 Corrected Errors
mc1: 0 Uncorrected Errors with no DIMM info
mc1: 0 Corrected Errors with no DIMM info
mc1: csrow0: 0 Uncorrected Errors
mc1: csrow0: CPU_SrcID#1_Channel#1_DIMM#0: 0 Corrected Errors
mc1: csrow1: 0 Uncorrected Errors
mc1: csrow1: CPU_SrcID#1_Channel#2_DIMM#0: 0 Corrected Errors
mc1: csrow2: 0 Uncorrected Errors
mc1: csrow2: CPU_SrcID#1_Channel#3_DIMM#0: 0 Corrected Errors
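
To compare the counters per controller and per DIMM label, the `edac-util -v` output can be tallied with a few lines of Python. This is only a sketch: it assumes the line format shown above, the sample text is copied from this post, and the function name is hypothetical.

```python
# Sketch: sum corrected-error counts per memory controller from `edac-util -v`.
# Assumes the "mcN: csrowN: LABEL: COUNT Corrected Errors" line format shown above.
import re
from collections import defaultdict

SAMPLE = """\
mc0: csrow0: 0 Uncorrected Errors
mc0: csrow0: CPU_SrcID#0_Channel#1_DIMM#0: 754182 Corrected Errors
mc0: csrow1: CPU_SrcID#0_Channel#2_DIMM#0: 7181 Corrected Errors
mc0: csrow2: CPU_SrcID#0_Channel#3_DIMM#0: 16891 Corrected Errors
mc1: 0 Corrected Errors with no DIMM info
mc1: csrow0: CPU_SrcID#1_Channel#1_DIMM#0: 0 Corrected Errors
mc1: csrow1: CPU_SrcID#1_Channel#2_DIMM#0: 0 Corrected Errors
mc1: csrow2: CPU_SrcID#1_Channel#3_DIMM#0: 0 Corrected Errors
"""

# Only lines that carry a DIMM label are matched; summary lines are skipped.
LABELLED = re.compile(r"^(mc\d+): csrow\d+: (\S+): (\d+) Corrected Errors$")


def corrected_errors(text):
    """Return (totals per controller, counts per DIMM label)."""
    totals = defaultdict(int)
    per_dimm = {}
    for line in text.splitlines():
        match = LABELLED.match(line.strip())
        if match:
            controller, label, count = match.group(1), match.group(2), int(match.group(3))
            totals[controller] += count
            per_dimm[label] = count
    return dict(totals), per_dimm


if __name__ == "__main__":
    totals, per_dimm = corrected_errors(SAMPLE)
    for controller, total in sorted(totals.items()):
        print(controller, total)
    # List DIMM labels from most to fewest corrected errors.
    for label, count in sorted(per_dimm.items(), key=lambda kv: -kv[1]):
        print(label, count)
```

With the numbers above, every corrected error is on mc0 (778254 in total) and the bulk of them (754182) are reported against `CPU_SrcID#0_Channel#1_DIMM#0`.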

Can anyone confirm this? What should I do? And if it really is a bad DIMM, how can I tell which one it is?

*-memory
      description: System Memory
      physical id: 1000
      slot: System board or motherboard
      size: 48GiB
    *-bank:0
         description: DIMM DDR3 Synchronous 1333 MHz (0.8 ns)
         product: 36KSF1G72PZ-1G4M1
         vendor: 002C04B3002C
         physical id: 0
         serial: 3067A18C
         slot: DIMM_A1
         size: 8GiB
         width: 64 bits
         clock: 1333MHz (0.8ns)
    *-bank:1
         description: DIMM DDR3 Synchronous 1333 MHz (0.8 ns)
         product: 36KSF1G72PZ-1G4M1
         vendor: 002C04B3002C
         physical id: 1
         serial: 30679E65
         slot: DIMM_A2
         size: 8GiB
         width: 64 bits
         clock: 1333MHz (0.8ns)
    *-bank:2
         description: DIMM DDR3 Synchronous 1333 MHz (0.8 ns)
         product: 36KSF1G72PZ-1G4M1
         vendor: 002C04B3002C
         physical id: 2
         serial: 30679E66
         slot: DIMM_A3
         size: 8GiB
         width: 64 bits
         clock: 1333MHz (0.8ns)

    *-bank:6
         description: DIMM DDR3 Synchronous 1333 MHz (0.8 ns)
         product: 36KSF1G72PZ-1G4M1
         vendor: 002C04B3002C
         physical id: 6
         serial: 30679E63
         slot: DIMM_B1
         size: 8GiB
         width: 64 bits
         clock: 1333MHz (0.8ns)
    *-bank:7
         description: DIMM DDR3 Synchronous 1333 MHz (0.8 ns)
         product: 36KSF1G72PZ-1G4M1
         vendor: 002C04B3002C
         physical id: 7
         serial: 30679CF1
         slot: DIMM_B2
         size: 8GiB
         width: 64 bits
         clock: 1333MHz (0.8ns)
    *-bank:8
         description: DIMM DDR3 Synchronous 1333 MHz (0.8 ns)
         product: 36KSF1G72PZ-1G4M1
         vendor: 002C04B3002C
         physical id: 8
         serial: 30679CEF
         slot: DIMM_B3
         size: 8GiB
         width: 64 bits
         clock: 1333MHz (0.8ns)

Answer 1

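You can blacklist the edac modules; the iDRAC/BMC will then log the affected slot at the hardware level. While loaded, the edac module prevents the hardware from logging the problem. See https://www.dell.com/support/article/en-us/sln283389/edac-errors-in-messages-log-in-redhat-enterprise-linux-rhel-and-poweredge?lang=en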