How can I find out what is wrong with my RAM?

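I recently upgraded the memory in my Ubuntu 16.04 machine from 4x8GB to 8x8GB. The retailer promised that the new memory would be compatible with my configuration, but I have noticed that htop sometimes shows the full 64GB and sometimes only 48GB or even 16GB; it differs after every boot. The system also freezes several times a day. After one such freeze I looked at the system log: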

Nov  7 13:08:09 embpc0032 kernel: [ 4524.820086] EDAC MC0: 7 CE memory read error on CPU_SrcID#0_Ha#0_Chan#3_DIMM#1 (channel:3 slot:1 page:0xb382e offset:0x8c0 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM err_code:0001:0091 socket:0 ha:0 channel_mask:8 rank:4)
Nov  7 13:08:10 embpc0032 kernel: [ 4525.812100] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
Nov  7 13:08:10 embpc0032 kernel: [ 4525.812107] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 7: cc000b0000010091
Nov  7 13:08:10 embpc0032 kernel: [ 4525.812110] EDAC sbridge MC0: TSC 0 
Nov  7 13:08:10 embpc0032 kernel: [ 4525.812112] EDAC sbridge MC0: ADDR b382fcc0 EDAC sbridge MC0: MISC 14022a286 
Nov  7 13:08:10 embpc0032 kernel: [ 4525.812117] EDAC sbridge MC0: PROCESSOR 0:306e4 TIME 1510056490 SOCKET 0 APIC 0
Nov  7 13:08:10 embpc0032 kernel: [ 4525.820084] EDAC MC0: 44 CE memory read error on CPU_SrcID#0_Ha#0_Chan#3_DIMM#1 (channel:3 slot:1 page:0xb382f offset:0xcc0 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM err_code:0001:0091 socket:0 ha:0 channel_mask:8 rank:4)
Nov  7 13:08:11 embpc0032 kernel: [ 4526.812091] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
Nov  7 13:08:11 embpc0032 kernel: [ 4526.812098] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 7: cc0001c000010091
Nov  7 13:08:11 embpc0032 kernel: [ 4526.812101] EDAC sbridge MC0: TSC 0 
Nov  7 13:08:11 embpc0032 kernel: [ 4526.812103] EDAC sbridge MC0: ADDR b382fcc0 EDAC sbridge MC0: MISC 214022a286 
Nov  7 13:08:11 embpc0032 kernel: [ 4526.812108] EDAC sbridge MC0: PROCESSOR 0:306e4 TIME 1510056491 SOCKET 0 APIC 0
Nov  7 13:08:11 embpc0032 kernel: [ 4526.820076] EDAC MC0: 7 CE memory read error on CPU_SrcID#0_Ha#0_Chan#3_DIMM#1 (channel:3 slot:1 page:0xb382f offset:0xcc0 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM err_code:0001:0091 socket:0 ha:0 channel_mask:8 rank:4)
Nov  7 13:08:12 embpc0032 kernel: [ 4527.812083] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
Nov  7 13:08:12 embpc0032 kernel: [ 4527.812091] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 7: cc00048000010091
Nov  7 13:08:12 embpc0032 kernel: [ 4527.812093] EDAC sbridge MC0: TSC 0 
Nov  7 13:08:12 embpc0032 kernel: [ 4527.812096] EDAC sbridge MC0: ADDR b382fcc0 EDAC sbridge MC0: MISC 14022a286 
Nov  7 13:08:12 embpc0032 kernel: [ 4527.812101] EDAC sbridge MC0: PROCESSOR 0:306e4 TIME 1510056492 SOCKET 0 APIC 0
Nov  7 13:08:12 embpc0032 kernel: [ 4527.820096] EDAC MC0: 18 CE memory read error on CPU_SrcID#0_Ha#0_Chan#3_DIMM#1 (channel:3 slot:1 page:0xb382f offset:0xcc0 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM err_code:0001:0091 socket:0 ha:0 channel_mask:8 rank:4)
Nov  7 13:08:13 embpc0032 kernel: [ 4528.812100] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
Nov  7 13:08:13 embpc0032 kernel: [ 4528.812108] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 7: cc0001c000010091
Nov  7 13:08:13 embpc0032 kernel: [ 4528.812110] EDAC sbridge MC0: TSC 0 
Nov  7 13:08:13 embpc0032 kernel: [ 4528.812112] EDAC sbridge MC0: ADDR b382fcc0 EDAC sbridge MC0: MISC 214022a286 
Nov  7 13:08:13 embpc0032 kernel: [ 4528.812117] EDAC sbridge MC0: PROCESSOR 0:306e4 TIME 1510056493 SOCKET 0 APIC 0
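The `Machine Check Event` lines above carry a raw 64-bit status word. As a rough sketch, assuming that word follows the standard Intel IA32_MCi_STATUS layout (valid flag in bit 63, overflow in bit 62, uncorrected flag in bit 61, corrected error count in bits 52:38, MCA error code in bits 15:0), it can be unpacked in Python as below. The function name `decode_mce_status` is made up for this illustration.

```python
def decode_mce_status(status_hex):
    """Unpack a few fields of a machine-check status word.

    Assumes the standard IA32_MCi_STATUS bit layout; model-specific
    fields are returned raw and not interpreted.
    """
    status = int(status_hex, 16)
    return {
        "valid": bool((status >> 63) & 1),
        "overflow": bool((status >> 62) & 1),        # more errors occurred than were logged
        "uncorrected": bool((status >> 61) & 1),     # False means ECC corrected the error
        "corrected_count": (status >> 38) & 0x7FFF,  # bits 52:38
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_error_code": status & 0xFFFF,
    }


# The three distinct status words that appear in the log above.
for raw in ("cc000b0000010091", "cc0001c000010091", "cc00048000010091"):
    fields = decode_mce_status(raw)
    print(raw, fields["corrected_count"], fields["uncorrected"])
```

For these three values the corrected error counts come out as 44, 7 and 18, which match the `44 CE`, `7 CE` and `18 CE` lines that follow each event in the log, and the uncorrected flag is clear in all of them. That is consistent with ECC correcting these particular errors rather than with an uncorrectable fault at these timestamps.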

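After this, the log contains a run of NULL characters, and then the machine froze and rebooted. What could the problem be? What do channel and slot refer to in this context? It is a quad-channel motherboard (Fujitsu D3128-A2) fitted with DIMMs.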

Edit: I found the manual:

[Image from the manual showing the memory slot labels]

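Is it safe to say that channel:0 slot:0 in the error log refers to what the manual calls A1? In the log I found about 4000 memory errors, all of them on slot:1 across three channels and never on slot:0. All of my newly purchased RAM sits in the slots whose names end in 2 in the manual, so it looks to me as though all of the errors come from the new modules rather than from a single fault in one of the old ones.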
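A count like the one described here can be reproduced with a short script. The following is a minimal sketch, assuming the corrected-error lines all have the same shape as the `EDAC MC0: ... CE ...` lines shown earlier; the regular expression and the function name `tally_corrected_errors` are illustrative and not part of any EDAC tooling.

```python
import re
from collections import Counter

# Matches e.g. "EDAC MC0: 7 CE memory read error on ... (channel:3 slot:1 page:..."
CE_LINE = re.compile(r"EDAC MC\d+: (\d+) CE .*?\(channel:(\d+) slot:(\d+)")


def tally_corrected_errors(lines):
    """Sum the reported corrected-error counts per (channel, slot)."""
    totals = Counter()
    for line in lines:
        match = CE_LINE.search(line)
        if match:
            count, channel, slot = (int(g) for g in match.groups())
            totals[(channel, slot)] += count
    return totals


# Three lines copied from the log above; the middle one is not a CE report.
SAMPLE = [
    "Nov  7 13:08:09 embpc0032 kernel: [ 4524.820086] EDAC MC0: 7 CE memory read error "
    "on CPU_SrcID#0_Ha#0_Chan#3_DIMM#1 (channel:3 slot:1 page:0xb382e offset:0x8c0 grain:32",
    "Nov  7 13:08:10 embpc0032 kernel: [ 4525.812100] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR",
    "Nov  7 13:08:10 embpc0032 kernel: [ 4525.820084] EDAC MC0: 44 CE memory read error "
    "on CPU_SrcID#0_Ha#0_Chan#3_DIMM#1 (channel:3 slot:1 page:0xb382f offset:0xcc0 grain:32",
]

for (channel, slot), count in sorted(tally_corrected_errors(SAMPLE).items()):
    print(f"channel {channel} slot {slot}: {count} corrected errors")
```

To run it over a whole log, pass the open file object instead of `SAMPLE` (on Ubuntu 16.04 kernel messages normally end up in `/var/log/syslog`). The script only groups by the channel and slot numbers the kernel reports; how those numbers map to the silkscreen labels on the board still has to be checked against the manual.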

Edit: I came in to work today and booted the computer. This is the output of lshw:

*-memory
      description: System Memory
      physical id: 1e
      slot: System board or motherboard
      size: 16GiB
    *-bank:0
         description: DIMM DDR3 800 MHz (1,2 ns)
         product: HMT41GR7AFR8C
         vendor: Hynix Semiconducto
         physical id: 0
         serial: 50404146
         slot: Node0_Dimm0
         size: 8GiB
         width: 64 bits
         clock: 800MHz (1.2ns)
    *-bank:1
         description: DIMM DDR3 800 MHz (1,2 ns)
         vendor: Undefined
         physical id: 1
         serial: 00000000
         slot: Node0_Dimm1
         size: 8GiB
         width: 64 bits
         clock: 800MHz (1.2ns)
    *-bank:2
         description: DIMM Synchronous [empty]
         product: Dimm2_PartNum
         vendor: Dimm2_Manufacturer
         physical id: 2
         serial: Dimm2_SerNum
         slot: Node0_Dimm2
         width: 64 bits
    *-bank:3
         description: DIMM Synchronous [empty]
         product: Dimm3_PartNum
         vendor: Dimm3_Manufacturer
         physical id: 3
         serial: Dimm3_SerNum
         slot: Node0_Dimm3
         width: 64 bits
    *-bank:4
         description: DIMM Synchronous [empty]
         product: Dimm4_PartNum
         vendor: Dimm4_Manufacturer
         physical id: 4
         serial: Dimm4_SerNum
         slot: Node0_Dimm4
         width: 64 bits
    *-bank:5
         description: DIMM Synchronous [empty]
         product: Dimm5_PartNum
         vendor: Dimm5_Manufacturer
         physical id: 5
         serial: Dimm5_SerNum
         slot: Node0_Dimm5
         width: 64 bits
    *-bank:6
         description: DIMM Synchronous [empty]
         product: Dimm6_PartNum
         vendor: Dimm6_Manufacturer
         physical id: 6
         serial: Dimm6_SerNum
         slot: Node0_Dimm6
         width: 64 bits
    *-bank:7
         description: DIMM Synchronous [empty]
         product: Dimm7_PartNum
         vendor: Dimm7_Manufacturer
         physical id: 7
         serial: Dimm7_SerNum
         slot: Node0_Dimm7
         width: 64 bits

After a reboot, the output of lshw looks like this:

*-memory
      description: System Memory
      physical id: 1e
      slot: System board or motherboard
      size: 48GiB
    *-bank:0
         description: DIMM DDR3 1866 MHz (0,5 ns)
         product: HMT41GR7AFR8C
         vendor: Hynix Semiconducto
         physical id: 0
         serial: 50404146
         slot: Node0_Dimm0
         size: 8GiB
         width: 64 bits
         clock: 1866MHz (0.5ns)
    *-bank:1
         description: DIMM DDR3 1866 MHz (0,5 ns)
         vendor: Undefined
         physical id: 1
         serial: 00000000
         slot: Node0_Dimm1
         size: 8GiB
         width: 64 bits
         clock: 1866MHz (0.5ns)
    *-bank:2
         description: DIMM Synchronous [empty]
         product: Dimm2_PartNum
         vendor: Dimm2_Manufacturer
         physical id: 2
         serial: Dimm2_SerNum
         slot: Node0_Dimm2
         width: 64 bits
    *-bank:3
         description: DIMM Synchronous [empty]
         product: Dimm3_PartNum
         vendor: Dimm3_Manufacturer
         physical id: 3
         serial: Dimm3_SerNum
         slot: Node0_Dimm3
         width: 64 bits
    *-bank:4
         description: DIMM DDR3 1866 MHz (0,5 ns)
         product: HMT41GR7AFR8C
         vendor: Hynix Semiconducto
         physical id: 4
         serial: 50404181
         slot: Node0_Dimm4
         size: 8GiB
         width: 64 bits
         clock: 1866MHz (0.5ns)
    *-bank:5
         description: DIMM DDR3 1866 MHz (0,5 ns)
         vendor: Undefined
         physical id: 5
         serial: 00000000
         slot: Node0_Dimm5
         size: 8GiB
         width: 64 bits
         clock: 1866MHz (0.5ns)
    *-bank:6
         description: DIMM DDR3 1866 MHz (0,5 ns)
         product: HMT41GR7AFR8C
         vendor: Hynix Semiconducto
         physical id: 6
         serial: 50404153
         slot: Node0_Dimm6
         size: 8GiB
         width: 64 bits
         clock: 1866MHz (0.5ns)
    *-bank:7
         description: DIMM DDR3 1866 MHz (0,5 ns)
         vendor: Undefined
         physical id: 7
         serial: 00000000
         slot: Node0_Dimm7
         size: 8GiB
         width: 64 bits
         clock: 1866MHz (0.5ns)

Note that the two modules recognised on the first boot are listed with different specifications than after the reboot (they are actually 1866 MHz).
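Comparing two such listings by eye is tedious. As a rough sketch, the text output can be reduced to a map of bank number to reported size and then diffed between boots. The parsing assumes the plain-text layout shown above (`*-bank:N` headers followed by an optional `size:` line); the two sample strings are heavily abbreviated excerpts of the listings, and the function names are made up for this example.

```python
import re

BANK_HEADER = re.compile(r"\*-bank:(\d+)")


def populated_banks(lshw_text):
    """Map bank number -> size string, or None when no size is reported."""
    banks = {}
    current = None
    for raw in lshw_text.splitlines():
        line = raw.strip()
        header = BANK_HEADER.match(line)
        if header:
            current = int(header.group(1))
            banks[current] = None
        elif current is not None and line.startswith("size:"):
            # The "size:" line of the *-memory node comes before any bank,
            # so it is skipped while current is still None.
            banks[current] = line.split(":", 1)[1].strip()
    return banks


def changed_banks(before, after):
    """Bank numbers whose reported size differs between two boots."""
    return sorted(b for b in before.keys() | after.keys()
                  if before.get(b) != after.get(b))


FIRST_BOOT = """
*-memory
      size: 16GiB
    *-bank:0
         slot: Node0_Dimm0
         size: 8GiB
    *-bank:4
         description: DIMM Synchronous [empty]
         slot: Node0_Dimm4
"""

SECOND_BOOT = """
*-memory
      size: 48GiB
    *-bank:0
         slot: Node0_Dimm0
         size: 8GiB
    *-bank:4
         slot: Node0_Dimm4
         size: 8GiB
"""

print(changed_banks(populated_banks(FIRST_BOOT), populated_banks(SECOND_BOOT)))
```

Run against the full listings, this would show which banks appear and disappear between boots, which helps narrow down which modules or slots are being detected unreliably.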

Answer 1

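To troubleshoot this problem...

  1. First, reseat all of the memory modules
  2. Run the free memory test from memtest86.com
  3. Rearrange the memory modules into the correct slots
  4. Run the memtest86 test again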

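Reseating

  • Shut down the computer
  • Touch the metal chassis to dissipate any static charge
  • Unplug the AC power cord
  • Press the power switch to drain any charge remaining in the power supply
  • Remove and reinstall all of the memory modules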

Memtest86

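  • Go to memtest86.com and download the free memory test
  • Run at least one complete pass, and more if you have time
  • If it fails, start removing memory modules 2 at a time and retest
  • If it does not fail, read the next section on memory configuration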

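Configuration

Memory interleaving is a modern technique for speeding up memory access. It requires memory to be installed in matched pairs of identical modules. Your high-end system appears to have 4 memory channels... A/B/C/D.

Take out all of the memory modules. Populate all of the 1 positions first with the 4 original modules, then use the 4 new modules to fill the remaining 2 positions.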
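The layout described above can be written down as a simple slot plan. This is only an illustrative sketch: the slot labels (A1 through D2) follow the channel letters and position numbers mentioned here and should be checked against the board manual, and the module names are placeholders, not real part numbers.

```python
def population_plan(original_modules, new_modules, channels="ABCD"):
    """Put one original module in position 1 and one new module in position 2
    of every channel, so each channel holds the same mix of modules."""
    if len(original_modules) != len(channels) or len(new_modules) != len(channels):
        raise ValueError("need exactly one original and one new module per channel")
    plan = {}
    for channel, old, new in zip(channels, original_modules, new_modules):
        plan[f"{channel}1"] = old
        plan[f"{channel}2"] = new
    return plan


# Placeholder module names, for illustration only.
plan = population_plan(
    ["old-1", "old-2", "old-3", "old-4"],
    ["new-1", "new-2", "new-3", "new-4"],
)
for slot in sorted(plan):
    print(slot, plan[slot])
```

Writing the plan down before opening the case also makes it easier to record which physical module ends up in which slot, which helps when matching later error reports to a specific DIMM.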

Rerun the memtest86 test.
