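We have several Dell machines running RHEL 7.6. Because we saw errors in the kernel messages, we replaced the DIMMs in those machines.
After some time we checked the kernel messages again and found the following. They still show RAM errors (this also relates to the Red Hat solution at https://access.redhat.com/solutions/6961932):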
[Mon May 8 21:08:01 2023] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1683580080 SOCKET 0 APIC 0
[Mon May 8 21:08:01 2023] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#1 (channel:1 slot:1 page:0x6f3c77 offset:0xc80 grain:32 syndrome:0x0 - area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:2 rank:4)
[Mon May 8 21:08:21 2023] mce: [Hardware Error]: Machine check events logged
[Tue May 9 05:30:29 2023] {13}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[Tue May 9 05:30:29 2023] {13}[Hardware Error]: It has been corrected by h/w and requires no further action
[Tue May 9 05:30:29 2023] {13}[Hardware Error]: event severity: corrected
[Tue May 9 05:30:29 2023] {13}[Hardware Error]: Error 0, type: corrected
[Tue May 9 05:30:29 2023] {13}[Hardware Error]: fru_text: B6
[Tue May 9 05:30:29 2023] {13}[Hardware Error]: section_type: memory error
[Tue May 9 05:30:29 2023] {13}[Hardware Error]: error_status: 0x0000000000000400
[Tue May 9 05:30:29 2023] {13}[Hardware Error]: physical_address: 0x000000446e0d5f00
[Tue May 9 05:30:29 2023] {13}[Hardware Error]: node: 1 card: 1 module: 1 rank: 0 bank: 3 row: 64982 column: 888
[Tue May 9 05:30:29 2023] {13}[Hardware Error]: error_type: 2, single-bit ECC
[Tue May 9 05:30:29 2023] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[Tue May 9 05:30:29 2023] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 1: 940000000000009f
[Tue May 9 05:30:29 2023] EDAC sbridge MC0: TSC 30d2ef7e9bfda
[Tue May 9 05:30:29 2023] EDAC sbridge MC0: ADDR 446e0d5f00
[Tue May 9 05:30:29 2023] EDAC sbridge MC0: MISC 0
[Tue May 9 05:30:29 2023] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1683610228 SOCKET 0 APIC 0
[Tue May 9 05:30:29 2023] EDAC MC1: 0 CE memory read error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#1 (channel:1 slot:1 page:0x446e0d5 offset:0xf00 grain:32 syndrome:0x0 - area:DRAM err_code:0000:009f socket:1 ha:0 channel_mask:2 rank:4)
[Tue May 9 05:30:51 2023] mce: [Hardware Error]: Machine check events logged
[Tue May 9 17:52:21 2023] perf: interrupt took too long (380026 > 7861), lowering kernel.perf_event_max_sample_rate to 1000
[Wed May 10 06:27:17 2023] warning: `lshw' uses legacy ethtool link settings API, link modes are only partially reported
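To make sure these messages were not a one-off, we rebooted the machine to see whether the memory errors would reappear.
They did: the RAM error messages are still there.
So we are confused by what the kernel messages are telling us: why do RAM errors still appear after we replaced the DIMMs?
I should add what we see in iDRAC: it does not report any problem with the DIMMs or RAM.
So the question is: why does dmesg (the kernel log) still complain about RAM even though all DIMMs were replaced?
Could something other than the DIMMs be at fault, for example the motherboard of the Dell machine?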
Answer 1
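The errors you are seeing are single-bit ECC correctable memory errors, and the hardware has already corrected them. Errors of this kind do not cause iDRAC to flag a failed component, at least not until their count exceeds some internally defined threshold, but you should still see this memory error recorded in the iDRAC SEL (System Event Log).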
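
As a rough cross-check of the "corrected" reading, the raw values in the log can be decoded by hand. The sketch below takes the status word from the line "Bank 1: 940000000000009f" and the address from "ADDR 446e0d5f00". The bit positions are my understanding of the Intel machine-check status layout, the 4 KiB page size is an assumption, and the function names (`decode_status`, `split_address`) are made up for illustration.

```python
# Status word copied from the log line "Machine Check Event: 0 Bank 1: 940000000000009f".
STATUS = 0x940000000000009F

# High flag bits of the machine-check status register (assumed Intel layout).
FLAGS = {
    63: "VAL (register contents valid)",
    62: "OVER (an earlier error was overwritten)",
    61: "UC (uncorrected error)",
    60: "EN (error reporting enabled)",
    59: "MISCV (MISC register valid)",
    58: "ADDRV (ADDR register valid)",
    57: "PCC (processor context corrupt)",
}

# Request type (MMM) in a memory-controller error code of the form 0000_0000_1MMM_CCCC.
MEM_REQUEST = {
    0: "generic",
    1: "memory read",
    2: "memory write",
    3: "address/command",
    4: "memory scrubbing",
}


def decode_status(status: int) -> dict:
    """Return the set flags and, for memory-controller codes, the request type."""
    set_flags = [name for bit, name in FLAGS.items() if (status >> bit) & 1]
    mca_code = status & 0xFFFF
    result = {
        "flags": set_flags,
        "corrected": not ((status >> 61) & 1),  # UC clear means corrected
        "mca_code": mca_code,
    }
    # Memory-controller errors have bits 15..8 clear and bit 7 set.
    if (mca_code & 0xFF80) == 0x0080:
        result["request"] = MEM_REQUEST.get((mca_code >> 4) & 0x7, "reserved")
        channel = mca_code & 0xF
        result["channel"] = None if channel == 0xF else channel  # 0xF: not specified
    return result


def split_address(addr: int, page_shift: int = 12) -> tuple:
    """Split a physical address into (page, offset), assuming 4 KiB pages."""
    return addr >> page_shift, addr & ((1 << page_shift) - 1)


info = decode_status(STATUS)
page, offset = split_address(0x446E0D5F00)
print(info)
print(hex(page), hex(offset))
```

With these inputs the UC bit is clear and the error code decodes as a memory read, which agrees with the "CE memory read error" text, and the page/offset pair matches the `page:0x446e0d5 offset:0xf00` fields in the EDAC MC1 line.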
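Mixing single-rank and dual-rank modules is not recommended, though your mileage may vary depending on the processor and motherboard revision.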
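
To see whether corrected errors keep landing on the same controller and slot after a DIMM swap, it can help to tally the EDAC "CE" lines per DIMM label. This is only a minimal sketch: the regular expression is modelled on the two EDAC lines quoted in the question and may not match other kernel versions, and `count_corrected_errors` is a hypothetical helper name.

```python
import re
from collections import Counter

# Modelled on lines such as:
# "EDAC MC1: 0 CE memory read error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#1 (channel:1 ..."
CE_LINE = re.compile(r"EDAC (MC\d+): \d+ CE .*? on (\S+) \(")


def count_corrected_errors(dmesg_text: str) -> Counter:
    """Count corrected-error lines per (memory controller, DIMM label)."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        match = CE_LINE.search(line)
        if match:
            controller, label = match.groups()
            counts[(controller, label)] += 1
    return counts


# The two EDAC summary lines from the question, plus one unrelated line.
SAMPLE = """\
[Mon May 8 21:08:01 2023] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#1 (channel:1 slot:1 page:0x6f3c77 offset:0xc80 grain:32 syndrome:0x0 - area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:2 rank:4)
[Mon May 8 21:08:21 2023] mce: [Hardware Error]: Machine check events logged
[Tue May 9 05:30:29 2023] EDAC MC1: 0 CE memory read error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#1 (channel:1 slot:1 page:0x446e0d5 offset:0xf00 grain:32 syndrome:0x0 - area:DRAM err_code:0000:009f socket:1 ha:0 channel_mask:2 rank:4)
"""

counts = count_corrected_errors(SAMPLE)
for (controller, label), n in sorted(counts.items()):
    print(controller, label, n)
```

In practice the text would come from saved `dmesg` output rather than the inline sample. If the counts keep growing for one slot even after the module in it has been replaced, that would be a reason to look at the slot, the channel, or the board rather than the DIMM itself.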