MCE errors, but no edac-util errors?

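I have an older HP Z440 tower with 4x8GB of ECC DDR4, running Proxmox VE 6.4. Recently it has started logging an MCE error every few seconds. I installed rasdaemon and can see that they are memory read errors. However, edac-util shows no sign of a problem. Memtest passes, but I understand that is normal for correctable errors.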

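The machine has a single CPU socket, and the DIMMs are installed in slots 1, 3, 6, and 8 (which appears to be the preferred layout for this model).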

Do I actually have memory errors? How can I troubleshoot this further?

dmesg:

root@pve:~# dmesg
...
[ 5729.899255] mce_notify_irq: 20 callbacks suppressed
[ 5729.899260] mce: [Hardware Error]: Machine check events logged
[ 5732.907207] mce: [Hardware Error]: Machine check events logged
[ 5792.907319] mce_notify_irq: 19 callbacks suppressed
[ 5792.907323] mce: [Hardware Error]: Machine check events logged
[ 5793.899247] mce: [Hardware Error]: Machine check events logged
[ 5852.911342] mce_notify_irq: 11 callbacks suppressed
[ 5852.911347] mce: [Hardware Error]: Machine check events logged
[ 5853.903354] mce: [Hardware Error]: Machine check events logged

Errors from rasdaemon:

root@pve:~# ras-mc-ctl --errors | tail
1435 2023-05-12 14:58:05 -0500 error: MEMORY CONTROLLER RD_CHANNEL1_ERR Transaction: Memory read error, mcg mcgstatus=0, mci Error_overflow Corrected_error, n_errors=5, mcgcap=0x07000c16, status=0xcc00014000010091, addr=0x4ccdc28c0, misc=0x40484886, walltime=0x645e9a4e, cpuid=0x000306f2, bank=0x00000007
1436 2023-05-12 14:58:06 -0500 error: MEMORY CONTROLLER RD_CHANNEL1_ERR Transaction: Memory read error, mcg mcgstatus=0, mci Error_overflow Corrected_error, n_errors=8, mcgcap=0x07000c16, status=0xcc00020000010091, addr=0x4d5c831c0, misc=0x140383886, walltime=0x645e9a4f, cpuid=0x000306f2, bank=0x00000007
1437 2023-05-12 14:58:09 -0500 error: MEMORY CONTROLLER RD_CHANNEL1_ERR Transaction: Memory read error, mcg mcgstatus=0, mci Error_overflow Corrected_error, n_errors=2, mcgcap=0x07000c16, status=0xcc00008000010091, addr=0x4ccdc28c0, misc=0x403aba86, walltime=0x645e9a52, cpuid=0x000306f2, bank=0x00000007
1438 2023-05-12 14:58:11 -0500 error: MEMORY CONTROLLER RD_CHANNEL1_ERR Transaction: Memory read error, mcg mcgstatus=0, mci Error_overflow Corrected_error, n_errors=2, mcgcap=0x07000c16, status=0xcc00008000010091, addr=0x6fd8eee80, misc=0x140282886, walltime=0x645e9a54, cpuid=0x000306f2, bank=0x00000007
1439 2023-05-12 14:58:12 -0500 error: MEMORY CONTROLLER RD_CHANNEL1_ERR Transaction: Memory read error, mcg mcgstatus=0, mci Error_overflow Corrected_error, n_errors=2, mcgcap=0x07000c16, status=0xcc00008000010091, addr=0x510122800, misc=0x140282886, walltime=0x645e9a55, cpuid=0x000306f2, bank=0x00000007
1440 2023-05-12 14:58:13 -0500 error: MEMORY CONTROLLER RD_CHANNEL1_ERR Transaction: Memory read error, mcg mcgstatus=0, mci Error_overflow Corrected_error, n_errors=4, mcgcap=0x07000c16, status=0xcc00010000010091, addr=0x4ea312a80, misc=0x1403c3c86, walltime=0x645e9a56, cpuid=0x000306f2, bank=0x00000007
1441 2023-05-12 14:58:16 -0500 error: MEMORY CONTROLLER RD_CHANNEL1_ERR Transaction: Memory read error, mcg mcgstatus=0, mci Corrected_error, n_errors=1, mcgcap=0x07000c16, status=0x8c00004000010091, addr=0x4ea342a80, misc=0x1403aba86, walltime=0x645e9a59, cpuid=0x000306f2, bank=0x00000007
1442 2023-05-12 14:58:17 -0500 error: MEMORY CONTROLLER RD_CHANNEL1_ERR Transaction: Memory read error, mcg mcgstatus=0, mci Corrected_error, n_errors=1, mcgcap=0x07000c16, status=0x8c00004000010091, addr=0x50abf2900, misc=0x1404c4c86, walltime=0x645e9a5a, cpuid=0x000306f2, bank=0x00000007
1443 2023-05-12 14:58:18 -0500 error: MEMORY CONTROLLER RD_CHANNEL1_ERR Transaction: Memory read error, mcg mcgstatus=0, mci Error_overflow Corrected_error, n_errors=8, mcgcap=0x07000c16, status=0xcc00020000010091, addr=0x52676fbc0, misc=0x140585886, walltime=0x645e9a5b, cpuid=0x000306f2, bank=0x00000007
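
As a cross-check on the records above, the `status=` value can be decoded by hand. The following is a minimal sketch, assuming the architectural IA32_MCi_STATUS layout from the Intel SDM (bit 63 valid, bit 62 overflow, bit 61 uncorrected, bits 52:38 corrected error count, bits 15:0 MCA error code). The function name `decode_mci_status` and the dictionary keys are made up for illustration and are not part of rasdaemon.

```python
# Memory-controller operation encoded in bits 6:4 of the MCA error code
# (compound code format 0000 0000 1MMM CCCC, per the Intel SDM).
MEM_OPS = {
    0: "generic undefined request",
    1: "memory read error",
    2: "memory write error",
    3: "address/command error",
    4: "memory scrubbing error",
}


def decode_mci_status(status: int) -> dict:
    """Split an IA32_MCi_STATUS value into a few architectural fields.

    Illustrative helper, not a rasdaemon API.
    """
    def bit(n: int) -> bool:
        return bool((status >> n) & 1)

    mca_code = status & 0xFFFF
    fields = {
        "valid": bit(63),
        "overflow": bit(62),
        "uncorrected": bit(61),
        "misc_valid": bit(59),
        "addr_valid": bit(58),
        # Assumes the CPU reports a corrected-error count in bits 52:38.
        "corrected_count": (status >> 38) & 0x7FFF,
        "mca_code": mca_code,
    }
    # Memory controller errors have bit 7 set and bits 15:8 clear.
    if (mca_code & 0xFF80) == 0x0080:
        fields["operation"] = MEM_OPS.get((mca_code >> 4) & 0x7, "reserved")
        fields["channel"] = mca_code & 0xF  # 0xF means "not specified"
    return fields


if __name__ == "__main__":
    # Two status values copied from the rasdaemon output above.
    for raw in ("0xcc00014000010091", "0x8c00004000010091"):
        print(raw, decode_mci_status(int(raw, 16)))
```

Decoded this way, the values agree with what rasdaemon prints: the uncorrected bit is clear, the operation is a memory read on channel 1, and the count field matches `n_errors`.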

edac reports no errors:

root@pve:~# edac-util -v
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
edac-util: No errors to report.

root@pve:/sys/devices/system/edac/mc# tail -n +1 mc*/ce_* mc*/dimm*/dimm_ce_count
==> mc0/ce_count <==
0

==> mc0/ce_noinfo_count <==
0

==> mc0/dimm0/dimm_ce_count <==
0

==> mc0/dimm3/dimm_ce_count <==
0

==> mc0/dimm6/dimm_ce_count <==
0

==> mc0/dimm9/dimm_ce_count <==
0
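
The same counters can be polled from a script, which is handy for checking whether they ever move while the MCE messages keep arriving. This is a rough sketch that assumes the sysfs layout shown in the `tail` output above; `read_ce_counts` and its `root` parameter are illustrative names, not an existing tool.

```python
from pathlib import Path

# Path taken from the shell prompt in the question.
EDAC_ROOT = "/sys/devices/system/edac/mc"

# The same files the tail command above reads.
PATTERNS = ("mc*/ce_count", "mc*/ce_noinfo_count", "mc*/dimm*/dimm_ce_count")


def read_ce_counts(root: str = EDAC_ROOT) -> dict:
    """Return {relative sysfs path: corrected-error count} (illustrative helper)."""
    base = Path(root)
    counts = {}
    for pattern in PATTERNS:
        for path in sorted(base.glob(pattern)):
            try:
                counts[str(path.relative_to(base))] = int(path.read_text().strip())
            except (OSError, ValueError):
                # Skip counters that are unreadable or not a plain integer.
                continue
    return counts


if __name__ == "__main__":
    counts = read_ce_counts()
    for name, value in counts.items():
        print(f"{name}: {value}")
    print("total corrected errors:", sum(counts.values()))
```

On the machine in the question this would print zeros throughout, matching the edac-util result.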

Answer 1

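My understanding is that edac-utils stopped working after the HERM (Hardware Event Report Mechanism) update broke its functionality, because it relied on memory error counters being exposed to userspace. Memory errors now stay in the kernel, and a userspace daemon (rasdaemon) has to collect them. So edac-utils reports no errors because nothing is being reported in the place where it expects to find them.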

There is a somewhat more detailed explanation on rasdaemon's GitHub page: https://github.com/mchehab/rasdaemon. But to answer your question: yes, you most likely do have memory errors.
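
For further troubleshooting, it can help to total the rasdaemon records by error label, bank, and address to see whether they cluster. Below is a minimal sketch that parses lines in the format printed by `ras-mc-ctl --errors` in the question; the regular expression and the `summarize` helper are assumptions written for this output, not part of rasdaemon.

```python
import re
from collections import Counter

# Matches the fields of interest in one line of the output shown in the question.
LINE_RE = re.compile(
    r"MEMORY CONTROLLER (?P<label>\S+).*?"
    r"n_errors=(?P<n>\d+).*?"
    r"addr=(?P<addr>0x[0-9a-fA-F]+).*?"
    r"bank=(?P<bank>0x[0-9a-fA-F]+)"
)

# Two records copied verbatim from the question.
SAMPLE = [
    "1435 2023-05-12 14:58:05 -0500 error: MEMORY CONTROLLER RD_CHANNEL1_ERR Transaction: Memory read error, mcg mcgstatus=0, mci Error_overflow Corrected_error, n_errors=5, mcgcap=0x07000c16, status=0xcc00014000010091, addr=0x4ccdc28c0, misc=0x40484886, walltime=0x645e9a4e, cpuid=0x000306f2, bank=0x00000007",
    "1437 2023-05-12 14:58:09 -0500 error: MEMORY CONTROLLER RD_CHANNEL1_ERR Transaction: Memory read error, mcg mcgstatus=0, mci Error_overflow Corrected_error, n_errors=2, mcgcap=0x07000c16, status=0xcc00008000010091, addr=0x4ccdc28c0, misc=0x403aba86, walltime=0x645e9a52, cpuid=0x000306f2, bank=0x00000007",
]


def summarize(lines):
    """Sum n_errors per (label, bank) and per address (illustrative helper)."""
    by_source = Counter()
    by_addr = Counter()
    for line in lines:
        m = LINE_RE.search(line)
        if m is None:
            continue  # not an MCE memory-controller record
        n = int(m["n"])
        by_source[(m["label"], int(m["bank"], 16))] += n
        by_addr[m["addr"]] += n
    return by_source, by_addr


if __name__ == "__main__":
    by_source, by_addr = summarize(SAMPLE)
    for (label, bank), total in by_source.most_common():
        print(f"{label} bank {bank}: {total} corrected errors")
    for addr, total in by_addr.most_common(5):
        print(f"{addr}: {total}")
```

In the excerpt posted in the question, every record is `RD_CHANNEL1_ERR` on bank 7, which points at a single memory channel. Which physical DIMM slot that channel corresponds to on this board is not something the output states.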
