Proxmox system crashes daily

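My system keeps crashing in the early morning, somewhere between 00:00 and 08:00.

Looking at the logs, every core seems to slow to a crawl and eventually fail, until the whole system freezes. It takes about 25 minutes from the first error to the crash. The errors make me think of a memory problem, but I'm not sure whether they are just a side effect of some other failure.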

The system is a Supermicro X9DRW-iF with dual E5-2630 v2 CPUs and 16 x 8GB DDR3.

The OS is the latest Proxmox.

Kernel:
Linux pve1 4.15.18-10-pve #1 SMP PVE 4.15.18-32 (Sat, 19 Jan 2019)

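The perf interrupts start taking longer and longer, and memory errors start to occur. About 20 of these errors show up in the 20 minutes to 1 hour before the system freezes. As far as I understand it, the perf message is just CPU throttling: it throttles down to the lowest possible speed, at which point the system slows to a crawl.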

Apr 28 07:36:05 pve1 kernel: [36497.018818] perf: interrupt took too long (6737393 > 4247631), lowering kernel.perf_event_max_sample_rate to 250
Apr 28 07:36:05 pve1 kernel: [36497.018914] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
Apr 28 07:36:05 pve1 kernel: [36497.018926] {1}[Hardware Error]: It has been corrected by h/w and requires no further action
Apr 28 07:36:05 pve1 kernel: [36497.019012] {1}[Hardware Error]: event severity: corrected
Apr 28 07:36:05 pve1 kernel: [36497.019112] {1}[Hardware Error]:  Error 0, type: corrected
Apr 28 07:36:05 pve1 kernel: [36497.019115] {1}[Hardware Error]:  fru_text: CorrectedErr
Apr 28 07:36:05 pve1 kernel: [36497.019119] {1}[Hardware Error]:   section_type: memory error
Apr 28 07:36:05 pve1 kernel: [36497.019125] {1}[Hardware Error]:   node: 1 device: 0 
Apr 28 07:36:05 pve1 kernel: [36497.019128] {1}[Hardware Error]:   error_type: 2, single-bit ECC
Apr 28 07:36:05 pve1 kernel: [36497.019297] ghes_edac: Internal error: Can't find EDAC structure
Apr 28 07:36:06 pve1 pve-firewall[2311]: firewall update time (13.994 seconds)
Apr 28 07:36:10 pve1 kernel: [36502.054892] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 451.489 msecs
Apr 28 07:36:17 pve1 pve-firewall[2311]: firewall update time (9.985 seconds)
Apr 28 07:36:20 pve1 pvestatd[2315]: got timeout
Apr 28 07:36:26 pve1 pvestatd[2315]: status update time (33.041 seconds)
Apr 28 07:36:28 pve1 pve-firewall[2311]: firewall update time (11.073 seconds)
Apr 28 07:36:50 pve1 kernel: [36542.038771] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 451.686 msecs
Apr 28 07:36:56 pve1 pve-firewall[2311]: firewall update time (27.943 seconds)
Apr 28 07:36:56 pve1 pvestatd[2315]: status update time (30.979 seconds)
Apr 28 07:37:03 pve1 pve-firewall[2311]: firewall update time (6.031 seconds)

https://pastebin.com/9Z0A49xR

At this point I just want to understand what is actually happening.

Answer 1

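My colocation host pulled the server and reseated all of the DIMMs.

It has not happened again for 3 days now. It looks like, without a solid connection, the memory link slowed down and more and more data got corrupted, until the system crashed while trying to skip the bad addresses. A loose connection can also cause the memory voltage to sag, which leads to fluctuations. The voltage stayed within range, but it became unstable while this was happening.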
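If you want to check whether the errors really stopped after a reseat, a rough way is to tally the memory error reports per node/device from the kernel log. The sketch below is only an illustration: it assumes the log lines look like the excerpt in the question (a "section_type: memory error" line followed by a "node: N device: M" line), and the function name count_memory_errors is my own, not part of any tool.

```python
import re
from collections import Counter

# Matches the location line from the question's log, e.g.
# "{1}[Hardware Error]:   node: 1 device: 0"
LOCATION = re.compile(r"\[Hardware Error\]:\s+node:\s*(\d+)\s+device:\s*(\d+)")


def count_memory_errors(lines):
    """Tally memory error reports per (node, device) from kernel log lines.

    Assumes the layout seen in the question: a 'section_type: memory error'
    line, followed shortly by a 'node: N device: M' line.
    """
    counts = Counter()
    in_memory_section = False
    for line in lines:
        if "section_type:" in line:
            # Only count locations that belong to a memory error section.
            in_memory_section = "memory error" in line
        elif in_memory_section:
            match = LOCATION.search(line)
            if match:
                node, device = int(match.group(1)), int(match.group(2))
                counts[(node, device)] += 1
                in_memory_section = False
    return counts
```

You could feed it the output of journalctl -k (for example by passing sys.stdin to the function) and compare the counts from before and after the reseat. If one node/device pair keeps accumulating errors, that DIMM or slot is the likely suspect.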
