How to determine the cause of random reboots on a Fedora Linux server

2024-5-31

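I have a Fedora 23 Linux machine running as a development server, and several times over the past few weeks I have logged in to find that it had reset. On one occasion it rebooted right in front of me: it appeared to reset to the BIOS and then powered back on.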

This seems to happen every 2 or 3 days. The server logs show only normal operation (cron and so on) right up to the reset and reboot:

https://paste.fedoraproject.org/518600/33737531/

Jan 01 20:01:02 pc03.config run-parts[19540]: (/etc/cron.hourly) starting mcelog.cron
Jan 01 20:01:02 pc03.config run-parts[19544]: (/etc/cron.hourly) finished mcelog.cron
Jan 01 20:09:10 pc03.config puppet-agent[19565]: Applied catalog in 0.03 seconds
-- Reboot --
Jan 01 20:17:57 pc03.config systemd-journal[372]: Runtime journal is using 8.0M (max allowed 1.5G, trying to leave 2.3G free of 15.6G available → current limit 1.5G).
Jan 01 20:17:57 pc03.config systemd-journal[372]: Runtime journal is using 8.0M (max allowed 1.5G, trying to leave 2.3G free of 15.6G available → current limit 1.5G).
Jan 01 20:17:57 pc03.config kernel: Linux version 4.8.13-100.fc23.x86_64 ([email protected]) (gcc version 5.3.1 20160406 (Red Hat 5.3.1-6) (GCC) ) #1 SMP Fri Dec 9 14:51:40 UTC 2016
Jan 01 20:17:57 pc03.config kernel: Command line: BOOT_IMAGE=/vmlinuz-4.8.13-100.fc23.x86_64 root=/dev/mapper/fedora_pc03-root ro rd.lvm.lv=fedora_pc03/root rd.lvm.lv=fedora_pc03/swap rhgb quiet nouveau.modeset=0 rd.driver.blacklist=nouveau video=vesa:off LANG=en_GB.UTF-8
Jan 01 20:17:57 pc03.config kernel: x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
Jan 01 20:17:57 pc03.config kernel: x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
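
To check what was logged immediately before each reset, the excerpt above can be scanned for the `-- Reboot --` marker. The following is a minimal sketch, assuming the log has been saved as text in the short format shown here. The function names `parse_stamp` and `reboot_gaps` are hypothetical, and because this timestamp format carries no year, the year is an assumption the caller has to supply.

```python
from datetime import datetime

REBOOT_MARKER = "-- Reboot --"


def parse_stamp(line, year):
    """Parse the leading 'Mon DD HH:MM:SS' stamp of a log line.

    The short log format omits the year, so `year` is an assumption
    supplied by the caller.
    """
    stamp = " ".join(line.split()[:3])
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def reboot_gaps(lines, year):
    """Yield (last_line_before, first_line_after, gap_seconds) per reboot marker."""
    cleaned = [ln.rstrip("\n") for ln in lines if ln.strip()]
    for i, line in enumerate(cleaned):
        if line.strip() != REBOOT_MARKER:
            continue
        if i == 0 or i + 1 >= len(cleaned):
            continue  # marker at the very start or end: nothing to compare
        before, after = cleaned[i - 1], cleaned[i + 1]
        gap = parse_stamp(after, year) - parse_stamp(before, year)
        yield before, after, gap.total_seconds()
```

For the excerpt above, the last message before the marker is stamped 20:09:10 and the first one after it 20:17:57. That gap includes any quiet time before the reset, so it is only an upper bound on how long the machine was actually down.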

However, the log does seem to contain a lot of messages like this:

Jan 01 17:05:20 pc03.config kernel: {680}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
Jan 01 17:05:20 pc03.config kernel: {680}[Hardware Error]: It has been corrected by h/w and requires no further action
Jan 01 17:05:20 pc03.config kernel: {680}[Hardware Error]: event severity: corrected
Jan 01 17:05:20 pc03.config kernel: {680}[Hardware Error]:  Error 0, type: corrected
Jan 01 17:05:20 pc03.config kernel: {680}[Hardware Error]:  fru_text: CorrectedErr
Jan 01 17:05:20 pc03.config kernel: {680}[Hardware Error]:   section_type: PCIe error
Jan 01 17:05:20 pc03.config kernel: {680}[Hardware Error]:   port_type: 0, PCIe end point
Jan 01 17:05:20 pc03.config kernel: {680}[Hardware Error]:   version: 0.0
Jan 01 17:05:20 pc03.config kernel: {680}[Hardware Error]:   command: 0xffff, status: 0xffff
Jan 01 17:05:20 pc03.config kernel: {680}[Hardware Error]:   device_id: 0000:80:02.3
Jan 01 17:05:20 pc03.config kernel: {680}[Hardware Error]:   slot: 0
Jan 01 17:05:20 pc03.config kernel: {680}[Hardware Error]:   secondary_bus: 0x00
Jan 01 17:05:20 pc03.config kernel: {680}[Hardware Error]:   vendor_id: 0xffff, device_id: 0xffff
Jan 01 17:05:20 pc03.config kernel: {680}[Hardware Error]:   class_code: ffffff
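
Since there are many of these records, it may help to count them per reporting device and severity, and to see whether the count climbs before a reset. This is a rough sketch, assuming the kernel lines keep the layout shown above. The function name `tally_hardware_errors` is hypothetical, and the patterns were written only against this one sample record.

```python
import re
from collections import Counter

# Captures the record number in braces and the text after "[Hardware Error]:".
RECORD = re.compile(r"\{(\d+)\}\[Hardware Error\]:\s*(.*)$")
SEVERITY = re.compile(r"^event severity:\s*(\w+)")
# Matches only a full PCI address (domain:bus:device.function), so the
# "vendor_id: 0xffff, device_id: 0xffff" line is ignored.
PCI_ADDR = re.compile(
    r"^device_id:\s*([0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7])"
)


def tally_hardware_errors(lines):
    """Count hardware error records by (PCI address, severity).

    Records are grouped by the number in braces. Assumption: that number
    is unique within the input, which may not hold across several boots.
    """
    records = {}
    for line in lines:
        match = RECORD.search(line)
        if not match:
            continue
        seq, text = match.group(1), match.group(2)
        rec = records.setdefault(seq, {"device": None, "severity": None})
        severity = SEVERITY.search(text)
        address = PCI_ADDR.search(text)
        if severity:
            rec["severity"] = severity.group(1)
        elif address:
            rec["device"] = address.group(1)
    return Counter((r["device"], r["severity"]) for r in records.values())
```

Fed the record above, this counts one `corrected` event for `0000:80:02.3`. Run over a whole day of log lines, it would show whether the same device is always the one reporting.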

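I checked the BIOS SMBIOS event log. It contains only reboot code 0x17, showing that the machine started up after the reset, and it did not record any memory resets, which I had expected to see.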

Unfortunately the machine does not support IPMI, because the motherboard is a Supermicro X9DAi.

I'm not sure how to interpret the error codes in that hardware error message, but 0000:80:02 appears to correspond to:

[root@pc03 ~]# lspci -s 0000:80:02
80:02.0 PCI bridge: Intel Corporation Xeon E5/Core i7 IIO PCI Express Root Port 2a (rev 07)
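
Note that the logged `device_id` is `0000:80:02.3`, i.e. function 3, while the `lspci` line above is for function 0. As a small sketch, the address can be split into its domain, bus, device and function fields so the exact function can be queried. The names `PciAddress`, `parse_pci_address` and `lspci_selector` are hypothetical helpers, not part of any tool.

```python
from typing import NamedTuple


class PciAddress(NamedTuple):
    domain: int
    bus: int
    device: int
    function: int


def parse_pci_address(text):
    """Split 'DDDD:BB:dd.f' (all fields hexadecimal) into its parts."""
    domain, bus, rest = text.strip().split(":")
    device, function = rest.split(".")
    return PciAddress(int(domain, 16), int(bus, 16), int(device, 16), int(function, 16))


def lspci_selector(addr):
    """Format the address as a selector string usable with `lspci -s`."""
    return f"{addr.domain:04x}:{addr.bus:02x}:{addr.device:02x}.{addr.function:x}"


addr = parse_pci_address("0000:80:02.3")
print(addr)
print("lspci -vv -s", lspci_selector(addr))
```

Whether function 3 shows up at all on this machine is not something the output above reveals, so `lspci -vv -s 0000:80:02.3` is only a suggested next check.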

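I am currently monitoring the server's temperature/CPU, so the next time it crashes I should have a good picture of the sensor state. Are there any other steps I can take to determine the root cause of these crashes?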
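
For the temperature monitoring, one concern is that the last samples before a hard reset may never reach the disk. Below is a minimal sketch of a logger that syncs every sample. It assumes the sensors are exposed as `temp*_input` files (millidegrees Celsius) under `/sys/class/hwmon`, which may differ on this board. The function names and the log path in the usage comment are hypothetical.

```python
import glob
import os
import time


def read_hwmon_temps(root="/sys/class/hwmon"):
    """Return {relative sensor path: degrees Celsius} for readable sensors."""
    readings = {}
    pattern = os.path.join(root, "hwmon*", "temp*_input")
    for path in sorted(glob.glob(pattern)):
        try:
            with open(path) as fh:
                millidegrees = int(fh.read().strip())
        except (OSError, ValueError):
            continue  # sensor missing or not readable right now
        readings[os.path.relpath(path, root)] = millidegrees / 1000.0
    return readings


def append_sample(log_path, readings, now=None):
    """Append one timestamped line and fsync it so it survives a sudden reset."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    fields = " ".join(f"{name}={value:.1f}" for name, value in sorted(readings.items()))
    with open(log_path, "a") as fh:
        fh.write(f"{stamp} {fields}\n")
        fh.flush()
        os.fsync(fh.fileno())


# Example loop (hypothetical path and interval):
# while True:
#     append_sample("/var/log/temps.log", read_hwmon_temps())
#     time.sleep(10)
```

The `fsync` call costs a little I/O per sample, but it means the final line in the file is close to the moment of the reset.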
