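I have a server running CentOS 8. One day the kernel crashed, and I found the following three files in /var/crash: vmcore, vmcore-dmesg.txt, and kexec-dmesg.log.
I first looked at vmcore-dmesg.txt, which ends with the following messages: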
[291071.552140] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
[291071.552141] {2}[Hardware Error]: event severity: fatal
[291071.552141] {2}[Hardware Error]: Error 0, type: fatal
[291071.552142] {2}[Hardware Error]: section_type: PCIe error
[291071.552142] {2}[Hardware Error]: port_type: 4, root port
[291071.552142] {2}[Hardware Error]: version: 3.0
[291071.552143] {2}[Hardware Error]: command: 0x0547, status: 0x4010
[291071.552143] {2}[Hardware Error]: device_id: 0000:16:01.0
[291071.552143] {2}[Hardware Error]: slot: 82
[291071.552144] {2}[Hardware Error]: secondary_bus: 0x18
[291071.552144] {2}[Hardware Error]: vendor_id: 0x8086, device_id: 0x2031
[291071.552145] {2}[Hardware Error]: class_code: 000406
[291071.552145] {2}[Hardware Error]: bridge: secondary_status: 0x0000, control: 0x0013
[291071.552145] {2}[Hardware Error]: aer_uncor_status: 0x00000020, aer_uncor_mask: 0x00100000
[291071.552146] {2}[Hardware Error]: aer_uncor_severity: 0x00062030
[291071.552146] {2}[Hardware Error]: TLP Header: 00000000 00000000 00000000 00000000
[291071.552146] Kernel panic - not syncing: Fatal hardware error!
[291071.552147] CPU: 0 PID: 0 Comm: swapper/0 Kdump: loaded Not tainted 4.18.0-305.3.1.el8.x86_64 #1
[291071.552147] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./EPC621D8A, BIOS P2.10 04/03/2019
[291071.552148] Call Trace:
[291071.552148] <NMI>
[291071.552148] dump_stack+0x5c/0x80
[291071.552149] panic+0xe7/0x2a9
[291071.552149] __ghes_panic.cold.32+0x21/0x21
[291071.552149] ghes_notify_nmi+0x273/0x310
[291071.552149] nmi_handle+0x63/0x110
[291071.552150] default_do_nmi+0x49/0x100
[291071.552150] do_nmi+0x17e/0x1e0
[291071.552150] end_repeat_nmi+0x16/0x6f
[291071.552151] RIP: 0010:intel_idle+0x6b/0xb0
[291071.552151] Code: 40 5c 01 00 48 89 d1 0f 01 c8 48 8b 00 a8 08 75 19 e9 07 00 00 00 0f 00 2d 1e 01 55 00 c1 ee 18 b9 01 00 00 00 89 f0 0f 01 c9 <65> 48 8b 04 25 40 5c 01 00 f0 80 60 02 df f0 83 44 24 fc 00 48 8b
[291071.552152] RSP: 0018:ffffffff8fe03e40 EFLAGS: 00000002
[291071.552152] RAX: 0000000000000020 RBX: ffffffff8ff30ba8 RCX: 0000000000000001
[291071.552153] RDX: 0000000000000000 RSI: 0000000000000020 RDI: 0000000000000003
[291071.552153] RBP: ffff9e4a20835ad8 R08: 0000000000000002 R09: 0000000000029700
[291071.552154] R10: 0002cd7f37820a74 R11: ffff9e4a20828be4 R12: ffffffff8ff30a40
[291071.552154] R13: 0000000000000003 R14: 0000000000000003 R15: 0000000000000003
[291071.552154] ? intel_idle+0x6b/0xb0
[291071.552154] ? intel_idle+0x6b/0xb0
[291071.552155] </NMI>
[291071.552155] cpuidle_enter_state+0x87/0x3c0
[291071.552155] cpuidle_enter+0x2c/0x40
[291071.552156] do_idle+0x234/0x260
[291071.552156] cpu_startup_entry+0x6f/0x80
[291071.552156] start_kernel+0x518/0x538
[291071.552157] secondary_startup_64_no_verify+0xc2/0xcb
Using lspci, I can find 0000:16:01.0:
16:01.0 PCI bridge: Intel Corporation Sky Lake-E PCI Express Root Port B (rev 02)
It appears to be a PCIe root port. The device tree below it is:
lspci -s 16:01.0 -tvv
0000:16:01.0-[18-1b]----00.0-[19-1b]----03.0-[1a-1b]--+-00.0 Intel Corporation Ethernet Connection X722 for 1GbE
                                                      +-00.1 Intel Corporation Ethernet Connection X722 for 1GbE
                                                      +-00.2 Intel Corporation Ethernet Connection X722 for 1GbE
                                                      \-00.3 Intel Corporation Ethernet Connection X722 for 1GbE
Then I looked at the kexec-dmesg.log file, which contains:
[Thu Jun 10 20:02:45 2021] Memory manager not clean during takedown.
[Thu Jun 10 20:02:45 2021] WARNING: CPU: 0 PID: 399 at drivers/gpu/drm/drm_mm.c:999 drm_mm_takedown+0x1f/0x30 [drm]
[Thu Jun 10 20:02:45 2021] Modules linked in: amdgpu(+) sd_mod t10_pi sg iommu_v2 gpu_sched i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops crc32c_intel drm ahci libahci uas libata usb_storage dm_mirror dm_region_hash dm_log dm_mod fuse overlay squashfs loop
[Thu Jun 10 20:02:45 2021] CPU: 0 PID: 399 Comm: systemd-udevd Tainted: G W --------- - - 4.18.0-305.3.1.el8.x86_64 #1
[Thu Jun 10 20:02:45 2021] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./EPC621D8A, BIOS P2.10 04/03/2019
[Thu Jun 10 20:02:45 2021] RIP: 0010:drm_mm_takedown+0x1f/0x30 [drm]
[Thu Jun 10 20:02:45 2021] Code: f6 c3 48 8d 41 c0 eb bb 0f 1f 00 0f 1f 44 00 00 48 8b 47 38 48 83 c7 38 48 39 c7 75 01 c3 48 c7 c7 58 57 1b c0 e8 da b6 f6 c0 <0f> 0b c3 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 0f 1f 44 00 00
[Thu Jun 10 20:02:45 2021] RSP: 0018:ffffc90000747a10 EFLAGS: 00010282
[Thu Jun 10 20:02:45 2021] RAX: 0000000000000000 RBX: ffff88805d44caf0 RCX: ffffffff8265f1c8
[Thu Jun 10 20:02:45 2021] RDX: 0000000000000001 RSI: 0000000000000096 RDI: 0000000000000246
[Thu Jun 10 20:02:45 2021] RBP: ffff888050e65030 R08: 00000000000005e6 R09: 0000000000aaaaaa
[Thu Jun 10 20:02:45 2021] R10: 0000000000000000 R11: ffffc900009e0320 R12: ffff88805d44ca00
[Thu Jun 10 20:02:45 2021] R13: ffff888050e64f68 R14: 0000000000000000 R15: 0000000000000000
[Thu Jun 10 20:02:45 2021] FS: 00007f16a3901180(0000) GS:ffff88805ea00000(0000) knlGS:0000000000000000
[Thu Jun 10 20:02:45 2021] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Thu Jun 10 20:02:45 2021] CR2: 0000564d0235b008 CR3: 000000005d5b6002 CR4: 00000000007706b0
[Thu Jun 10 20:02:45 2021] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[Thu Jun 10 20:02:45 2021] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[Thu Jun 10 20:02:45 2021] PKRU: 55555554
[Thu Jun 10 20:02:45 2021] Call Trace:
[Thu Jun 10 20:02:45 2021] amdgpu_gtt_mgr_fini+0x2d/0x80 [amdgpu]
[Thu Jun 10 20:02:45 2021] ttm_bo_clean_mm+0xa8/0xc0 [ttm]
[Thu Jun 10 20:02:45 2021] amdgpu_ttm_fini+0x98/0xe0 [amdgpu]
[Thu Jun 10 20:02:45 2021] amdgpu_bo_fini+0xe/0x30 [amdgpu]
[Thu Jun 10 20:02:45 2021] gmc_v9_0_sw_fini+0x59/0xa0 [amdgpu]
[Thu Jun 10 20:02:45 2021] amdgpu_device_fini+0x297/0x4af [amdgpu]
[Thu Jun 10 20:02:45 2021] amdgpu_driver_unload_kms+0x3e/0x70 [amdgpu]
[Thu Jun 10 20:02:45 2021] amdgpu_driver_load_kms+0x122/0x2a0 [amdgpu]
[Thu Jun 10 20:02:45 2021] amdgpu_pci_probe+0xd1/0x150 [amdgpu]
[Thu Jun 10 20:02:45 2021] local_pci_probe+0x41/0x90
[Thu Jun 10 20:02:45 2021] pci_device_probe+0x105/0x1c0
[Thu Jun 10 20:02:45 2021] really_probe+0x255/0x4a0
[Thu Jun 10 20:02:45 2021] driver_probe_device+0x49/0xc0
[Thu Jun 10 20:02:45 2021] device_driver_attach+0x50/0x60
[Thu Jun 10 20:02:45 2021] __driver_attach+0x61/0x130
[Thu Jun 10 20:02:45 2021] ? device_driver_attach+0x60/0x60
[Thu Jun 10 20:02:45 2021] bus_for_each_dev+0x77/0xc0
[Thu Jun 10 20:02:45 2021] ? klist_add_tail+0x3b/0x70
[Thu Jun 10 20:02:45 2021] bus_add_driver+0x14d/0x1e0
[Thu Jun 10 20:02:45 2021] ? 0xffffffffc07d3000
[Thu Jun 10 20:02:45 2021] driver_register+0x6b/0xb0
[Thu Jun 10 20:02:45 2021] ? 0xffffffffc07d3000
[Thu Jun 10 20:02:45 2021] do_one_initcall+0x46/0x1c3
[Thu Jun 10 20:02:45 2021] ? do_init_module+0x22/0x220
[Thu Jun 10 20:02:45 2021] ? kmem_cache_alloc_trace+0x131/0x270
[Thu Jun 10 20:02:45 2021] do_init_module+0x5a/0x220
[Thu Jun 10 20:02:45 2021] load_module+0x14c5/0x17f0
[Thu Jun 10 20:02:45 2021] ? __switch_to_asm+0x35/0x70
[Thu Jun 10 20:02:45 2021] ? __switch_to_asm+0x41/0x70
[Thu Jun 10 20:02:45 2021] ? __switch_to_asm+0x35/0x70
[Thu Jun 10 20:02:45 2021] ? __switch_to_asm+0x41/0x70
[Thu Jun 10 20:02:45 2021] ? apic_timer_interrupt+0xa/0x20
[Thu Jun 10 20:02:45 2021] ? __do_sys_init_module+0x13b/0x180
[Thu Jun 10 20:02:45 2021] __do_sys_init_module+0x13b/0x180
[Thu Jun 10 20:02:45 2021] do_syscall_64+0x5b/0x1a0
[Thu Jun 10 20:02:45 2021] entry_SYSCALL_64_after_hwframe+0x65/0xca
[Thu Jun 10 20:02:45 2021] RIP: 0033:0x7f16a24df80e
[Thu Jun 10 20:02:45 2021] Code: 48 8b 0d 7d 16 2c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 4a 16 2c 00 f7 d8 64 89 01 48
[Thu Jun 10 20:02:45 2021] RSP: 002b:00007ffc5a383dd8 EFLAGS: 00000246 ORIG_RAX: 00000000000000af
[Thu Jun 10 20:02:45 2021] RAX: ffffffffffffffda RBX: 0000558aa33c7ee0 RCX: 00007f16a24df80e
[Thu Jun 10 20:02:45 2021] RDX: 0000558aa33c85e0 RSI: 00000000009621ec RDI: 0000558aa3def1a0
[Thu Jun 10 20:02:45 2021] RBP: 0000558aa33c85e0 R08: 0000558aa33c301a R09: 0000000000000003
[Thu Jun 10 20:02:45 2021] R10: 0000558aa33c3010 R11: 0000000000000246 R12: 0000558aa3def1a0
[Thu Jun 10 20:02:45 2021] R13: 0000558aa33dabf0 R14: 0000000000020000 R15: 0000000000000000
[Thu Jun 10 20:02:45 2021] ---[ end trace 0950097d77ca3e03 ]---
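This looks to me like it is related to the GPU driver.
As I understand it, when the kernel crashes, kdump uses kexec to boot a second kernel that dumps the crashed kernel's memory. So the logs read to me as if a PCIe hardware error crashed the main kernel, and then, when the kdump kernel booted, it hit a GPU driver error and crashed again. Is my understanding correct? Or is the log shown in kexec-dmesg.log actually a stack trace from the main kernel?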
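My second question is how to interpret these error messages. Since only the NIC appears to be attached to this PCIe root port, is there something wrong with my motherboard/CPU, or could the problem be in the kernel?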
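One way to start reading the fatal record is to decode the AER register values bit by bit. The sketch below is only an illustration. The bit names are my reading of the AER Uncorrectable Error Status layout in the PCI Express Base Specification, and `UNCORRECTABLE_BITS` and `decode_bits` are hypothetical names introduced here, not part of any kernel or library API. The three hex values are copied from vmcore-dmesg.txt.

```python
# Bit positions of the AER Uncorrectable Error Status register
# (assumed from the PCI Express Base Specification; not exhaustive).
UNCORRECTABLE_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request",
    21: "ACS Violation",
    22: "Uncorrectable Internal Error",
}


def decode_bits(value, names):
    """Return the names of all bits set in value; unknown bits are labeled."""
    return [
        names.get(bit, f"unknown bit {bit}")
        for bit in range(32)
        if value & (1 << bit)
    ]


# Values copied from the fatal record in vmcore-dmesg.txt.
aer_uncor_status = 0x00000020
aer_uncor_mask = 0x00100000
aer_uncor_severity = 0x00062030

# Errors that are set and not masked.
reported = decode_bits(aer_uncor_status & ~aer_uncor_mask, UNCORRECTABLE_BITS)
# Of the set errors, those whose severity bit marks them as fatal.
fatal = decode_bits(aer_uncor_status & aer_uncor_severity, UNCORRECTABLE_BITS)

print("reported:", reported)
print("treated as fatal:", fatal)
```

Under these assumptions, the status word decodes to a Surprise Down Error on the root port, and the severity register marks that error as fatal. This is only a reading of the register bits; it does not by itself say whether the cause is the board, the NIC, or something else.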
In addition, I found that the following error shows up frequently in the logs under /var/log, but it does not cause a kernel crash:
Jun 7 11:12:20 localhost kernel: {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
Jun 7 11:12:20 localhost kernel: {1}[Hardware Error]: It has been corrected by h/w and requires no further action
Jun 7 11:12:20 localhost kernel: {1}[Hardware Error]: event severity: corrected
Jun 7 11:12:20 localhost kernel: {1}[Hardware Error]: Error 0, type: corrected
Jun 7 11:12:20 localhost kernel: {1}[Hardware Error]: section_type: PCIe error
Jun 7 11:12:20 localhost kernel: {1}[Hardware Error]: port_type: 5, upstream switch port
Jun 7 11:12:20 localhost kernel: {1}[Hardware Error]: version: 3.0
Jun 7 11:12:20 localhost kernel: {1}[Hardware Error]: command: 0x0147, status: 0x0010
Jun 7 11:12:20 localhost kernel: {1}[Hardware Error]: device_id: 0000:18:00.0
Jun 7 11:12:20 localhost kernel: {1}[Hardware Error]: slot: 82
Jun 7 11:12:20 localhost kernel: {1}[Hardware Error]: secondary_bus: 0x19
Jun 7 11:12:20 localhost kernel: {1}[Hardware Error]: vendor_id: 0x8086, device_id: 0x37c0
Jun 7 11:12:20 localhost kernel: {1}[Hardware Error]: class_code: 000406
Jun 7 11:12:20 localhost kernel: {1}[Hardware Error]: bridge: secondary_status: 0x2000, control: 0x0013
Jun 7 11:12:20 localhost kernel: pcieport 0000:18:00.0: aer_status: 0x00003000, aer_mask: 0x00002000
Jun 7 11:12:20 localhost kernel: pcieport 0000:18:00.0: [12] Timeout
Jun 7 11:12:20 localhost kernel: pcieport 0000:18:00.0: aer_layer=Data Link Layer, aer_agent=Transmitter ID
Here 18:00.0 is a PCI bridge:
18:00.0 PCI bridge: Intel Corporation Device 37c0 (rev 09)
and the device tree below it is:
lspci -s 18:00.0 -tvv
0000:18:00.0-[19-1b]----03.0-[1a-1b]--+-00.0 Intel Corporation Ethernet Connection X722 for 1GbE
                                      +-00.1 Intel Corporation Ethernet Connection X722 for 1GbE
                                      +-00.2 Intel Corporation Ethernet Connection X722 for 1GbE
                                      \-00.3 Intel Corporation Ethernet Connection X722 for 1GbE
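The corrected errors can be read the same way. The following is a minimal sketch, assuming the AER Correctable Error Status bit layout from the PCI Express Base Specification. It pulls `aer_status` and `aer_mask` out of a `pcieport` log line and lists the unmasked bits. `CORRECTABLE_BITS`, `AER_LINE`, and `decode_line` are hypothetical names used only for this illustration, and the sample line is copied from the log above.

```python
import re

# Bit positions of the AER Correctable Error Status register
# (assumed from the PCI Express Base Specification).
CORRECTABLE_BITS = {
    0: "Receiver Error",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "REPLAY_NUM Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal Error",
    14: "Corrected Internal Error",
    15: "Header Log Overflow",
}

# Matches lines such as:
#   pcieport 0000:18:00.0: aer_status: 0x00003000, aer_mask: 0x00002000
AER_LINE = re.compile(
    r"pcieport (?P<bdf>[0-9a-f:.]+): "
    r"aer_status: 0x(?P<status>[0-9a-f]+), aer_mask: 0x(?P<mask>[0-9a-f]+)"
)


def decode_line(line):
    """Return (device, [unmasked error names]) or None if the line does not match."""
    match = AER_LINE.search(line)
    if match is None:
        return None
    status = int(match["status"], 16)
    mask = int(match["mask"], 16)
    unmasked = status & ~mask
    names = [
        name
        for bit, name in sorted(CORRECTABLE_BITS.items())
        if unmasked & (1 << bit)
    ]
    return match["bdf"], names


sample = (
    "Jun 7 11:12:20 localhost kernel: pcieport 0000:18:00.0: "
    "aer_status: 0x00003000, aer_mask: 0x00002000"
)
result = decode_line(sample)
print(result)
```

With these assumptions, bit 13 (Advisory Non-Fatal Error) is masked and only bit 12 (Replay Timer Timeout) remains, which is consistent with the `[12] Timeout` line and the Data Link Layer / Transmitter ID fields in the log. Running the same function over every matching line in a log file would give a rough count of how often each device reports this.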
Any help would be greatly appreciated.