How to understand a kernel crash using vmcore-dmesg.txt and kexec-dmesg.log

I have a server running CentOS 8. One day the kernel crashed, and I found the following three files under /var/crash: vmcore, vmcore-dmesg.txt, and kexec-dmesg.log.

I first looked at vmcore-dmesg.txt, which ends with the following messages:

[291071.552140] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
[291071.552141] {2}[Hardware Error]: event severity: fatal
[291071.552141] {2}[Hardware Error]:  Error 0, type: fatal
[291071.552142] {2}[Hardware Error]:   section_type: PCIe error
[291071.552142] {2}[Hardware Error]:   port_type: 4, root port
[291071.552142] {2}[Hardware Error]:   version: 3.0
[291071.552143] {2}[Hardware Error]:   command: 0x0547, status: 0x4010
[291071.552143] {2}[Hardware Error]:   device_id: 0000:16:01.0
[291071.552143] {2}[Hardware Error]:   slot: 82
[291071.552144] {2}[Hardware Error]:   secondary_bus: 0x18
[291071.552144] {2}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x2031
[291071.552145] {2}[Hardware Error]:   class_code: 000406
[291071.552145] {2}[Hardware Error]:   bridge: secondary_status: 0x0000, control: 0x0013
[291071.552145] {2}[Hardware Error]:   aer_uncor_status: 0x00000020, aer_uncor_mask: 0x00100000
[291071.552146] {2}[Hardware Error]:   aer_uncor_severity: 0x00062030
[291071.552146] {2}[Hardware Error]:   TLP Header: 00000000 00000000 00000000 00000000
[291071.552146] Kernel panic - not syncing: Fatal hardware error!
[291071.552147] CPU: 0 PID: 0 Comm: swapper/0 Kdump: loaded Not tainted 4.18.0-305.3.1.el8.x86_64 #1
[291071.552147] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./EPC621D8A, BIOS P2.10 04/03/2019
[291071.552148] Call Trace:
[291071.552148]  <NMI>
[291071.552148]  dump_stack+0x5c/0x80
[291071.552149]  panic+0xe7/0x2a9
[291071.552149]  __ghes_panic.cold.32+0x21/0x21
[291071.552149]  ghes_notify_nmi+0x273/0x310
[291071.552149]  nmi_handle+0x63/0x110
[291071.552150]  default_do_nmi+0x49/0x100
[291071.552150]  do_nmi+0x17e/0x1e0
[291071.552150]  end_repeat_nmi+0x16/0x6f
[291071.552151] RIP: 0010:intel_idle+0x6b/0xb0
[291071.552151] Code: 40 5c 01 00 48 89 d1 0f 01 c8 48 8b 00 a8 08 75 19 e9 07 00 00 00 0f 00 2d 1e 01 55 00 c1 ee 18 b9 01 00 00 00 89 f0 0f 01 c9 <65> 48 8b 04 25 40 5c 01 00 f0 80 60 02 df f0 83 44 24 fc 00 48 8b
[291071.552152] RSP: 0018:ffffffff8fe03e40 EFLAGS: 00000002
[291071.552152] RAX: 0000000000000020 RBX: ffffffff8ff30ba8 RCX: 0000000000000001
[291071.552153] RDX: 0000000000000000 RSI: 0000000000000020 RDI: 0000000000000003
[291071.552153] RBP: ffff9e4a20835ad8 R08: 0000000000000002 R09: 0000000000029700
[291071.552154] R10: 0002cd7f37820a74 R11: ffff9e4a20828be4 R12: ffffffff8ff30a40
[291071.552154] R13: 0000000000000003 R14: 0000000000000003 R15: 0000000000000003
[291071.552154]  ? intel_idle+0x6b/0xb0
[291071.552154]  ? intel_idle+0x6b/0xb0
[291071.552155]  </NMI>
[291071.552155]  cpuidle_enter_state+0x87/0x3c0
[291071.552155]  cpuidle_enter+0x2c/0x40
[291071.552156]  do_idle+0x234/0x260
[291071.552156]  cpu_startup_entry+0x6f/0x80
[291071.552156]  start_kernel+0x518/0x538
[291071.552157]  secondary_startup_64_no_verify+0xc2/0xcb
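
The aer_uncor_status, aer_uncor_mask, and aer_uncor_severity fields in this record are bitmasks, so they can be decoded mechanically. Here is a minimal sketch, assuming the bit layout of the PCIe AER Uncorrectable Error Status register. The table `UNCORRECTABLE_BITS` and the helper `decode_bits` are names I made up for illustration, and the table only lists the bits I am sure about:

```python
# Bit positions of the PCIe AER Uncorrectable Error Status register.
# Assumption: names follow the PCIe specification; bits not listed here
# are reported by number instead of by name.
UNCORRECTABLE_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request",
    21: "ACS Violation",
}


def decode_bits(value, names):
    """Return the names of all set bits in a 32-bit value, lowest bit first."""
    found = []
    for bit in range(32):
        if value & (1 << bit):
            found.append(names.get(bit, f"bit {bit} (not in table)"))
    return found


# Values copied from the fatal record in vmcore-dmesg.txt.
status = 0x00000020
mask = 0x00100000
severity = 0x00062030

# Errors that are set in the status register and not masked.
print("reported:", decode_bits(status & ~mask, UNCORRECTABLE_BITS))
# Error types the port is configured to treat as fatal.
print("treated as fatal:", decode_bits(severity, UNCORRECTABLE_BITS))
```

If the table is right, the only reported error is bit 5 (Surprise Down Error), and that bit is also set in the severity register, which would match the "fatal" severity in the record.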

Using lspci, I found that 0000:16:01.0 is "16:01.0 PCI bridge: Intel Corporation Sky Lake-E PCI Express Root Port B (rev 02)", which appears to be a PCIe root port. Its device tree is:

lspci -s 16:01.0 -tvv
0000:16:01.0-[18-1b]----00.0-[19-1b]----03.0-[1a-1b]--+-00.0  Intel Corporation Ethernet Connection X722 for 1GbE
                                                      +-00.1  Intel Corporation Ethernet Connection X722 for 1GbE
                                                      +-00.2  Intel Corporation Ethernet Connection X722 for 1GbE
                                                      \-00.3  Intel Corporation Ethernet Connection X722 for 1GbE

I then looked at kexec-dmesg.log, which contains:

[Thu Jun 10 20:02:45 2021] Memory manager not clean during takedown.
[Thu Jun 10 20:02:45 2021] WARNING: CPU: 0 PID: 399 at drivers/gpu/drm/drm_mm.c:999 drm_mm_takedown+0x1f/0x30 [drm]
[Thu Jun 10 20:02:45 2021] Modules linked in: amdgpu(+) sd_mod t10_pi sg iommu_v2 gpu_sched i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops crc32c_intel drm ahci libahci uas libata usb_storage dm_mirror dm_region_hash dm_log dm_mod fuse overlay squashfs loop
[Thu Jun 10 20:02:45 2021] CPU: 0 PID: 399 Comm: systemd-udevd Tainted: G        W        --------- -  - 4.18.0-305.3.1.el8.x86_64 #1
[Thu Jun 10 20:02:45 2021] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./EPC621D8A, BIOS P2.10 04/03/2019
[Thu Jun 10 20:02:45 2021] RIP: 0010:drm_mm_takedown+0x1f/0x30 [drm]
[Thu Jun 10 20:02:45 2021] Code: f6 c3 48 8d 41 c0 eb bb 0f 1f 00 0f 1f 44 00 00 48 8b 47 38 48 83 c7 38 48 39 c7 75 01 c3 48 c7 c7 58 57 1b c0 e8 da b6 f6 c0 <0f> 0b c3 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 0f 1f 44 00 00
[Thu Jun 10 20:02:45 2021] RSP: 0018:ffffc90000747a10 EFLAGS: 00010282
[Thu Jun 10 20:02:45 2021] RAX: 0000000000000000 RBX: ffff88805d44caf0 RCX: ffffffff8265f1c8
[Thu Jun 10 20:02:45 2021] RDX: 0000000000000001 RSI: 0000000000000096 RDI: 0000000000000246
[Thu Jun 10 20:02:45 2021] RBP: ffff888050e65030 R08: 00000000000005e6 R09: 0000000000aaaaaa
[Thu Jun 10 20:02:45 2021] R10: 0000000000000000 R11: ffffc900009e0320 R12: ffff88805d44ca00
[Thu Jun 10 20:02:45 2021] R13: ffff888050e64f68 R14: 0000000000000000 R15: 0000000000000000
[Thu Jun 10 20:02:45 2021] FS:  00007f16a3901180(0000) GS:ffff88805ea00000(0000) knlGS:0000000000000000
[Thu Jun 10 20:02:45 2021] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Thu Jun 10 20:02:45 2021] CR2: 0000564d0235b008 CR3: 000000005d5b6002 CR4: 00000000007706b0
[Thu Jun 10 20:02:45 2021] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[Thu Jun 10 20:02:45 2021] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[Thu Jun 10 20:02:45 2021] PKRU: 55555554
[Thu Jun 10 20:02:45 2021] Call Trace:
[Thu Jun 10 20:02:45 2021]  amdgpu_gtt_mgr_fini+0x2d/0x80 [amdgpu]
[Thu Jun 10 20:02:45 2021]  ttm_bo_clean_mm+0xa8/0xc0 [ttm]
[Thu Jun 10 20:02:45 2021]  amdgpu_ttm_fini+0x98/0xe0 [amdgpu]
[Thu Jun 10 20:02:45 2021]  amdgpu_bo_fini+0xe/0x30 [amdgpu]
[Thu Jun 10 20:02:45 2021]  gmc_v9_0_sw_fini+0x59/0xa0 [amdgpu]
[Thu Jun 10 20:02:45 2021]  amdgpu_device_fini+0x297/0x4af [amdgpu]
[Thu Jun 10 20:02:45 2021]  amdgpu_driver_unload_kms+0x3e/0x70 [amdgpu]
[Thu Jun 10 20:02:45 2021]  amdgpu_driver_load_kms+0x122/0x2a0 [amdgpu]
[Thu Jun 10 20:02:45 2021]  amdgpu_pci_probe+0xd1/0x150 [amdgpu]
[Thu Jun 10 20:02:45 2021]  local_pci_probe+0x41/0x90
[Thu Jun 10 20:02:45 2021]  pci_device_probe+0x105/0x1c0
[Thu Jun 10 20:02:45 2021]  really_probe+0x255/0x4a0
[Thu Jun 10 20:02:45 2021]  driver_probe_device+0x49/0xc0
[Thu Jun 10 20:02:45 2021]  device_driver_attach+0x50/0x60
[Thu Jun 10 20:02:45 2021]  __driver_attach+0x61/0x130
[Thu Jun 10 20:02:45 2021]  ? device_driver_attach+0x60/0x60
[Thu Jun 10 20:02:45 2021]  bus_for_each_dev+0x77/0xc0
[Thu Jun 10 20:02:45 2021]  ? klist_add_tail+0x3b/0x70
[Thu Jun 10 20:02:45 2021]  bus_add_driver+0x14d/0x1e0
[Thu Jun 10 20:02:45 2021]  ? 0xffffffffc07d3000
[Thu Jun 10 20:02:45 2021]  driver_register+0x6b/0xb0
[Thu Jun 10 20:02:45 2021]  ? 0xffffffffc07d3000
[Thu Jun 10 20:02:45 2021]  do_one_initcall+0x46/0x1c3
[Thu Jun 10 20:02:45 2021]  ? do_init_module+0x22/0x220
[Thu Jun 10 20:02:45 2021]  ? kmem_cache_alloc_trace+0x131/0x270
[Thu Jun 10 20:02:45 2021]  do_init_module+0x5a/0x220
[Thu Jun 10 20:02:45 2021]  load_module+0x14c5/0x17f0
[Thu Jun 10 20:02:45 2021]  ? __switch_to_asm+0x35/0x70
[Thu Jun 10 20:02:45 2021]  ? __switch_to_asm+0x41/0x70
[Thu Jun 10 20:02:45 2021]  ? __switch_to_asm+0x35/0x70
[Thu Jun 10 20:02:45 2021]  ? __switch_to_asm+0x41/0x70
[Thu Jun 10 20:02:45 2021]  ? apic_timer_interrupt+0xa/0x20
[Thu Jun 10 20:02:45 2021]  ? __do_sys_init_module+0x13b/0x180
[Thu Jun 10 20:02:45 2021]  __do_sys_init_module+0x13b/0x180
[Thu Jun 10 20:02:45 2021]  do_syscall_64+0x5b/0x1a0
[Thu Jun 10 20:02:45 2021]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[Thu Jun 10 20:02:45 2021] RIP: 0033:0x7f16a24df80e
[Thu Jun 10 20:02:45 2021] Code: 48 8b 0d 7d 16 2c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 4a 16 2c 00 f7 d8 64 89 01 48
[Thu Jun 10 20:02:45 2021] RSP: 002b:00007ffc5a383dd8 EFLAGS: 00000246 ORIG_RAX: 00000000000000af
[Thu Jun 10 20:02:45 2021] RAX: ffffffffffffffda RBX: 0000558aa33c7ee0 RCX: 00007f16a24df80e
[Thu Jun 10 20:02:45 2021] RDX: 0000558aa33c85e0 RSI: 00000000009621ec RDI: 0000558aa3def1a0
[Thu Jun 10 20:02:45 2021] RBP: 0000558aa33c85e0 R08: 0000558aa33c301a R09: 0000000000000003
[Thu Jun 10 20:02:45 2021] R10: 0000558aa33c3010 R11: 0000000000000246 R12: 0000558aa3def1a0
[Thu Jun 10 20:02:45 2021] R13: 0000558aa33dabf0 R14: 0000000000020000 R15: 0000000000000000
[Thu Jun 10 20:02:45 2021] ---[ end trace 0950097d77ca3e03 ]---

This looks to me like it is related to the GPU driver.

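As I understand it, when the kernel crashes, kdump uses kexec to boot a second kernel that dumps the memory of the crashed one. So the logs look to me as if a PCIe hardware error crashed the main kernel, and then, when the kdump kernel booted, it crashed again because of a GPU driver error. Is my understanding correct? Or is the log in kexec-dmesg.log actually a stack trace from the main kernel?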
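
One thing that can be checked mechanically is the header line of each trace, since a panic and a warning are announced differently. Here is a minimal sketch, assuming only the two header formats seen in the excerpts above ("Kernel panic - not syncing: ..." and "WARNING: CPU: ... PID: ... at ..."). `HEADERS` and `find_headers` are hypothetical names, and other headers such as BUG or Oops are deliberately not covered:

```python
import re

# Assumption: only the two header formats that appear in my excerpts.
HEADERS = [
    ("panic", re.compile(r"Kernel panic - not syncing: (.*)")),
    ("warning", re.compile(r"WARNING: CPU: \d+ PID: \d+ at (\S+)")),
]


def find_headers(lines):
    """Return (kind, detail) for every panic or WARNING header line."""
    hits = []
    for line in lines:
        for kind, pattern in HEADERS:
            match = pattern.search(line)
            if match:
                hits.append((kind, match.group(1).strip()))
                break
    return hits


# One header line copied from each of the two log excerpts.
vmcore_dmesg = [
    "[291071.552146] Kernel panic - not syncing: Fatal hardware error!",
]
kexec_dmesg = [
    "[Thu Jun 10 20:02:45 2021] WARNING: CPU: 0 PID: 399 at "
    "drivers/gpu/drm/drm_mm.c:999 drm_mm_takedown+0x1f/0x30 [drm]",
]

print(find_headers(vmcore_dmesg))
print(find_headers(kexec_dmesg))
```

On these excerpts only vmcore-dmesg.txt has a panic header. The kexec-dmesg.log excerpt has a WARNING header, which by itself normally does not stop the kernel unless panic_on_warn is set, so I am not sure the second kernel actually crashed.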

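My second question is how to interpret these error messages. Since only the NIC appears to be attached to this PCIe root port, is something wrong with my motherboard or CPU, or could the problem be in the kernel?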

In addition, I found that the following error appears frequently in the logs under /var/log, but it does not cause a kernel crash:

Jun  7 11:12:20 localhost kernel: {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
Jun  7 11:12:20 localhost kernel: {1}[Hardware Error]: It has been corrected by h/w and requires no further action
Jun  7 11:12:20 localhost kernel: {1}[Hardware Error]: event severity: corrected
Jun  7 11:12:20 localhost kernel: {1}[Hardware Error]:  Error 0, type: corrected
Jun  7 11:12:20 localhost kernel: {1}[Hardware Error]:   section_type: PCIe error
Jun  7 11:12:20 localhost kernel: {1}[Hardware Error]:   port_type: 5, upstream switch port
Jun  7 11:12:20 localhost kernel: {1}[Hardware Error]:   version: 3.0
Jun  7 11:12:20 localhost kernel: {1}[Hardware Error]:   command: 0x0147, status: 0x0010
Jun  7 11:12:20 localhost kernel: {1}[Hardware Error]:   device_id: 0000:18:00.0
Jun  7 11:12:20 localhost kernel: {1}[Hardware Error]:   slot: 82
Jun  7 11:12:20 localhost kernel: {1}[Hardware Error]:   secondary_bus: 0x19
Jun  7 11:12:20 localhost kernel: {1}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x37c0
Jun  7 11:12:20 localhost kernel: {1}[Hardware Error]:   class_code: 000406
Jun  7 11:12:20 localhost kernel: {1}[Hardware Error]:   bridge: secondary_status: 0x2000, control: 0x0013
Jun  7 11:12:20 localhost kernel: pcieport 0000:18:00.0: aer_status: 0x00003000, aer_mask: 0x00002000
Jun  7 11:12:20 localhost kernel: pcieport 0000:18:00.0:    [12] Timeout               
Jun  7 11:12:20 localhost kernel: pcieport 0000:18:00.0: aer_layer=Data Link Layer, aer_agent=Transmitter ID

Here 18:00.0 is the PCI bridge "18:00.0 PCI bridge: Intel Corporation Device 37c0 (rev 09)", and its device tree is:

 lspci -s 18:00.0 -tvv
0000:18:00.0-[19-1b]----03.0-[1a-1b]--+-00.0  Intel Corporation Ethernet Connection X722 for 1GbE
                                      +-00.1  Intel Corporation Ethernet Connection X722 for 1GbE
                                      +-00.2  Intel Corporation Ethernet Connection X722 for 1GbE
                                      \-00.3  Intel Corporation Ethernet Connection X722 for 1GbE

Any help would be greatly appreciated.
