How can we use kdump to determine what caused our system (CentOS 7) to crash?

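We currently have 3 systems running CentOS on identical hardware and software configurations, and all of them experience random hangs. A hang can occur within 20 minutes of booting, or not until 1 or 2 weeks later. We booted a standalone live Ubuntu image and ran stress tests continuously without any problems. We suspect a driver or some software installed on our systems, but we are not sure how to identify the cause.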

What should we do to determine what is causing our systems to hang?

  KERNEL: /lib/debug/lib/modules/3.10.0-1062.12.1.el7.x86_64/vmlinux
DUMPFILE: /var/crash/127.0.0.1-2020-08-28-19:02:49/vmcore  [PARTIAL DUMP]
    CPUS: 72
    DATE: Fri Aug 28 19:02:35 2020
  UPTIME: 6 days, 13:03:56 LOAD AVERAGE: 7.87, 7.35, 7.45
   TASKS: 5679
NODENAME: zagreb
 RELEASE: 3.10.0-1062.12.1.el7.x86_64
 VERSION: #1 SMP Tue Feb 4 23:02:59 UTC 2020
 MACHINE: x86_64  (3000 Mhz)
  MEMORY: 1023.4 GB
   PANIC: "BUG: unable to handle kernel NULL pointer dereference at           (null)"
     PID: 19718
 COMMAND: "9_scheduler"
    TASK: ffff8a8bc9ab1070  [THREAD_INFO: ffff8a8be0618000]
     CPU: 34
   STATE: TASK_RUNNING (PANIC)
crash>

Here is the backtrace:

crash> bt
PID: 19718  TASK: ffff8a8bc9ab1070  CPU: 34  COMMAND: "9_scheduler"
 #0 [ffff8a8be061ba90] machine_kexec at ffffffff90665b34
 #1 [ffff8a8be061baf0] __crash_kexec at ffffffff90722352
 #2 [ffff8a8be061bbc0] crash_kexec at ffffffff90722440
 #3 [ffff8a8be061bbd8] oops_end at ffffffff90d85798
 #4 [ffff8a8be061bc00] no_context at ffffffff90675bb4
 #5 [ffff8a8be061bc50] __bad_area_nosemaphore at ffffffff90675e82
 #6 [ffff8a8be061bca0] bad_area_nosemaphore at ffffffff90675fa4
 #7 [ffff8a8be061bcb0] __do_page_fault at ffffffff90d88750
 #8 [ffff8a8be061bd20] do_page_fault at ffffffff90d88975
 #9 [ffff8a8be061bd50] page_fault at ffffffff90d84778
    [exception RIP: anon_vma_clone+117]
    RIP: ffffffff908008e5  RSP: ffff8a8be061be08  RFLAGS: 00010286
    RAX: ffff8a90d42e95f0  RBX: 0000000000000000  RCX: 0000000000ea39f5
    RDX: 0000000000000040  RSI: 0000000000000200  RDI: ffff8a0f7fc07b00
    RBP: ffff8a8be061be48   R8: 000000000001f0a0   R9: ffffffff908008d4
    R10: ffff8ad35135e0c0  R11: 0000000000000000  R12: ffff8a90d42e9d18
    R13: ffff8b0bea29d410  R14: ffff8a90d42e9cb0  R15: ffff8a90d42e95f0
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
#10 [ffff8a8be061be50] __split_vma at ffffffff907f962e
#11 [ffff8a8be061be90] do_munmap at ffffffff907f992a
#12 [ffff8a8be061bee0] vm_munmap at ffffffff907f9cb5
#13 [ffff8a8be061bf30] sys_munmap at ffffffff907faf52
#14 [ffff8a8be061bf50] system_call_fastpath at ffffffff90d8dede
    RIP: 00007f1ef3f82dd7  RSP: 00007f1e53ffebc0  RFLAGS: 00000246
    RAX: 000000000000000b  RBX: 0000000000040000  RCX: 00007f1ef3f6d727
    RDX: 0000000000000003  RSI: 0000000000040000  RDI: 00007f1d2af40000
    RBP: 0000000000922a40   R8: ffffffffffffffff   R9: 0000000000000000
    R10: 0000000000000022  R11: 0000000000000246  R12: 00007f1e53ffea58
    R13: 00007f1d2af00000  R14: 0000000000000000  R15: 0000000000000000
    ORIG_RAX: 000000000000000b  CS: 0033  SS: 002b
crash>
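
To make the trace above easier to read, here is a minimal sketch that pulls out the faulting function and the kernel call path leading to it. It is not part of the crash utility: `parse_bt` and `call_path` are hypothetical helper names, and the regular expressions assume the `bt` layout shown above (frames as `#N [stack] symbol at address`, and the offset after `+` printed in decimal).

```python
import re

# Excerpt of the `bt` output above, from the page fault down to the syscall entry.
BT_TEXT = """\
 #9 [ffff8a8be061bd50] page_fault at ffffffff90d84778
    [exception RIP: anon_vma_clone+117]
    RIP: ffffffff908008e5  RSP: ffff8a8be061be08  RFLAGS: 00010286
#10 [ffff8a8be061be50] __split_vma at ffffffff907f962e
#11 [ffff8a8be061be90] do_munmap at ffffffff907f992a
#12 [ffff8a8be061bee0] vm_munmap at ffffffff907f9cb5
#13 [ffff8a8be061bf30] sys_munmap at ffffffff907faf52
#14 [ffff8a8be061bf50] system_call_fastpath at ffffffff90d8dede
"""

# A frame line looks like: "#10 [<stack addr>] <symbol> at <text addr>".
FRAME_RE = re.compile(r"^[ \t]*#(\d+)\s+\[[0-9a-f]+\]\s+(\S+)\s+at\s+([0-9a-f]+)", re.M)
# The faulting instruction is reported as "[exception RIP: <symbol>+<offset>]".
EXC_RE = re.compile(r"\[exception RIP: ([^+\]\s]+)\+(\d+)\]")


def parse_bt(text):
    """Return ((symbol, offset) of the exception RIP or None, list of frames)."""
    frames = [(int(n), sym, int(addr, 16)) for n, sym, addr in FRAME_RE.findall(text)]
    m = EXC_RE.search(text)
    fault = (m.group(1), int(m.group(2))) if m else None
    return fault, frames


def call_path(text):
    """Return the call chain from syscall entry down to the faulting function."""
    m = EXC_RE.search(text)
    if m is None:
        return []
    # Frames printed after the exception line are the callers of the faulting function.
    callers = [sym for _, sym, _ in FRAME_RE.findall(text[m.end():])]
    return list(reversed(callers)) + [m.group(1)]


if __name__ == "__main__":
    fault, _ = parse_bt(BT_TEXT)
    print("faulting instruction:", "%s+%d" % fault)
    print(" -> ".join(call_path(BT_TEXT)))
```

Read this way, the NULL pointer dereference happens inside `anon_vma_clone`, reached from a `munmap` system call issued by the `9_scheduler` process. The frames above the exception (`page_fault` through `machine_kexec`) are only the fault handling and the kdump capture itself.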
