Unable to get kdump to dump a vmcore using crashkernel


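I am trying to get kernel crash dumps working on Ubuntu 15.10 with the stock kernel, 4.2.0-22-generic. I have followed the methods described here and here exactly. However, when I manually trigger a crash via: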

echo c | sudo tee /proc/sysrq-trigger

the system crashes and reboots, but no crash output is saved in /var/crash.

Because this is EC2, I do not have a read/write console. I can only get read-only console output, and I do not see much that is useful in it:

[  473.666303] sysrq: SysRq : Trigger a crash
[  473.668278] BUG: unable to handle kernel NULL pointer dereference at           (null)
[  473.671624] IP: [<ffffffff814c79e6>] sysrq_handle_crash+0x16/0x20
[  473.672244] PGD 3e235c067 PUD 3e2351067 PMD 0
[  473.672244] Oops: 0002 [#1] SMP
[  473.672244] Modules linked in: isofs xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack xt_tcpudp bridge stp llc iptable_filter ip_tables x_tables ppdev xen_fbfront intel_rapl fb_sys_fops iosf_mbi input_leds serio_raw parport_pc 8250_fintek i2c_piix4 parport mac_hid autofs4 crct10dif_pclmul crc32_pclmul cirrus syscopyarea aesni_intel aes_x86_64 sysfillrect lrw sysimgblt gf128mul ttm glue_helper ablk_helper drm_kms_helper cryptd psmouse drm ixgbevf pata_acpi floppy
[  473.672244] CPU: 3 PID: 2814 Comm: bash Not tainted 4.2.0-22-generic #27-Ubuntu
[  473.672244] Hardware name: Xen HVM domU, BIOS 4.2.amazon 12/07/2015
[  473.672244] task: ffff8803d1d86e00 ti: ffff8803dc414000 task.ti: ffff8803dc414000
[  473.672244] RIP: 0010:[<ffffffff814c79e6>]  [<ffffffff814c79e6>] sysrq_handle_crash+0x16/0x20
[  473.672244] RSP: 0018:ffff8803dc417e28  EFLAGS: 00010246
[  473.672244] RAX: 000000000000000f RBX: 0000000000000063 RCX: 0000000000000000
[  473.672244] RDX: 0000000000000000 RSI: ffff8803ff2ce938 RDI: 0000000000000063
[  473.672244] RBP: ffff8803dc417e28 R08: 0000000000000002 R09: 000000000000024d
[  473.672244] R10: 000000000000a614 R11: 000000000000024d R12: 0000000000000004
[  473.672244] R13: 0000000000000000 R14: ffffffff81cb48e0 R15: 0000000000000000
[  473.672244] FS:  00007fecdb3ca700(0000) GS:ffff8803ff2c0000(0000) knlGS:0000000000000000
[  473.672244] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  473.672244] CR2: 0000000000000000 CR3: 00000003e234b000 CR4: 00000000001406e0
[  473.672244] Stack:
[  473.672244]  ffff8803dc417e58 ffffffff814c821a 0000000000000002 fffffffffffffffb
[  473.672244]  ffff8803dc417f18 0000000000000002 ffff8803dc417e78 ffffffff814c86a3
[  473.672244]  0000000000000002 ffff8803f816c900 ffff8803dc417e98 ffffffff81266aa2
[  473.672244] Call Trace:
[  473.672244]  [<ffffffff814c821a>] __handle_sysrq+0xea/0x140
[  473.672244]  [<ffffffff814c86a3>] write_sysrq_trigger+0x33/0x40
[  473.672244]  [<ffffffff81266aa2>] proc_reg_write+0x42/0x70
[  473.672244]  [<ffffffff811fca68>] __vfs_write+0x18/0x40
[  473.672244]  [<ffffffff811fd3f6>] vfs_write+0xa6/0x1a0
[  473.672244]  [<ffffffff810c3e21>] ? __raw_callee_save___pv_queued_spin_unlock+0x11/0x20
[  473.672244]  [<ffffffff811fe0e5>] SyS_write+0x55/0xc0
[  473.672244]  [<ffffffff8121b31f>] ? __close_fd+0x8f/0xb0
[  473.672244]  [<ffffffff817f02b2>] entry_SYSCALL_64_fastpath+0x16/0x75
[  473.672244] Code: 45 3b 7d 34 75 e5 4c 89 ef e8 f7 f7 ff ff eb db 0f 1f 44 00 00 0f 1f 44 00 00 55 c7 05 a8 74 a2 00 01 00 00 00 48 89 e5 0f ae f8 <c6> 04 25 00 00 00 00 01 5d c3 0f 1f 44 00 00 55 48 89 e5 53 8d
[  473.672244] RIP  [<ffffffff814c79e6>] sysrq_handle_crash+0x16/0x20
[  473.672244]  RSP <ffff8803dc417e28>
[  473.672244] CR2: 0000000000000000
[    0.000000] Initializing cgroup subsys cpuset
[    0.000000] Initializing cgroup subsys cpu
[    0.000000] Initializing cgroup subsys cpuacct
[    0.000000] Linux version 4.2.0-22-generic (buildd@lcy01-22) (gcc version 5.2.1 20151010 (Ubuntu 5.2.1-22ubuntu2) ) #27-Ubuntu SMP Thu Dec 17 22:57:08 UTC 2015 (Ubuntu 4.2.0-22.27-generic 4.2.6)
[    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.2.0-22-generic root=UUID=9bd55602-81dd-4868-8cfc-b7d63f8f8d7e ro console=tty1 console=ttyS0 crashkernel=384M

...
[    3.021894] piix4_smbus 0000:00:01.3: SMBus base address uninitialized - upgrade BIOS or use force_addr=0xaddr
...
[  OK  ] Started memcached daemon.
         Starting LSB: Execute the kexec -e command to reboot system...
...
[  OK  ] Started LSB: Record successful boot for GRUB.
[  OK  ] Started LSB: automatic crash report generation.
[  OK  ] Started LSB: Execute the kexec -e command to reboot system.
...
[  OK  ] Started LSB: Load kernel image with kexec.
ondemand.service
rc-local.service
grub-common.service
         Stopping LSB: Start NTP daemon...
apport.service
kexec.service
...
lxc.service
[  OK  ] Started LXC Container Initialization and Autoboot Code.
         Starting Container hypervisor based on LXC - boot time check...
[   34.181647] kdump-tools[773]: Starting kdump-tools:  * loaded kdump kernel
kdump-tools.service
[  OK  ] Started Kernel crash dump capture service.
[  OK  ] Started Container hypervisor based on LXC - boot time check.

The system then comes all the way back online, with nothing in /var/crash besides .lock and kexec_cmd.

I have tried crashkernel=128M, crashkernel=256M, crashkernel=384M, 512M, 256@0, 256@16M, and so on.
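
For reference, one way to confirm that a given crashkernel= value actually took effect is to compare the kernel command line with what the running kernel reports in sysfs. This is only a sketch: it assumes the standard /proc/cmdline, /sys/kernel/kexec_crash_size and /sys/kernel/kexec_crash_loaded files are present, and the helper name crashkernel_arg is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the last crashkernel= value found in a
# kernel command line string (prints an empty line if there is none).
crashkernel_arg() {
    printf '%s\n' "$1" | tr ' ' '\n' | sed -n 's/^crashkernel=//p' | tail -n 1
}

# What was requested on the command line of the running kernel.
if [ -r /proc/cmdline ]; then
    echo "requested: $(crashkernel_arg "$(cat /proc/cmdline)")"
fi

# kexec_crash_size is reported in bytes; 0 means nothing was reserved.
if [ -r /sys/kernel/kexec_crash_size ]; then
    echo "reserved bytes: $(cat /sys/kernel/kexec_crash_size)"
fi

# kexec_crash_loaded reads 1 once a crash kernel has been loaded (kexec -p).
if [ -r /sys/kernel/kexec_crash_loaded ]; then
    echo "crash kernel loaded: $(cat /sys/kernel/kexec_crash_loaded)"
fi
```

If the reserved size reads 0 although crashkernel= is on the command line, the reservation itself failed, and changing kdump-tools settings would not be expected to help.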

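I even tried enabling SSH in /etc/default/grub.d/kexec-tools.cfg, targeting a machine that I have verified is reachable from this one, with SSH_KEY set to a key that exists, works, and has the proper permissions, but the remote machine shows no connection attempts at all.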

The output of kdump-config show looks correct:

DUMP_MODE:        kdump
USE_KDUMP:        1
KDUMP_SYSCTL:     kernel.panic_on_oops=1
KDUMP_COREDIR:    /var/crash
crashkernel addr: 0x2c000000
SSH:              [email protected]
SSH_KEY:          /root/.ssh/id_rsa
HOSTTAG:          ip
current state:    ready to kdump

kexec command:
  /sbin/kexec -p --command-line="BOOT_IMAGE=/boot/vmlinuz-4.2.0-22-generic root=UUID=9bd55602-81dd-4868-8cfc-b7d63f8f8d7e ro console=tty1 console=ttyS0 irqpoll maxcpus=1 nousb systemd.unit=kdump-tools.service" --initrd=/var/lib/kdump/initrd.img /var/lib/kdump/vmlinuz

However, when I manually trigger a crash via:

echo 1 > /proc/sys/kernel/sysrq
echo c > /proc/sysrq-trigger

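the system reboots, and no vmcore or .crash file is written to /var/crash or to my remote SSH host. The SSH host never sees a login attempt. I can see the trace output via ec2-get-console-output -r <instance>, and the system reboots immediately, as shown above.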
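
Since the console output is the only evidence available here, one thing worth looking for in it is whether any boot is actually the capture kernel. The sketch below classifies each "Command line:" entry in console output read from stdin. It assumes the capture kernel is started with the systemd.unit=kdump-tools argument that appears in the kexec command printed by kdump-config show; the function name classify_boots is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read console output on stdin and print "kdump" or
# "normal" for every kernel "Command line:" entry found in it.
# Assumption: a capture boot carries systemd.unit=kdump-tools on its
# command line, as in the kexec command shown by `kdump-config show`.
classify_boots() {
    grep 'Command line:' | while IFS= read -r line; do
        case "$line" in
            *systemd.unit=kdump-tools*) echo "kdump" ;;
            *) echo "normal" ;;
        esac
    done
}

# Usage (instance ID omitted):
#   ec2-get-console-output -r <instance> | classify_boots
```

If only "normal" entries show up after the oops, the capture kernel either never started or did not get far enough to print to this console.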

I have tried hard to debug this. Everything seems to be set up correctly, but no crash report is produced.

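Now, I am not sure whether this is related, but ifquery also crashes at boot, no .crash file is ever reported for it, and apport does not know that it crashed. I have yet to see apport actually create a .crash file on this system. Could this be what is wrong with my crash dumps? Can anyone offer any insight into debugging this?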

Answer 1

Answer 2

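Just a thought: try disabling some memory-hungry modules that are not needed for kdump collection. I have seen many high-performance network drivers cause OOM at work, and my home machine has a high-end graphics card. In both cases the drivers load a lot of memory in the kdump kernel and cause a memory shortage. After all, the memory reserved for kdump is only a small fraction of the RAM installed on the host, because it is reserved at boot and is unavailable to the main kernel afterwards.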

So, to find out which modules consume a lot of memory:

 $ lsmod | sort -nk2 -r | head
amdgpu               4116480  16
btrfs                1228800  2
kvm                   655360  0
nfsv4                 638976  2
drm                   487424  8 gpu_sched,drm_kms_helper,amdgpu,ttm
sunrpc                380928  9 nfsv4,auth_rpcgss,lockd,rpcsec_gss_krb5,nfs
aesni_intel           372736  0
fscache               368640  2 nfsv4,nfs
nfs                   299008  2 nfsv4
igb                   221184  0
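
The same ranking can be produced from /proc/modules, which avoids the lsmod header line and makes it easy to print sizes in KiB. This is a minimal sketch; top_modules is a hypothetical helper name, and the size column reflects the module image itself rather than memory the driver allocates at run time, so treat it as a rough hint only.

```shell
#!/bin/sh
# Hypothetical helper: print the N largest modules from a file in
# /proc/modules format (module name, size in bytes, ...), sizes in KiB.
#   $1 = file to read (default: /proc/modules)
#   $2 = number of entries to print (default: 10)
top_modules() {
    file=${1:-/proc/modules}
    count=${2:-10}
    sort -k2,2nr "$file" | head -n "$count" |
        awk '{ printf "%-20s %8d KiB\n", $1, $2 / 1024 }'
}

# Usage: top_modules            # ten largest modules on this machine
#        top_modules /proc/modules 5
```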

In my case amdgpu is at the top, but you may have any of the modules I have run into at work, such as ixgbe, i40e, mlx5_core, and so on.

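To disable these for the kdump kernel only, edit /etc/default/kdump-tools, uncomment KDUMP_CMDLINE_APPEND (perhaps copy the line first, then uncomment the copy), and add the drivers you want to blacklist. Some may be built into the kernel and some may live in the initrd, so to be safe add each driver as both $driver_name.blacklist=1 and rd.driver.blacklist=$driver_name, as shown here for amdgpu: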

[snip]
#KDUMP_CMDLINE_APPEND="reset_devices systemd.unit=kdump-tools-dump.service nr_cpus=1 irqpoll nousb ata_piix.prefer_ms_hyperv=0"
KDUMP_CMDLINE_APPEND="reset_devices systemd.unit=kdump-tools-dump.service nr_cpus=1 irqpoll nousb ata_piix.prefer_ms_hyperv=0 amdgpu.blacklist=1 rd.driver.blacklist=amdgpu"
[snip]
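
When several drivers need to be blacklisted, the pairs of arguments can be generated rather than typed by hand. A small sketch, assuming the two argument forms above are the ones you want; blacklist_args is a hypothetical helper, and its output still has to be pasted into KDUMP_CMDLINE_APPEND manually.

```shell
#!/bin/sh
# Hypothetical helper: for each driver name given as an argument, emit
# both blacklist forms used above, separated by single spaces.
blacklist_args() {
    out=""
    for drv in "$@"; do
        # ${out:+$out } adds a separating space only after the first entry.
        out="${out:+$out }${drv}.blacklist=1 rd.driver.blacklist=${drv}"
    done
    printf '%s\n' "$out"
}

# Usage: blacklist_args amdgpu ixgbe
```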

Then just restart kdump-tools and make sure the new configuration has been loaded:

$ sudo systemctl restart kdump-tools
$ kdump-config show
DUMP_MODE:        kdump
USE_KDUMP:        1
KDUMP_SYSCTL:     kernel.panic_on_oops=1
KDUMP_COREDIR:    /var/crash
crashkernel addr: 0x
   /var/lib/kdump/vmlinuz: symbolic link to /boot/vmlinuz-5.3.0-40-lowlatency
kdump initrd: 
   /var/lib/kdump/initrd.img: symbolic link to /var/lib/kdump/initrd.img-5.3.0-40-lowlatency
current state:    ready to kdump

kexec command:
  /sbin/kexec -p --command-line="BOOT_IMAGE=/@/boot/vmlinuz-5.3.0-40-lowlatency root=UUID=a745358b-a4e6-4a16-a347-5fa3d65e78a7 ro rootflags=subvol=@ quiet splash vt.handoff=1 reset_devices systemd.unit=kdump-tools-dump.service nr_cpus=1 irqpoll nousb ata_piix.prefer_ms_hyperv=0 amdgpu.blacklist=1 rd.driver.blacklist=amdgpu" --initrd=/var/lib/kdump/initrd.img /var/lib/kdump/vmlinuz
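
Rather than reading the long kexec command by eye, the check can be scripted. The sketch below reads kdump-config show output on stdin and succeeds only if the kexec command line carries both blacklist forms for a given driver; kexec_blacklists is a hypothetical name, and the matching is a plain substring test, not a full command-line parser.

```shell
#!/bin/sh
# Hypothetical helper: read `kdump-config show` output on stdin and return
# success only if the kexec --command-line contains both blacklist forms
# for the driver named in $1.
kexec_blacklists() {
    drv=$1
    line=$(grep -- '--command-line=' || true)
    case "$line" in
        *" ${drv}.blacklist=1"*) ;;
        *) return 1 ;;
    esac
    case "$line" in
        *"rd.driver.blacklist=${drv}"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage: kdump-config show | kexec_blacklists amdgpu && echo "blacklist loaded"
```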

Then retry the collection.

Cheers, T.
