Understanding and debugging frequent "hard LOCKUP on cpu"

My Ubuntu box hangs frequently (a few times per day), leaving messages like this in syslog and kern.log (sometimes truncated):

Sep  1 11:09:55 majestic-daemon kernel: [ 3506.843824] NMI watchdog: Watchdog detected hard LOCKUP on cpu 13
Sep  1 11:09:55 majestic-daemon kernel: [ 3506.843826] Modules linked in: nls_utf8 btrfs xor raid6_pq ufs qnx4 hfsplus hfs minix ntfs msdos jfs xfs libcrc32c pci_stub vboxpci(OE) vboxnetadp(OE) vboxnetflt(OE) vboxdrv(OE) snd_hda_codec_hdmi nls_iso8859_1 eeepc_wmi asus_wmi sparse_keymap intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul snd_hda_codec_realtek snd_hda_codec_generic aesni_intel aes_x86_64 lrw gf128mul glue_helper input_leds ablk_helper cryptd serio_raw snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep sb_edac edac_core snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq lpc_ich snd_seq_device snd_timer snd mei_me mei soundcore shpchp 8250_fintek mac_hid parport_pc ppdev lp parport autofs4 hid_generic usbhid hid nouveau mxm_wmi video i2c_algo_bit ttm drm_kms_helper psmouse syscopyarea sysfillrect sysimgblt fb_sys_fops e1000e drm ahci libahci ptp nvme pps_core fjes wmi
Sep  1 11:09:55 majestic-daemon kernel: [ 3506.843881] CPU: 13 PID: 0 Comm: swapper/13 Tainted: G           OE   4.4.0-34-generic #53-Ubuntu
Sep  1 11:09:55 majestic-daemon kernel: [ 3506.843883] Hardware name: ASUS All Series/X99-A/USB 3.1, BIOS 3005 04/11/2016
Sep  1 11:09:55 majestic-daemon kernel: [ 3506.843884] task: ffff8807fb493700 ti: ffff8807fb4a8000 task.ti: ffff8807fb4a8000
Sep  1 11:09:55 majestic-daemon kernel: [ 3506.843885] RIP: 0010:[<ffffffff816c3f61>]  [<ffffffff816c3f61>] cpuidle_enter_state+0x111/0x2b0
Sep  1 11:09:55 majestic-daemon kernel: [ 3506.843890] RSP: 0018:ffff8807fb4abe70  EFLAGS: 00000246
Sep  1 11:09:55 majestic-daemon kernel: [ 3506.843891] RAX: 0000000000000000 RBX: 0000000000000004 RCX: 0000000000000018
Sep  1 11:09:55 majestic-daemon kernel: [ 3506.843892] RDX: 00195eb06e5732b1 RSI: 0000000000500101 RDI: 0000000000000000
Sep  1 11:09:55 majestic-daemon kernel: [ 3506.843892] RBP: ffff8807fb4abea8 R08: 000000000032b396 R09: 0000000000000018
Sep  1 11:09:55 majestic-daemon kernel: [ 3506.843893] R10: ffff8807fb4abe20 R11: 000000000000bf7e R12: 0000000000000004
Sep  1 11:09:55 majestic-daemon kernel: [ 3506.843894] R13: ffffe8ffffd40a00 R14: 0000032c4dc034f3 R15: ffffffff81eb1f38
Sep  1 11:09:55 majestic-daemon kernel: [ 3506.843895] FS:  0000000000000000(0000) GS:ffff8807ff540000(0000) knlGS:0000000000000000
Sep  1 11:09:55 majestic-daemon kernel: [ 3506.843895] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep  1 11:09:55 majestic-daemon kernel: [ 3506.843896] CR2: 00001496c022a008 CR3: 0000000002e0a000 CR4: 00000000003426e0
Sep  1 11:09:55 majestic-daemon kernel: [ 3506.843897] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Sep  1 11:09:55 majestic-daemon kernel: [ 3506.843898] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Sep  1 11:09:55 majestic-daemon kernel: [ 3506.843898] Stack:
Sep  1 11:09:55 majestic-daemon kernel: [ 3506.843899]  00000000ff553b00 0000032c4e87d60d ffffffff81f36140 ffff8807fb4ac000
Sep  1 11:09:55 majestic-daemon kernel: [ 3506.843900]  ffffe8ffffd40a00 ffffffff81eb1da0 ffff8807fb4a8000 ffff8807fb4abeb8
Sep  1 11:09:55 majestic-daemon kernel: [ 3506.843901]  ffffffff816c4137 ffff8807fb4abed0 ffffffff810c3fe2 ffffffff816c4113
Sep  1 11:09:55 majestic-daemon kernel: [ 3506.843903] Call Trace:
Sep  1 11:09:55 majestic-daemon kernel: [ 3506.843905]  [<ffffffff816c4137>] cpuidle_enter+0x17/0x20
Sep  1 11:09:55 majestic-daemon kernel: [ 3506.843908]  [<ffffffff810c3fe2>] call_cpuidle+0x32/0x60
Sep  1 11:09:55 majestic-daemon kernel: [ 3506.843910]  [<ffffffff816c4113>] ? cpuidle_select+0x13/0x20
Sep  1 11:09:55 majestic-daemon kernel: [ 3506.843911]  [<ffffffff810c42a0>] cpu_startup_entry+0x290/0x350
Sep  1 11:09:55 majestic-daemon kernel: [ 3506.843914]  [<ffffffff810516e4>] start_secondary+0x154/0x190
Sep  1 11:09:55 majestic-daemon kernel: [ 3506.843915] Code: 48 41 89 c4 e8 01 1a a3 ff 48 89 45 d0 0f 1f 44 00 00 31 ff e8 41 ff 9f ff 8b 45 cc 85 c0 0f 85 31 01 00 00 fb 66 0f 1f 44 00 00 <48> 8b 5d d0 48 ba cf f7 53 e3 a5 9b c4 20 4c 29 f3 48 89 d8 48 
Sep  1 11:09:55 majestic-daemon kernel: [ 3550.682295] INFO: rcu_sched detected stalls on CPUs/tasks:
Sep  1 11:09:55 majestic-daemon kernel: [ 3550.682301]  13-...: (1 GPs behind) idle=54b/1/0 softirq=125133/125133 fqs=13613 
Sep  1 11:09:55 majestic-daemon kernel: [ 3550.682302]  (detected by 9, t=15002 jiffies, g=108062, c=108061, q=1917)
Sep  1 11:09:55 majestic-daemon kernel: [ 3550.682304] Task dump for CPU 13:
Sep  1 11:09:55 majestic-daemon kernel: [ 3550.682305] swapper/13      R  running task        0     0      1 0x00000008
Sep  1 11:09:55 majestic-daemon kernel: [ 3550.682307]  ffff8807fb4abe70 0000000000000018 00000000ff553b00 0000032c4e87d60d
Sep  1 11:09:55 majestic-daemon kernel: [ 3550.682308]  ffffffff81f36140 ffff8807fb4ac000 ffffe8ffffd40a00 ffffffff81eb1da0
Sep  1 11:09:55 majestic-daemon kernel: [ 3550.682309]  ffff8807fb4a8000 ffff8807fb4abeb8 ffffffff816c4137 ffff8807fb4abed0
Sep  1 11:09:55 majestic-daemon kernel: [ 3550.682311] Call Trace:
Sep  1 11:09:55 majestic-daemon kernel: [ 3550.682317]  [<ffffffff816c4137>] ? cpuidle_enter+0x17/0x20
Sep  1 11:09:55 majestic-daemon kernel: [ 3550.682320]  [<ffffffff810c3fe2>] ? call_cpuidle+0x32/0x60
Sep  1 11:09:55 majestic-daemon kernel: [ 3550.682322]  [<ffffffff816c4113>] ? cpuidle_select+0x13/0x20
Sep  1 11:09:55 majestic-daemon kernel: [ 3550.682323]  [<ffffffff810c42a0>] ? cpu_startup_entry+0x290/0x350
Sep  1 11:09:55 majestic-daemon kernel: [ 3550.682326]  [<ffffffff810516e4>] ? start_secondary+0x154/0x190
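
Since the hangs recur several times a day, it may help to compare reports across hangs before choosing a remedy. The following is a minimal sketch, assuming the exact log format shown above (kernel 4.4 style `RIP:` lines); the function name `summarize_lockups` and both regular expressions are illustrative, not part of any tool. It pulls out the CPU number named by the NMI watchdog and the symbol the instruction pointer was in.

```python
import re

# Both patterns are assumptions based on the log excerpt above;
# other kernel versions format these lines differently.
LOCKUP_RE = re.compile(r"hard LOCKUP on cpu (\d+)")
RIP_RE = re.compile(r"RIP: .*\]\s+(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+")


def summarize_lockups(lines):
    """Return one (cpu, rip_symbol) pair per hard-lockup report.

    rip_symbol is None when the report was truncated before its RIP line.
    """
    reports = []
    pending_cpu = None  # CPU of a report whose RIP line has not been seen yet
    for line in lines:
        lockup = LOCKUP_RE.search(line)
        if lockup:
            if pending_cpu is not None:
                # Previous report ended without a RIP line (truncated).
                reports.append((pending_cpu, None))
            pending_cpu = int(lockup.group(1))
            continue
        rip = RIP_RE.search(line)
        if rip and pending_cpu is not None:
            reports.append((pending_cpu, rip.group(1)))
            pending_cpu = None
    if pending_cpu is not None:
        reports.append((pending_cpu, None))
    return reports
```

Feeding it the lines of kern.log (for example `summarize_lockups(open("/var/log/kern.log", errors="replace"))`, assuming the default Ubuntu log location) gives a list that shows whether the same CPU and the same function keep appearing or whether the reports vary from hang to hang.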

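The Internet is full of advice on how to fix these, including installing drivers, uninstalling drivers, changing kernel settings, changing BIOS settings, and plenty of other voodoo. I have yet to see any explanation of how to choose a particular remedy in any particular situation.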

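How should I go about debugging a "hard LOCKUP" like this? What does the message output mean? How should I act on the information it contains to start working toward a fix?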
