What causes frequent "BUG: soft lockup" warnings on Ubuntu?

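Edit: It turns out this was not the only occurrence of the error. It happens on my computer quite often, and sometimes it involves a different, seemingly random process, for example: chromium-browser, teamviewer, mongod. I started noticing it because it crashed the MongoDB database a few days ago. So far it has happened at least three times. I had no such problems before, when I was running Ubuntu 14.04 LTS. My system is a Dell Inspiron 3650 with a stock CPU, not overclocked.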

I have Ubuntu 16.04 with MongoDB (3.4) installed. A few hours ago it suddenly spiked and started consuming 100% of the CPU.

Here is the output of top:

top - 21:40:05 up 2 days,  8:30,  1 user,  load average: 17,08, 17,03, 17,01
Tasks: 174 total,  15 running, 153 sleeping,   0 stopped,   6 zombie
%Cpu(s):  0,0 us, 66,8 sy,  0,0 ni, 33,2 id,  0,0 wa,  0,0 hi,  0,0 si,  0,0 st
KiB Mem :  8117148 total,  5307248 free,   981712 used,  1828188 buff/cache
KiB Swap:   520188 total,   520188 free,        0 used.  6427752 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                            
 1160 mongodb   20   0       0      0      0 Z  99,7  0,0 627:44.03 mongod                             
14214 root      20   0   26176   1356   1168 R  99,7  0,0 147:03.56 systemctl                          
 3636 root      20   0  232068  37388  28740 S   0,3  0,5   1:04.03 Xorg   

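I tried to kill the process without success; not even kill -9 <MONGOD PID> would kill it. I cannot reboot the system either; it simply does not respond. Here is the result of the sudo service mongod stop command: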

Failed to retrieve unit: Connection timed out
Failed to stop mongod.service: Connection timed out
See system logs and 'systemctl status mongod.service' for details.
Failed to get load state of mongod.service: Connection timed out

I can still ssh into the server, but I cannot stop the mongod process. Can anyone help me?

Additional information

The command pstree -p -s 1160 gives me:

systemd(1)───mongod(1160)─┬─{ftdc}(1247)
                          ├─{mongod}(1239)
                          └─{signalP.gThread}(1214)

The result of tailf -100 /var/log/syslog is more interesting. It shows a repeating message; here is one instance:

Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505244] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [ftdc:1247]
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505245] Modules linked in: rfcomm xt_multiport iptable_filter ip_tables x_tables rtsx_usb_ms bnep memstick binfmt_misc snd_hda_codec_hdmi intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel arc4 dcdbas dell_smm_hwmon kvm snd_hda_codec_realtek irqbypass snd_hda_codec_generic crct10dif_pclmul rtl8723be crc32_pclmul ghash_clmulni_intel snd_hda_intel aesni_intel snd_hda_codec btcoexist rtl8723_common aes_x86_64 snd_hda_core lrw joydev snd_hwdep glue_helper rtl_pci input_leds rtlwifi snd_pcm ablk_helper snd_seq_midi cryptd mac80211 snd_seq_midi_event snd_rawmidi intel_cstate btusb intel_rapl_perf btrtl snd_seq cfg80211 snd_seq_device snd_timer snd serio_raw soundcore mei_me mei shpchp hci_uart btbcm btqca btintel bluetooth mac_hid intel_lpss_acpi intel_lpss acpi_als kfifo_buf industrialio acpi_pad parport_pc ppdev lp parport autofs4 btrfs xor raid6_pq dm_mirror dm_region_hash dm_log rtsx_usb_sdmmc rtsx_usb hid_generic usbhid nouveau mxm_wmi i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt r8169 psmouse fb_sys_fops mii drm ahci libahci wmi pinctrl_sunrisepoint video pinctrl_intel i2c_hid hid fjes
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505277] CPU: 1 PID: 1247 Comm: ftdc Tainted: G        W    L  4.8.0-53-generic #56~16.04.1-Ubuntu
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505277] Hardware name: Dell Inc. Inspiron 3650/0C2XKD, BIOS 2.0.1 09/03/2015
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505278] task: ffffa024db476ac0 task.stack: ffffa024d83a4000
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505278] RIP: 0010:[<ffffffff8b50b336>]  [<ffffffff8b50b336>] smp_call_function_many+0x1f6/0x250
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505281] RSP: 0018:ffffa024d83a7b38  EFLAGS: 00000202
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505281] RAX: 0000000000000003 RBX: 0000000000000200 RCX: 0000000000000003
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505282] RDX: ffffa024e659d380 RSI: 0000000000000200 RDI: ffffa024e649a288
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505282] RBP: ffffa024d83a7b70 R08: 0000000000000000 R09: 000000000000000d
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505282] R10: 0000000000000008 R11: ffffa024e649a288 R12: ffffa024e649a288
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505283] R13: ffffa024e649a280 R14: ffffffff8b472400 R15: ffffa024d83a7b80
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505284] FS:  00007f871ddd2700(0000) GS:ffffa024e6480000(0000) knlGS:0000000000000000
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505284] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505284] CR2: 00007f95bc40323f CR3: 0000000258e11000 CR4: 00000000003406e0
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505285] Stack:
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505285]  000000000001a240 0100000000000001 ffffa024d3ebf800 ffffffffffffffff
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505287]  ffffa024d3ebfad8 0000000000000000 ffffffffffffffff ffffa024d83a7bb8
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505288]  ffffffff8b472865 ffffa024d3ebf800 0000000000000000 ffffffffffffffff
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505289] Call Trace:
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505291]  [<ffffffff8b472865>] native_flush_tlb_others+0x65/0x130
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505292]  [<ffffffff8b472a43>] flush_tlb_mm_range+0x63/0x150
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505294]  [<ffffffff8b5d62b4>] tlb_flush_mmu_tlbonly+0x64/0xd0
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505295]  [<ffffffff8b5d75b2>] tlb_flush_mmu+0x12/0x20
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505297]  [<ffffffff8b61595d>] zap_huge_pmd+0x20d/0x3b0
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505298]  [<ffffffff8b5d9168>] unmap_page_range+0x928/0x940
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505299]  [<ffffffff8b47fc92>] ? mmput+0x12/0x130
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505301]  [<ffffffff8b5d91fd>] unmap_single_vma+0x7d/0xe0
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505302]  [<ffffffff8b5d9668>] zap_page_range+0xc8/0x140
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505304]  [<ffffffff8b5ef47e>] SyS_madvise+0x43e/0x930
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505305]  [<ffffffff8bc9a876>] entry_SYSCALL_64_fastpath+0x1e/0xa8
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505306] Code: d2 e8 3f 94 33 00 3b 05 ed 3a e5 00 89 c1 0f 8d 99 fe ff ff 48 98 49 8b 55 00 48 03 14 c5 60 c4 35 8c 8b 42 18 a8 01 74 09 f3 90 <8b> 42 18 a8 01 75 f7 eb bf 0f b6 4d d0 4c 89 fa 4c 89 f6 44 89

After running echo l > /proc/sysrq-trigger, here is the output for CPU 3:

[207345.496706] NMI backtrace for cpu 3
[207345.496707] CPU: 3 PID: 0 Comm: swapper/3 Tainted: G        W    L  4.8.0-53-generic #56~16.04.1-Ubuntu
[207345.496707] Hardware name: Dell Inc. Inspiron 3650/0C2XKD, BIOS 2.0.1 09/03/2015
[207345.496708] task: ffffa024dc428000 task.stack: ffffa024dc460000
[207345.496708] RIP: 0010:[<ffffffff8b4cf41a>]  [<ffffffff8b4cf41a>] native_queued_spin_lock_slowpath+0x17a/0x1a0
[207345.496708] RSP: 0018:ffffa024e6583b30  EFLAGS: 00000002
[207345.496709] RAX: 0000000000000101 RBX: 0000000000000092 RCX: 0000000000000001
[207345.496709] RDX: 0000000000000101 RSI: 0000000000000001 RDI: ffffa024d4111d08
[207345.496709] RBP: ffffa024e6583b30 R08: 0000000000000101 R09: 000000000000002a
[207345.496710] R10: 00000000ffffffff R11: 0000000000000000 R12: ffffa024d4111d08
[207345.496710] R13: ffffa024dc583a00 R14: ffffa024d4111c00 R15: ffffa024d4111c00
[207345.496711] FS:  0000000000000000(0000) GS:ffffa024e6580000(0000) knlGS:0000000000000000
[207345.496711] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[207345.496711] CR2: 00001372a3be0000 CR3: 0000000258e11000 CR4: 00000000003406e0
[207345.496712] Stack:
[207345.496712]  ffffa024e6583b48 ffffffff8bc9a7e7 000000000000002a ffffa024e6583b98
[207345.496712]  ffffffffc02f9dc3 ffffa024dc583580 0000000000000010 ffffa024e6583b98
[207345.496713]  ffffa024d4111c00 000000000000002a ffffa024d4110c00 ffffa024d4111c00
[207345.496713] Call Trace:
[207345.496713]  <IRQ> ^Ad [<ffffffff8bc9a7e7>] _raw_spin_lock_irqsave+0x37/0x3f
[207345.496714]  [<ffffffffc02f9dc3>] nvkm_fantog_update+0x43/0x110 [nouveau]
[207345.496714]  [<ffffffffc02f9ee8>] nvkm_fantog_set+0x38/0x40 [nouveau]
[207345.496714]  [<ffffffffc02f936f>] nvkm_fan_update+0xbf/0x200 [nouveau]
[207345.496715]  [<ffffffffc02f94e9>] nvkm_therm_fan_set+0x19/0x20 [nouveau]
[207345.496715]  [<ffffffffc02f8beb>] nvkm_therm_update+0x9b/0x2e0 [nouveau]
[207345.496715]  [<ffffffffc02f8e47>] nvkm_therm_alarm+0x17/0x20 [nouveau]
[207345.496716]  [<ffffffffc02fc0d0>] nvkm_timer_alarm_trigger+0x100/0x150 [nouveau]
[207345.496716]  [<ffffffffc02fc1ef>] nvkm_timer_alarm+0x7f/0xd0 [nouveau]
[207345.496716]  [<ffffffffc02f9e85>] nvkm_fantog_update+0x105/0x110 [nouveau]
[207345.496717]  [<ffffffffc02f9eaa>] nvkm_fantog_alarm+0x1a/0x20 [nouveau]
[207345.496717]  [<ffffffffc02fc0d0>] nvkm_timer_alarm_trigger+0x100/0x150 [nouveau]
[207345.496718]  [<ffffffffc02fc4f2>] nv04_timer_intr+0x62/0xb0 [nouveau]
[207345.496718]  [<ffffffffc02fbf77>] nvkm_timer_intr+0x17/0x20 [nouveau]
[207345.496718]  [<ffffffffc02aa7c7>] nvkm_subdev_intr+0x17/0x20 [nouveau]
[207345.496719]  [<ffffffffc02eea15>] nvkm_mc_intr+0xe5/0x190 [nouveau]
[207345.496719]  [<ffffffffc02f35f3>] nvkm_pci_intr+0x53/0x80 [nouveau]
[207345.496719]  [<ffffffff8b4e0011>] __handle_irq_event_percpu+0x81/0x1a0
[207345.496720]  [<ffffffff8b4e0162>] handle_irq_event_percpu+0x32/0x80
[207345.496720]  [<ffffffff8b4e01ee>] handle_irq_event+0x3e/0x60
[207345.496720]  [<ffffffff8b4e3bf0>] handle_edge_irq+0x80/0x150
[207345.496721]  [<ffffffff8b4302cd>] handle_irq+0x1d/0x30
[207345.496721]  [<ffffffff8bc9d0db>] do_IRQ+0x4b/0xd0
[207345.496721]  [<ffffffff8bc9b1c2>] common_interrupt+0x82/0x82
[207345.496722]  <EOI> ^Ad [<ffffffff8bb1934b>] ? cpuidle_enter_state+0x12b/0x2d0
[207345.496722]  [<ffffffff8bb19527>] cpuidle_enter+0x17/0x20
[207345.496722]  [<ffffffff8b4c7a0a>] call_cpuidle+0x2a/0x50
[207345.496723]  [<ffffffff8b4c7dee>] cpu_startup_entry+0x29e/0x350
[207345.496723]  [<ffffffff8b4518b1>] start_secondary+0x151/0x190
[207345.496724] Code: 41 39 c0 74 e6 4d 85 c9 c6 07 01 74 30 41 c7 41 08 01 00 00 00 e9 51 ff ff ff 83 fa 01 0f 84 af fe ff ff 8b 07 84 c0 74 08 f3 90 <8b> 07 84 c0 75 f8 b8 01 00 00 00 66 89 07 5d c3 f3 90 4c 8b 09

This is for CPU 0:

[207345.495724] NMI backtrace for cpu 0
[207345.495725] CPU: 0 PID: 14214 Comm: systemctl Tainted: G        W    L  4.8.0-53-generic #56~16.04.1-Ubuntu
[207345.495725] Hardware name: Dell Inc. Inspiron 3650/0C2XKD, BIOS 2.0.1 09/03/2015
[207345.495726] task: ffffa0241a56db80 task.stack: ffffa0241a618000
[207345.495726] RIP: 0010:[<ffffffff8b50b336>]  [<ffffffff8b50b336>] smp_call_function_many+0x1f6/0x250
[207345.495726] RSP: 0018:ffffa0241a61bce0  EFLAGS: 00000202
[207345.495727] RAX: 0000000000000003 RBX: 0000000000000200 RCX: 0000000000000003
[207345.495727] RDX: ffffa024e659cc68 RSI: 0000000000000200 RDI: ffffa024e641a288
[207345.495728] RBP: ffffa0241a61bd18 R08: 0000000000000000 R09: 000000000000000e
[207345.495728] R10: 0000000000000008 R11: ffffa024e641a288 R12: ffffa024e641a288
[207345.495728] R13: ffffa024e641a280 R14: ffffffffc09ca790 R15: 0000000000000000
[207345.495729] FS:  00007fe04de0f880(0000) GS:ffffa024e6400000(0000) knlGS:0000000000000000
[207345.495729] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[207345.495729] CR2: 000055f9604a6040 CR3: 000000019a651000 CR4: 00000000003406f0
[207345.495730] Stack:
[207345.495730]  000000000001a240 0100000000000001 00000000fffffffb ffffffffc09ca790
[207345.495730]  0000000000000000 0000000000000000 0000000000000000 ffffa0241a61bd40
[207345.495731]  ffffffff8b50b46d 00000000fffffffb ffffffff8c267150 0000000000000001
[207345.495731] Call Trace:
[207345.495731]  [<ffffffffc09ca790>] ? kvm_vcpu_block+0x300/0x300 [kvm]
[207345.495732]  [<ffffffff8b50b46d>] on_each_cpu+0x2d/0x60
[207345.495732]  [<ffffffffc09c941f>] kvm_reboot+0x2f/0x40 [kvm]
[207345.495732]  [<ffffffff8b4a4eba>] notifier_call_chain+0x4a/0x70
[207345.495733]  [<ffffffff8b4a51f7>] __blocking_notifier_call_chain+0x47/0x60
[207345.495733]  [<ffffffff8b4a5226>] blocking_notifier_call_chain+0x16/0x20
[207345.495734]  [<ffffffff8b4a64bd>] kernel_restart_prepare+0x1d/0x40
[207345.495734]  [<ffffffff8b4a6582>] kernel_restart+0x12/0x60
[207345.495734]  [<ffffffff8b4a6902>] SYSC_reboot+0x202/0x220
[207345.495735]  [<ffffffff8b63341c>] ? vfs_writev+0x3c/0x50
[207345.495735]  [<ffffffff8b633491>] ? do_writev+0x61/0xf0
[207345.495735]  [<ffffffff8b4a696e>] SyS_reboot+0xe/0x10
[207345.495736]  [<ffffffff8bc9a876>] entry_SYSCALL_64_fastpath+0x1e/0xa8
[207345.495736] Code: d2 e8 3f 94 33 00 3b 05 ed 3a e5 00 89 c1 0f 8d 99 fe ff ff 48 98 49 8b 55 00 48 03 14 c5 60 c4 35 8c 8b 42 18 a8 01 74 09 f3 90 <8b> 42 18 a8 01 75 f7 eb bf 0f b6 4d d0 4c 89 fa 4c 89 f6 44 89

For CPU 1:

[207345.495711] NMI backtrace for cpu 1
[207345.495712] CPU: 1 PID: 1247 Comm: ftdc Tainted: G        W    L  4.8.0-53-generic #56~16.04.1-Ubuntu
[207345.495712] Hardware name: Dell Inc. Inspiron 3650/0C2XKD, BIOS 2.0.1 09/03/2015
[207345.495713] task: ffffa024db476ac0 task.stack: ffffa024d83a4000
[207345.495713] RIP: 0010:[<ffffffff8b50b336>]  [<ffffffff8b50b336>] smp_call_function_many+0x1f6/0x250
[207345.495714] RSP: 0018:ffffa024d83a7b38  EFLAGS: 00000202
[207345.495714] RAX: 0000000000000003 RBX: 0000000000000200 RCX: 0000000000000003
[207345.495714] RDX: ffffa024e659d380 RSI: 0000000000000200 RDI: ffffa024e649a288
[207345.495715] RBP: ffffa024d83a7b70 R08: 0000000000000000 R09: 000000000000000d
[207345.495715] R10: 0000000000000008 R11: ffffa024e649a288 R12: ffffa024e649a288
[207345.495716] R13: ffffa024e649a280 R14: ffffffff8b472400 R15: ffffa024d83a7b80
[207345.495716] FS:  00007f871ddd2700(0000) GS:ffffa024e6480000(0000) knlGS:0000000000000000
[207345.495716] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[207345.495717] CR2: 00007f95bc40323f CR3: 0000000258e11000 CR4: 00000000003406e0
[207345.495717] Stack:
[207345.495718]  000000000001a240 0100000000000001 ffffa024d3ebf800 ffffffffffffffff
[207345.495718]  ffffa024d3ebfad8 0000000000000000 ffffffffffffffff ffffa024d83a7bb8
[207345.495718]  ffffffff8b472865 ffffa024d3ebf800 0000000000000000 ffffffffffffffff
[207345.495719] Call Trace:
[207345.495719]  [<ffffffff8b472865>] native_flush_tlb_others+0x65/0x130
[207345.495720]  [<ffffffff8b472a43>] flush_tlb_mm_range+0x63/0x150
[207345.495720]  [<ffffffff8b5d62b4>] tlb_flush_mmu_tlbonly+0x64/0xd0
[207345.495720]  [<ffffffff8b5d75b2>] tlb_flush_mmu+0x12/0x20
[207345.495721]  [<ffffffff8b61595d>] zap_huge_pmd+0x20d/0x3b0
[207345.495721]  [<ffffffff8b5d9168>] unmap_page_range+0x928/0x940
[207345.495721]  [<ffffffff8b47fc92>] ? mmput+0x12/0x130
[207345.495722]  [<ffffffff8b5d91fd>] unmap_single_vma+0x7d/0xe0
[207345.495722]  [<ffffffff8b5d9668>] zap_page_range+0xc8/0x140
[207345.495723]  [<ffffffff8b5ef47e>] SyS_madvise+0x43e/0x930
[207345.495723]  [<ffffffff8bc9a876>] entry_SYSCALL_64_fastpath+0x1e/0xa8
[207345.495724] Code: d2 e8 3f 94 33 00 3b 05 ed 3a e5 00 89 c1 0f 8d 99 fe ff ff 48 98 49 8b 55 00 48 03 14 c5 60 c4 35 8c 8b 42 18 a8 01 74 09 f3 90 <8b> 42 18 a8 01 75 f7 eb bf 0f b6 4d d0 4c 89 fa 4c 89 f6 44 89

And finally CPU 2:

[207330.487609] 4c 89 fa 4c 89 f6 44 89 
[207345.495645] sysrq: SysRq : Show backtrace of all active CPUs
[207345.495648] Sending NMI to all CPUs:
[207345.495699] NMI backtrace for cpu 2
[207345.495699] CPU: 2 PID: 15699 Comm: bash Tainted: G        W    L  4.8.0-53-generic #56~16.04.1-Ubuntu
[207345.495699] Hardware name: Dell Inc. Inspiron 3650/0C2XKD, BIOS 2.0.1 09/03/2015
[207345.495700] task: ffffa02409d30f40 task.stack: ffffa02409dfc000
[207345.495700] RIP: 0010:[<ffffffff8b83c3b0>]  [<ffffffff8b83c3b0>] delay_tsc+0x0/0x60
[207345.495701] RSP: 0018:ffffa02409dffe08  EFLAGS: 00000a07
[207345.495701] RAX: 000000007c3cc000 RBX: 0000000000002710 RCX: 00000000014b0e00
[207345.495702] RDX: 0000000000290d14 RSI: 0000000000000200 RDI: 0000000000290d15
[207345.495702] RBP: ffffa02409dffe10 R08: 0000000000000000 R09: 0000000000000006
[207345.495702] R10: 0000000000000001 R11: 0000000000011bf4 R12: 0000000000000004
[207345.495703] R13: 0000000000000000 R14: ffffffff8c2c1fe0 R15: 0000000000000000
[207345.495703] FS:  00007ff3a9e23700(0000) GS:ffffa024e6500000(0000) knlGS:0000000000000000
[207345.495704] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[207345.495704] CR2: 00000000009a5008 CR3: 0000000189dae000 CR4: 00000000003406e0
[207345.495704] Stack:
[207345.495705]  ffffffff8b83c32b ffffa02409dffe28 ffffffff8b833141 000000000000006c
[207345.495705]  ffffa02409dffe38 ffffffff8b456019 ffffa02409dffe48 ffffffff8b93e6e3
[207345.495706]  ffffa02409dffe78 ffffffff8b93ed9a 0000000000000002 fffffffffffffffb
[207345.495706] Call Trace:
[207345.495706]  [<ffffffff8b83c32b>] ? __const_udelay+0x2b/0x30
[207345.495707]  [<ffffffff8b833141>] nmi_trigger_all_cpu_backtrace+0xc1/0x150
[207345.495707]  [<ffffffff8b456019>] arch_trigger_all_cpu_backtrace+0x19/0x20
[207345.495707]  [<ffffffff8b93e6e3>] sysrq_handle_showallcpus+0x13/0x20
[207345.495708]  [<ffffffff8b93ed9a>] __handle_sysrq+0xea/0x140
[207345.495708]  [<ffffffff8b93f21f>] write_sysrq_trigger+0x2f/0x40
[207345.495709]  [<ffffffff8b6a6872>] proc_reg_write+0x42/0x70
[207345.495709]  [<ffffffff8b632748>] __vfs_write+0x18/0x40
[207345.495709]  [<ffffffff8b632e98>] vfs_write+0xb8/0x1b0
[207345.495710]  [<ffffffff8b6342f5>] SyS_write+0x55/0xc0
[207345.495710]  [<ffffffff8bc9a876>] entry_SYSCALL_64_fastpath+0x1e/0xa8
[207345.495711] Code: 12 48 c1 e2 06 48 89 e5 48 c1 e0 02 48 29 ca f7 e2 48 8d 7a 01 ff 15 b8 59 a7 00 5d c3 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00 00 <0f> 1f 44 00 00 55 48 89 e5 65 44 8b 05 27 de 7c 74 0f ae e8 0f 

Here is what cat /proc/1160/task/1247/stat gives me:

1247 (ftdc) R 1 1160 1160 0 -1 4194368 3495 0 0 0 33464 4158293 0 0 20 0 4 0 645 1763782656 173550 18446744073709551615 94481603162112 94481648347376 140722953733664 140218298338104 140218408507335 256 8405507 6145 1260 0 0 0 -1 1 0 0 1 0 0 94481648352704 94481650153520 94481669767168 140722953735785 140722953735827 140722953735827 140722953736168 0
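As a side note for readers, the stat line above can be decoded with a small script. This is only a minimal sketch: it assumes the field order documented in proc(5) and a clock tick rate of 100 per second (an assumption; the real value on a given machine comes from os.sysconf("SC_CLK_TCK")). The function name parse_stat_line is made up for illustration.

```python
# Sample line copied from the output above (truncated after the fields we use).
SAMPLE = ("1247 (ftdc) R 1 1160 1160 0 -1 4194368 3495 0 0 0 "
          "33464 4158293 0 0 20 0 4 0 645 1763782656 173550")


def parse_stat_line(line, ticks_per_second=100):
    """Decode a few fields of a /proc/<pid>/task/<tid>/stat line.

    ticks_per_second=100 is an assumed default, not read from the system.
    """
    # The command name sits in parentheses and may itself contain spaces
    # or ')', so split on the first '(' and the last ')'.
    head, _, rest = line.partition("(")
    comm, _, tail = rest.rpartition(")")
    fields = tail.split()
    # fields[0] is field 3 (state) in proc(5) numbering, so field N is
    # found at index N - 3.
    return {
        "tid": int(head),
        "comm": comm,
        "state": fields[0],
        "utime_s": int(fields[11]) / ticks_per_second,  # field 14
        "stime_s": int(fields[12]) / ticks_per_second,  # field 15
        "num_threads": int(fields[17]),                 # field 20
    }


if __name__ == "__main__":
    print(parse_stat_line(SAMPLE))
```

Under the 100 ticks per second assumption, the sample decodes to state R with about 335 seconds of user time against about 41,583 seconds (roughly 11.5 hours) of kernel time, which fits a thread spinning inside a system call.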

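Answer 1

What you have is a multi-threaded application in which one thread appears to have hit a kernel bug.

Some analysis of the bug

You tried to shut down the mongod process with ID 1160. The main thread, with ID 1160, is in the zombie state, waiting for the other threads in the process to die.

The thread with ID 1247, named ftdc, hit a kernel bug while executing the madvise system call and ended up in an infinite loop.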

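The kernel has a watchdog that detects stuck threads and logs a stack trace to the kernel log. The stack trace includes the name of the thread. Because the thread name and the process name differ in this case, the connection between the two is not immediately obvious from the stack trace.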
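One way to make that connection is to take the thread ID from the watchdog message and look up its thread group ID (the PID of the owning process) in /proc/<tid>/status. The following is a rough sketch, not a polished tool; the names LOCKUP_RE, parse_soft_lockup and owning_process are invented for this example, and the regular expression is written against the message format shown in the question.

```python
import re

# Matches e.g. "BUG: soft lockup - CPU#1 stuck for 22s! [ftdc:1247]"
LOCKUP_RE = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>.+):(?P<tid>\d+)\]"
)


def parse_soft_lockup(line):
    """Return CPU, duration, thread name and thread ID, or None if no match."""
    m = LOCKUP_RE.search(line)
    if m is None:
        return None
    return {
        "cpu": int(m["cpu"]),
        "seconds": int(m["secs"]),
        "comm": m["comm"],
        "tid": int(m["tid"]),
    }


def owning_process(tid, proc_root="/proc"):
    """Return the Tgid (process ID) that a thread ID belongs to, or None."""
    try:
        with open(f"{proc_root}/{tid}/status") as status:
            for line in status:
                if line.startswith("Tgid:"):
                    return int(line.split()[1])
    except OSError:
        # The thread no longer exists or /proc is not readable.
        return None
    return None
```

On the machine from the question, owning_process(1247) would be expected to return 1160, matching the pstree output.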

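Most likely the thread was already stuck in that state before you tried to shut down mongod.

When you later ran echo l > /proc/sysrq-trigger, a stack trace for the stuck thread was logged again. The two stack traces are identical, so it has most likely been stuck in the same place the whole time.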
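Comparing two traces by eye is error-prone because the timestamps and syslog prefixes differ. A small sketch of how the comparison could be automated is shown below. It assumes the frame format of this 4.8 kernel ("[<address>] symbol+0xoffset/0xsize"), and the helper name trace_frames is hypothetical.

```python
import re

# One stack frame: "[<ffffffff8b472865>] native_flush_tlb_others+0x65/0x130".
# Frames marked with "? " are unreliable guesses by the unwinder.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\] (\? )?(\S+\+0x[0-9a-f]+/0x[0-9a-f]+)"
)


def trace_frames(log_text):
    """Return the reliable frames that follow the first 'Call Trace:' line."""
    frames = []
    in_trace = False
    for line in log_text.splitlines():
        if "Call Trace:" in line:
            in_trace = True
            continue
        if not in_trace:
            continue
        m = FRAME_RE.search(line)
        if m is None:
            break  # the first non-frame line (e.g. "Code: ...") ends the trace
        if m.group(1) is None:  # skip frames marked with '?'
            frames.append(m.group(2))
    return frames


def same_place(first_log, second_log):
    """True if both log excerpts contain the same non-empty call trace."""
    first = trace_frames(first_log)
    return bool(first) and first == trace_frames(second_log)
```

Feeding it the watchdog excerpt and the later SysRq excerpt for CPU 1 is one way to check the claim that the thread never moved.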

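Reporting the bug

What you need to do is file a bug report against the kernel. Remember to include the log output from the first time the watchdog detected the stuck thread.

Rebooting the system

To get the system back to a sane state, you have to reboot. And there is a significant risk that it will not be able to shut down cleanly.

If you attempt a clean shutdown, you may need physical access to the machine in order to reset it, unless you have a way to power-cycle it remotely.

You can attempt a non-clean reboot with echo b > /proc/sysrq-trigger, which is about as destructive as cutting power to the machine. It avoids the situation where a clean shutdown gets stuck and you can no longer ssh to the machine.

Either way, a file system check will be needed during boot. So before attempting to shut the machine down in any way, you should stop services that write important data to disk and run the sync command.

It is possible that the sync command will get stuck. However, since the stack trace of the stuck thread contains nothing related to file systems or I/O, I consider that risk small.

You may also need physical access to the machine in order to boot it because of file system inconsistencies. That is less likely, though, than getting stuck while attempting a clean shutdown.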
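The sync-then-reboot sequence described above could be scripted roughly as follows. This is a sketch under stated assumptions, not a recommended tool: the function names and the 30-second timeout are invented for illustration, the script must run as root, and calling emergency_reboot on a real system reboots it immediately without unmounting anything. The sync runs in a daemon thread so that a stuck sync does not block the reboot.

```python
import os
import threading


def sync_with_timeout(timeout_s=30.0):
    """Flush dirty data to disk; return False if sync did not finish in time."""
    worker = threading.Thread(target=os.sync, daemon=True)
    worker.start()
    worker.join(timeout_s)
    return not worker.is_alive()


def sysrq(command, trigger_path="/proc/sysrq-trigger"):
    """Write a one-letter SysRq command (requires root on a real system)."""
    with open(trigger_path, "w") as trigger:
        trigger.write(command)


def emergency_reboot(trigger_path="/proc/sysrq-trigger", timeout_s=30.0):
    """Try to sync, then force an immediate reboot via SysRq 'b'.

    Returns whether the sync completed; on a real system the return value
    is never seen because the machine resets.
    """
    synced = sync_with_timeout(timeout_s)
    sysrq("b", trigger_path)
    return synced
```

Stopping the services that write important data still has to happen before this is called; the sketch deliberately leaves that step out because systemctl itself was hanging in the situation described here.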

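Answer 2

Since I do not have 50 reputation points, I cannot comment. However, please do not use kill -9; that will corrupt mongo. Please do the following and let me know: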

mongo --eval "db.getSiblingDB('admin').shutdownServer()"

or

mongod --dbpath /path/to/your/db --shutdown

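Source: https://docs.mongodb.com/manual/tutorial/manage-mongodb-processes/

Answer 3

I searched the Ubuntu forums, Google, and everywhere else to find the root cause of this problem and make sure it never happens again. As the top output shows, mongod (a zombie process) and systemctl are eating your (CPU) brains. I could not stop them with any kill command I could think of, for example: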

kill -9 1160

kill -SIGKILL 1160

mongod --dbpath /path/to/your/db --shutdown

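As an added bonus, I could not reboot the system from the terminal either.

Some say this is caused by a power problem (an underpowered motherboard) or by overclocking; others say it is a driver problem (for example an incompatible NVIDIA driver) or simply a problem with Ubuntu's own CPU drivers. Unfortunately, I will never know.

Following @kasperd's advice, the only way out of the loop/hang was a hard reset of the machine itself. If anyone knows what happened and has other opinions, I am open to suggestions.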
