Linux CPU soft lockup, kernel tainted, system hang


Recently, CPU usage on some of our Linux virtual machines has suddenly shot up and the systems have then hung. Sometimes no crash log is reported at all.

Below are the messages printed when the CPU soft lockup occurs; shortly afterwards the system hangs. I am not sure what is causing it, since the kernel being tainted with flag G does not seem to be the problem?

(G: the kernel has been tainted (for a reason indicated by one of the other flags), but all of the modules loaded into it are licensed under the GPL or a GPL-compatible license.)
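
The taint state can also be read as a bitmask from `/proc/sys/kernel/tainted`. Below is a minimal Python sketch that decodes it, assuming the standard bit-to-letter mapping documented for kernels of this generation (the `L` flag seen in the log below, for instance, is set once a soft lockup has already occurred):

```python
#!/usr/bin/env python3
# Minimal sketch: decode /proc/sys/kernel/tainted into taint-flag letters.
# The bit-to-letter mapping below follows the kernel's documented taint
# flags; exact letters can vary slightly between kernel versions, so treat
# this as illustrative rather than authoritative.

TAINT_FLAGS = {
    0:  "P (proprietary module loaded; letter is 'G' when this bit is clear)",
    1:  "F (module force-loaded)",
    2:  "S (SMP kernel on a CPU not designed for SMP)",
    3:  "R (module force-unloaded)",
    4:  "M (machine check exception)",
    5:  "B (bad page referenced)",
    6:  "U (taint requested by userspace)",
    7:  "D (kernel died: oops or BUG)",
    8:  "A (ACPI table overridden)",
    9:  "W (kernel warning issued)",
    10: "C (staging driver loaded)",
    11: "I (firmware/BIOS workaround applied)",
    12: "O (out-of-tree module loaded)",
    13: "E (unsigned module loaded)",
    14: "L (soft lockup occurred earlier)",
    15: "K (kernel live-patched)",
}

with open("/proc/sys/kernel/tainted") as f:
    value = int(f.read().strip())

if value == 0:
    print("Kernel is not tainted.")
else:
    print(f"Taint value: {value}")
    for bit, meaning in TAINT_FLAGS.items():
        if value & (1 << bit):
            print(f"  bit {bit}: {meaning}")
```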

> Sep 27 10:21:20 hadoop-9 kernel: BUG: soft lockup - CPU#2 stuck for 22s! [kworker/2:1:675]
> Sep 27 10:21:20 hadoop-9 kernel: Modules linked in: dccp_diag dccp tcp_diag udp_diag inet_diag unix_diag af_packet_diag netlink_diag iptable_filter fuse btrfs zlib_deflate raid6_pq xor vfat msdos fat ext4 mbcache jbd2 binfmt_misc bridge stp llc vmw_vsock_vmci_transport vsock coretemp crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd ppdev vmw_balloon pcspkr i2c_piix4 shpchp sg vmw_vmci parport_pc parport ip_tables xfs libcrc32c sr_mod cdrom ata_generic pata_acpi sd_mod crc_t10dif crct10dif_generic vmwgfx drm_kms_helper ttm crct10dif_pclmul crct10dif_common drm crc32c_intel serio_raw ata_piix vmxnet3 libata i2c_core vmw_pvscsi floppy dm_mirror dm_region_hash dm_log dm_mod
> Sep 27 10:21:20 hadoop-9 kernel: CPU: 2 PID: 675 Comm: kworker/2:1 Tainted: G             L ------------   3.10.0-327.el7.x86_64 #1
> Sep 27 10:21:20 hadoop-9 kernel: Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 09/21/2015
> Sep 27 10:21:20 hadoop-9 kernel: Workqueue: events_freezable vmballoon_work [vmw_balloon]
> Sep 27 10:21:20 hadoop-9 kernel: task: ffff880fe3d51700 ti: ffff88003635c000 task.ti: ffff88003635c000
> Sep 27 10:21:20 hadoop-9 kernel: RIP: 0010:[<ffffffff8108dbc8>]  [<ffffffff8108dbc8>] run_timer_softirq+0x68/0x340
> Sep 27 10:21:20 hadoop-9 kernel: RSP: 0018:ffff880ffe643e68  EFLAGS: 00000206
> Sep 27 10:21:20 hadoop-9 kernel: RAX: 000000011481b2fc RBX: ffff880ffe654780 RCX: ffff880ffe643e90
> Sep 27 10:21:20 hadoop-9 kernel: RDX: 000000011481b2fb RSI: ffff880ffe643e90 RDI: ffff880fe707c000
> Sep 27 10:21:20 hadoop-9 kernel: RBP: ffff880ffe643ed0 R08: 0001392dd1824e00 R09: 00000000000000ff
> Sep 27 10:21:20 hadoop-9 kernel: R10: 0000000000000000 R11: 0000000000000005 R12: ffff880ffe643dd8
> Sep 27 10:21:20 hadoop-9 kernel: R13: ffffffff8164655d R14: ffff880ffe643ed0 R15: ffff880fe707c000
> Sep 27 10:21:20 hadoop-9 kernel: FS:  0000000000000000(0000) GS:ffff880ffe640000(0000) knlGS:0000000000000000
> Sep 27 10:21:20 hadoop-9 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Sep 27 10:21:20 hadoop-9 kernel: CR2: 00000000028511e6 CR3: 000000000194a000 CR4: 00000000003407e0
> Sep 27 10:21:20 hadoop-9 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> Sep 27 10:21:20 hadoop-9 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Sep 27 10:21:20 hadoop-9 kernel: Stack:
> Sep 27 10:21:20 hadoop-9 kernel: ffff880fe707dc28 ffff880fe707d828 ffff880fe707d428 ffff880fe707d028
> Sep 27 10:21:20 hadoop-9 kernel: ffff880ffe643ea8 ffff880ffe643e90 ffff880ffe643e90 000000002783652e
> Sep 27 10:21:20 hadoop-9 kernel: 0000000000000001 0000000000000001 0000000000000000 ffffffff81943088
> Sep 27 10:21:20 hadoop-9 kernel: Call Trace:
> Sep 27 10:21:20 hadoop-9 kernel: <IRQ>
> Sep 27 10:21:20 hadoop-9 kernel: [<ffffffff81084b0f>] __do_softirq+0xef/0x280
> Sep 27 10:21:20 hadoop-9 kernel: [<ffffffff8164721c>] call_softirq+0x1c/0x30
> Sep 27 10:21:20 hadoop-9 kernel: [<ffffffff81016fc5>] do_softirq+0x65/0xa0
> Sep 27 10:21:20 hadoop-9 kernel: [<ffffffff81084ea5>] irq_exit+0x115/0x120
> Sep 27 10:21:20 hadoop-9 kernel: [<ffffffff81647e95>] smp_apic_timer_interrupt+0x45/0x60
> Sep 27 10:21:20 hadoop-9 kernel: [<ffffffff8164655d>] apic_timer_interrupt+0x6d/0x80
> Sep 27 10:21:20 hadoop-9 kernel: <EOI>
> Sep 27 10:21:20 hadoop-9 kernel: [<ffffffffa02b1553>] ? vmballoon_work+0x2b3/0x720 [vmw_balloon]
> Sep 27 10:21:20 hadoop-9 kernel: [<ffffffff8109d5fb>] process_one_work+0x17b/0x470
> Sep 27 10:21:20 hadoop-9 kernel: [<ffffffff8109e3cb>] worker_thread+0x11b/0x400
> Sep 27 10:21:20 hadoop-9 kernel: [<ffffffff8109e2b0>] ? rescuer_thread+0x400/0x400
> Sep 27 10:21:20 hadoop-9 kernel: [<ffffffff810a5aef>] kthread+0xcf/0xe0
> Sep 27 10:21:20 hadoop-9 kernel: [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
> Sep 27 10:21:20 hadoop-9 kernel: [<ffffffff81645858>] ret_from_fork+0x58/0x90
> Sep 27 10:21:20 hadoop-9 kernel: [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
> Sep 27 10:21:20 hadoop-9 kernel: Code: df e8 dd f0 5a 00 48 83 bb 28 20 00 00 00 75 3d 48 8b 05 4c 74 9e 00 48 89 43 10 0f 1f 44 00 00 66 83 03 02 fb 66 0f 1f 44 00 00 <48> 8b 45 d0 65 48 33 04 25 28 00 00 00 0f 85 be 02 00 00 48 83
> Sep 27 10:21:22 hadoop-9 abrt-dump-oops: Reported 1 kernel oopses to Abrt
> Sep 27 10:21:33 hadoop-9 kernel: blk_update_request: I/O error, dev fd0, sector 0
> Sep 27 10:21:34 hadoop-9 logger: os-prober: debug: running /usr/libexec/os-probes/mounted/05efi on mounted /dev/sda1

Answer 1

Formatting this as a "quote" instead of "code" made a mess of it, but here I have salvaged what is probably the most useful part:

Sep 27 10:21:20 hadoop-9 kernel: BUG: soft lockup - CPU#2 stuck for 22s!
...
Sep 27 10:21:20 hadoop-9 kernel: Call Trace: 
Sep 27 10:21:20 hadoop-9 kernel: <IRQ> 
Sep 27 10:21:20 hadoop-9 kernel: [] __do_softirq+0xef/0x280 
Sep 27 10:21:20 hadoop-9 kernel: [] call_softirq+0x1c/0x30 
Sep 27 10:21:20 hadoop-9 kernel: [] do_softirq+0x65/0xa0 
Sep 27 10:21:20 hadoop-9 kernel: [] irq_exit+0x115/0x120 
Sep 27 10:21:20 hadoop-9 kernel: [] smp_apic_timer_interrupt+0x45/0x60 
Sep 27 10:21:20 hadoop-9 kernel: [] apic_timer_interrupt+0x6d/0x80 
Sep 27 10:21:20 hadoop-9 kernel: <EOI> 
Sep 27 10:21:20 hadoop-9 kernel: [] ? vmballoon_work+0x2b3/0x720 [vmw_balloon] 
Sep 27 10:21:20 hadoop-9 kernel: [] process_one_work+0x17b/0x470 
Sep 27 10:21:20 hadoop-9 kernel: [] worker_thread+0x11b/0x400 
Sep 27 10:21:20 hadoop-9 kernel: [] ? rescuer_thread+0x400/0x400 
Sep 27 10:21:20 hadoop-9 kernel: [] kthread+0xcf/0xe0 
Sep 27 10:21:20 hadoop-9 kernel: [] ? kthread_create_on_node+0x140/0x140 
Sep 27 10:21:20 hadoop-9 kernel: [] ret_from_fork+0x58/0x90 
Sep 27 10:21:20 hadoop-9 kernel: [] ? kthread_create_on_node+0x140/0x140

The upper half of the call trace looks like a fairly generic trace triggered by a timer interrupt. That is probably just where the soft lockup was detected.

The bottom part suggests the system was inside the vmw_balloon driver. This driver is used with VMware: it lets the underlying virtualization host tell the VM that it temporarily cannot use all of the RAM allocated to it. If I understand it correctly, the driver makes contiguous, non-pageable memory allocations inside the VM's operating system and then reports their locations to the virtualization host: "this part of the RAM allocated to this VM is now fenced off; you can reuse it elsewhere for the time being".
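
To see whether ballooning is actually active inside the guest, the vmw_balloon driver usually exposes a small status file via debugfs. A minimal sketch, assuming the entry is `/sys/kernel/debug/vmmemctl` with `target`/`current` page counts (the path and format vary between kernel versions, and reading it requires root with debugfs mounted):

```python
#!/usr/bin/env python3
# Minimal sketch: print the VMware balloon status from inside the guest.
# Assumption: vmw_balloon exposes /sys/kernel/debug/vmmemctl with "target"
# and "current" page counts; the path and format depend on the kernel.
from pathlib import Path

status = Path("/sys/kernel/debug/vmmemctl")  # needs root; debugfs mounted
if status.exists():
    print(status.read_text())
else:
    print("vmmemctl debugfs entry not found; the balloon may be idle, "
          "debugfs may not be mounted, or the path differs on this kernel.")
```

If VMware Tools is installed in the guest, `vmware-toolbox-cmd stat balloon` should report the ballooned amount as well.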

The fact that CPU #2 has been busy inside this single driver for 22 seconds suggests to me that there is some kind of RAM shortage: either the VM needs memory that has already been ballooned out and the virtualization host cannot give it back in a timely manner, or the virtualization host needs more RAM elsewhere and is desperately trying to reclaim even more of it from the VM.
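
One way to confirm such a shortage from inside the guest is to watch a few `/proc/meminfo` fields while the lockups are happening. A minimal sketch (standard `/proc/meminfo` field names; fields that this kernel does not report are simply skipped):

```python
#!/usr/bin/env python3
# Minimal sketch: periodically sample /proc/meminfo to watch for memory
# pressure inside the guest while the soft lockups are occurring.
import time

FIELDS = ("MemTotal", "MemFree", "MemAvailable", "SwapFree")

def meminfo():
    values = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, rest = line.split(":", 1)
            if key in FIELDS:
                values[key] = int(rest.strip().split()[0])  # value in kB
    return values

while True:
    sample = meminfo()
    print(time.strftime("%H:%M:%S"),
          " ".join(f"{k}={v}kB" for k, v in sample.items()))
    time.sleep(5)
```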

You should contact the administrator of the virtualization host and ask them to check the host's memory statistics. If some VMs are expected to be almost always idle while others are busy, overcommitting a certain amount of RAM may be acceptable (i.e. the sum of the RAM allocations assigned to the VMs is larger than the memory the host actually has). But too much overcommitment destroys the overall performance of the system. This error may be a side effect of the virtualization host having promised more RAM than it can actually deliver.
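
As a rough illustration of the overcommitment mentioned above: add up the RAM configured for all VMs on the host and compare it with the host's physical RAM. The numbers in the sketch below are hypothetical placeholders:

```python
#!/usr/bin/env python3
# Illustration only: how much RAM has the host promised to its VMs,
# compared with what it physically has? All numbers are hypothetical.
host_ram_gb = 256
vm_allocations_gb = [64, 64, 64, 48, 32, 32]   # configured RAM per VM

promised = sum(vm_allocations_gb)
ratio = promised / host_ram_gb
print(f"Promised to VMs: {promised} GiB on a {host_ram_gb} GiB host "
      f"(overcommit ratio {ratio:.2f})")
if ratio > 1.0:
    print("RAM is overcommitted: under load the host has to balloon or swap "
          "guest memory, which can surface as soft lockups inside the guests.")
```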

If the statistics show that the virtualization host is short on RAM, a quick fix may be to migrate one or more VMs to another host that has enough free RAM. If that is not possible, more physical RAM will have to be added to the host system, which may require downtime.
