昨天,我们 Docker Swarm 集群中的一个 Java 服务突然停止响应。此服务的任务(容器)挂起,没有响应任何请求,也无法终止。
查看 /var/log/syslog,我们在服务停止时看到以下内容。我们对如何理解这些日志事件有点怀疑。
我们的理解是,该问题与需要重新放回内存的交换数据有关。
调用跟踪将在 Java 服务的最后一条日志条目之后 120 秒写入 /var/log/syslog。
我们在 Hyper-V 上运行集群,并且看到提到了hv_netvsc
(Hyper-V 网络虚拟服务消费者)。这可能是虚拟机管理程序导致的问题吗?
Java 进程相当耗费资源,通常需要大约 15 - 20GB RAM。
我们应该如何理解这个调用跟踪?这个调用跟踪是否能描述问题的根源,还是更有可能是导致服务终止的问题的后果?
Dec 9 20:45:36 prodswarm1 kernel: [216481.008119] INFO: task java:8748 blocked for more than 120 seconds.
Dec 9 20:45:36 prodswarm1 kernel: [216481.008150] Not tainted 4.4.0-101-generic #124-Ubuntu
Dec 9 20:45:36 prodswarm1 kernel: [216481.008169] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 9 20:45:36 prodswarm1 kernel: [216481.008194] java D ffff881fdf63f6b8 0 8748 7863 0x00000100
Dec 9 20:45:36 prodswarm1 kernel: [216481.008198] ffff881fdf63f6b8 ffff881fdf63f690 ffff881fd93af000 ffff881fd93a8e00
Dec 9 20:45:36 prodswarm1 kernel: [216481.008200] ffff881fdf640000 ffff8810278d6e00 7fffffffffffffff ffffffff81840ed0
Dec 9 20:45:36 cprodswarm1phprodswarm1 kernel: [216481.008202] ffff881fdf63f810 ffff881fdf63f6d0 ffffffff818406d5 0000000000000000
Dec 9 20:45:36 prodswarm1 kernel: [216481.008204] Call Trace:
Dec 9 20:45:36 prodswarm1 kernel: [216481.008211] [<ffffffff81840ed0>] ? bit_wait+0x60/0x60
Dec 9 20:45:36 prodswarm1 kernel: [216481.008214] [<ffffffff818406d5>] schedule+0x35/0x80
Dec 9 20:45:36 prodswarm1 kernel: [216481.008216] [<ffffffff81843825>] schedule_timeout+0x1b5/0x270
Dec 9 20:45:36 prodswarm1 kernel: [216481.008222] [<ffffffff810b2d0c>] ? __enqueue_entity+0x6c/0x70
Dec 9 20:45:36 prodswarm1 kernel: [216481.008224] [<ffffffff810b9b87>] ? put_prev_entity+0x97/0x7d0
Dec 9 20:45:36 prodswarm1 kernel: [216481.008228] [<ffffffff810f635b>] ? ktime_get+0x3b/0xb0
Dec 9 20:45:36 prodswarm1 kernel: [216481.008229] [<ffffffff81840ed0>] ? bit_wait+0x60/0x60
Dec 9 20:45:36 prodswarm1 kernel: [216481.008231] [<ffffffff8183fc04>] io_schedule_timeout+0xa4/0x110
Dec 9 20:45:36 prodswarm1 kernel: [216481.008233] [<ffffffff81840eeb>] bit_wait_io+0x1b/0x70
Dec 9 20:45:36 prodswarm1 kernel: [216481.008234] [<ffffffff81840a7d>] __wait_on_bit+0x5d/0x90
Dec 9 20:45:36 prodswarm1 kernel: [216481.008239] [<ffffffff8118efdb>] wait_on_page_bit+0xcb/0xf0
Dec 9 20:45:36 prodswarm1 kernel: [216481.008242] [<ffffffff810c4460>] ? autoremove_wake_function+0x40/0x40
Dec 9 20:45:36 prodswarm1 kernel: [216481.008246] [<ffffffff811a5faa>] shrink_page_list+0x78a/0x7a0
Dec 9 20:45:36 prodswarm1 kernel: [216481.008248] [<ffffffff811a6659>] shrink_inactive_list+0x209/0x520
Dec 9 20:45:36 prodswarm1 kernel: [216481.008251] [<ffffffff811a72e3>] shrink_lruvec+0x583/0x740
Dec 9 20:45:36 prodswarm1 kernel: [216481.008254] [<ffffffff81098601>] ? insert_work+0x11/0xc0
Dec 9 20:45:36 prodswarm1 kernel: [216481.008256] [<ffffffff811a758f>] shrink_zone+0xef/0x2e0
Dec 9 20:45:36 prodswarm1 kernel: [216481.008258] [<ffffffff811a78db>] do_try_to_free_pages+0x15b/0x3b0
Dec 9 20:45:36 prodswarm1 kernel: [216481.008260] [<ffffffff811a7d6a>] try_to_free_mem_cgroup_pages+0xba/0x1a0
Dec 9 20:45:36 prodswarm1 kernel: [216481.008264] [<ffffffff811ffc64>] try_charge+0x194/0x660
Dec 9 20:45:36 prodswarm1 kernel: [216481.008267] [<ffffffff81203f7b>] mem_cgroup_try_charge+0x6b/0x1e0
Dec 9 20:45:36 prodswarm1 kernel: [216481.008269] [<ffffffff811fb71a>] do_huge_pmd_anonymous_page+0x16a/0x580
Dec 9 20:45:36 prodswarm1 kernel: [216481.008271] [<ffffffff811c30b4>] handle_mm_fault+0x15d4/0x1820
Dec 9 20:45:36 prodswarm1 kernel: [216481.008275] [<ffffffff81103ba7>] ? do_futex+0x307/0x540
Dec 9 20:45:36 prodswarm1 kernel: [216481.008279] [<ffffffff8106b577>] __do_page_fault+0x197/0x400
Dec 9 20:45:36 prodswarm1 kernel: [216481.008281] [<ffffffff8106b802>] do_page_fault+0x22/0x30
Dec 9 20:45:36 prodswarm1 kernel: [216481.008283] [<ffffffff81846978>] page_fault+0x28/0x30
Dec 9 20:45:36 prodswarm1 kernel: [216481.008293] INFO: task java:14024 blocked for more than 120 seconds.
Dec 9 20:45:36 prodswarm1 kernel: [216481.008313] Not tainted 4.4.0-101-generic #124-Ubuntu
Dec 9 20:45:36 prodswarm1 kernel: [216481.008331] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 9 20:45:36 prodswarm1 kernel: [216481.008356] java D ffff8800f2c5b158 0 14024 7863 0x00000100
Dec 9 20:45:36 prodswarm1 kernel: [216481.008358] ffff8800f2c5b158 0000000000000000 ffff8810208aaa00 ffff8800f2ac0000
Dec 9 20:45:36 prodswarm1 kernel: [216481.008360] ffff8800f2c5c000 ffff881027956e00 7fffffffffffffff ffffffff81840ed0
Dec 9 20:45:36 prodswarm1 kernel: [216481.008362] ffff8800f2c5b2b0 ffff8800f2c5b170 ffffffff818406d5 0000000000000000
Dec 9 20:45:36 prodswarm1 kernel: [216481.008363] Call Trace:
Dec 9 20:45:36 prodswarm1 kernel: [216481.008365] [<ffffffff81840ed0>] ? bit_wait+0x60/0x60
Dec 9 20:45:36 prodswarm1 kernel: [216481.008367] [<ffffffff818406d5>] schedule+0x35/0x80
Dec 9 20:45:36 prodswarm1 kernel: [216481.008369] [<ffffffff81843825>] schedule_timeout+0x1b5/0x270
Dec 9 20:45:36 prodswarm1 kernel: [216481.008374] [<ffffffffc01f095b>] ? netvsc_start_xmit+0x3ab/0x7b0 [hv_netvsc]
Dec 9 20:45:36 prodswarm1 kernel: [216481.008376] [<ffffffff810f635b>] ? ktime_get+0x3b/0xb0
Dec 9 20:45:36 prodswarm1 kernel: [216481.008377] [<ffffffff81840ed0>] ? bit_wait+0x60/0x60
Dec 9 20:45:36 prodswarm1 kernel: [216481.008379] [<ffffffff8183fc04>] io_schedule_timeout+0xa4/0x110
Dec 9 20:45:36 prodswarm1 kernel: [216481.008381] [<ffffffff81840eeb>] bit_wait_io+0x1b/0x70
Dec 9 20:45:36 prodswarm1 kernel: [216481.008382] [<ffffffff81840a7d>] __wait_on_bit+0x5d/0x90
Dec 9 20:45:36 prodswarm1 kernel: [216481.008385] [<ffffffff8118efdb>] wait_on_page_bit+0xcb/0xf0
Dec 9 20:45:36 prodswarm1 kernel: [216481.008387] [<ffffffff810c4460>] ? autoremove_wake_function+0x40/0x40
Dec 9 20:45:36 prodswarm1 kernel: [216481.008388] [<ffffffff811a5faa>] shrink_page_list+0x78a/0x7a0
Dec 9 20:45:36 prodswarm1 kernel: [216481.008391] [<ffffffff811a6659>] shrink_inactive_list+0x209/0x520
Dec 9 20:45:36 prodswarm1 kernel: [216481.008392] [<ffffffff811a72e3>] shrink_lruvec+0x583/0x740
Dec 9 20:45:36 prodswarm1 kernel: [216481.008394] [<ffffffff811a758f>] shrink_zone+0xef/0x2e0
Dec 9 20:45:36 prodswarm1 kernel: [216481.008396] [<ffffffff811a78db>] do_try_to_free_pages+0x15b/0x3b0
Dec 9 20:45:36 prodswarm1 kernel: [216481.008398] [<ffffffff811a7d6a>] try_to_free_mem_cgroup_pages+0xba/0x1a0
Dec 9 20:45:36 prodswarm1 kernel: [216481.008400] [<ffffffff811ffc64>] try_charge+0x194/0x660
Dec 9 20:45:36 prodswarm1 kernel: [216481.008402] [<ffffffff81203836>] __memcg_kmem_charge_memcg+0x96/0xc0
Dec 9 20:45:36 prodswarm1 kernel: [216481.008406] [<ffffffff811ed611>] new_slab+0x461/0x490
Dec 9 20:45:36 prodswarm1 kernel: [216481.008408] [<ffffffff811ee48b>] ___slab_alloc+0x22b/0x470
Dec 9 20:45:36 prodswarm1 kernel: [216481.008429] [<ffffffffc045597b>] ? nfs_generic_pgio+0x6b/0x2a0 [nfs]
Dec 9 20:45:36 prodswarm1 kernel: [216481.008431] [<ffffffff81203659>] ? __memcg_kmem_get_cache+0x39/0x130
Dec 9 20:45:36 prodswarm1 kernel: [216481.008437] [<ffffffffc045597b>] ? nfs_generic_pgio+0x6b/0x2a0 [nfs]
Dec 9 20:45:36 prodswarm1 kernel: [216481.008439] [<ffffffff811ee6f0>] __slab_alloc+0x20/0x40
Dec 9 20:45:36 prodswarm1 kernel: [216481.008441] [<ffffffff811f0085>] __kmalloc+0x1d5/0x250
Dec 9 20:45:36 prodswarm1 kernel: [216481.008448] [<ffffffffc045597b>] nfs_generic_pgio+0x6b/0x2a0 [nfs]
Dec 9 20:45:36 prodswarm1 kernel: [216481.008456] [<ffffffffc0455c18>] nfs_generic_pg_pgios+0x68/0xd0 [nfs]
Dec 9 20:45:36 prodswarm1 kernel: [216481.008463] [<ffffffffc0455857>] nfs_pageio_doio+0x27/0x60 [nfs]
Dec 9 20:45:36 prodswarm1 kernel: [216481.008470] [<ffffffffc045631e>] __nfs_pageio_add_request+0xae/0x480 [nfs]
Dec 9 20:45:36 prodswarm1 kernel: [216481.008477] [<ffffffffc0456da3>] nfs_pageio_add_request+0xa3/0x1b0 [nfs]
Dec 9 20:45:36 prodswarm1 kernel: [216481.008487] [<ffffffffc045a76a>] nfs_do_writepage+0xca/0x1c0 [nfs]
Dec 9 20:45:36 prodswarm1 kernel: [216481.008496] [<ffffffffc045a874>] nfs_writepages_callback+0x14/0x30 [nfs]
Dec 9 20:45:36 prodswarm1 kernel: [216481.008499] [<ffffffff8119ba4e>] write_cache_pages+0x1ee/0x510
Dec 9 20:45:36 prodswarm1 kernel: [216481.008502] [<ffffffff810c4175>] ? wake_up_bit+0x25/0x30
Dec 9 20:45:36 prodswarm1 kernel: [216481.008511] [<ffffffffc045a860>] ? nfs_do_writepage+0x1c0/0x1c0 [nfs]
Dec 9 20:45:36 prodswarm1 kernel: [216481.008520] [<ffffffffc045ac95>] nfs_writepages+0xa5/0x150 [nfs]
Dec 9 20:45:36 prodswarm1 kernel: [216481.008528] [<ffffffffc044bdeb>] ? nfs_write_end+0x16b/0x4b0 [nfs]
Dec 9 20:45:36 prodswarm1 kernel: [216481.008530] [<ffffffff8119e25e>] do_writepages+0x1e/0x30
Dec 9 20:45:36 prodswarm1 kernel: [216481.008532] [<ffffffff81190f86>] __filemap_fdatawrite_range+0xc6/0x100
Dec 9 20:45:36 prodswarm1 kernel: [216481.008534] [<ffffffff811910ca>] filemap_write_and_wait_range+0x2a/0x70
Dec 9 20:45:36 prodswarm1 kernel: [216481.008539] [<ffffffffc044b1d3>] nfs_file_fsync+0x83/0x130 [nfs]
Dec 9 20:45:36 prodswarm1 kernel: [216481.008542] [<ffffffff8124509b>] vfs_fsync_range+0x4b/0xb0
Dec 9 20:45:36 prodswarm1 kernel: [216481.008544] [<ffffffff8124511c>] vfs_fsync+0x1c/0x20
Dec 9 20:45:36 prodswarm1 kernel: [216481.008549] [<ffffffffc044b6a6>] nfs_file_flush+0x46/0x60 [nfs]
Dec 9 20:45:36 prodswarm1 kernel: [216481.008552] [<ffffffff8120d7df>] filp_close+0x2f/0x70
Dec 9 20:45:36 prodswarm1 kernel: [216481.008554] [<ffffffff8122f193>] __close_fd+0xa3/0xd0
Dec 9 20:45:36 prodswarm1 kernel: [216481.008556] [<ffffffff8120d843>] SyS_close+0x23/0x50
Dec 9 20:45:36 prodswarm1 kernel: [216481.008558] [<ffffffff818447f2>] entry_SYSCALL_64_fastpath+0x16/0x71
答案1
在我看来,这似乎是交换过程中的某种死锁。您需要监视应用程序的内存使用情况和交换使用情况,并查看发生这种情况时它的行为。这样您就可以知道问题是否是分配给容器的内存太少。