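After upgrading several hosts to Ubuntu 12.04.5 LTS with the LTS enablement stack (linux-generic-lts-trusty 3.13.0.40.35), we have seen a sudden spike in kernel errors. The errors only start to appear after a few days of use and (to my untrained eye) don't seem to have much in common.
Is there a known issue in 3.13.0-71-generic? What can we do to fix this (or at least figure out what is happening)? The errors have occurred in the field, but we have not been able to reproduce them in-house on identical hardware, so we have not yet had a chance to see whether upgrading to the latest Trusty kernel solves the problem.
The call traces follow: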
Apr 4 23:35:37 hostname kernel: [319114.311718] INFO: task python2.7:5769 blocked for more than 300 seconds.
Apr 4 23:35:37 hostname kernel: [319114.311959] Tainted: P OX 3.13.0-71-generic #114~precise1-Ubuntu
Apr 4 23:35:37 hostname kernel: [319114.312201] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 4 23:35:37 hostname kernel: [319114.312454] python2.7 D ffffffff81811520 0 5769 5767 0x00000000
Apr 4 23:35:37 hostname kernel: [319114.312457] ffff8800023c3be8 0000000000000082 ffff8800023c3ba8 ffff8800023c3fd8
Apr 4 23:35:37 hostname kernel: [319114.312459] 0000000000013180 0000000000013180 ffffffff81c144a0 ffff88000238b000
Apr 4 23:35:37 hostname kernel: [319114.312460] ffff8800023c3bc8 ffff8805ae5374a8 ffff8805ae5374ac 00000000ffffffff
Apr 4 23:35:37 hostname kernel: [319114.312462] Call Trace:
Apr 4 23:35:37 hostname kernel: [319114.312467] [<ffffffff81764799>] schedule+0x29/0x70
Apr 4 23:35:38 hostname kernel: [319114.312469] [<ffffffff81764abe>] schedule_preempt_disabled+0xe/0x10
Apr 4 23:35:38 hostname kernel: [319114.312470] [<ffffffff817668f4>] __mutex_lock_slowpath+0x114/0x1b0
Apr 4 23:35:38 hostname kernel: [319114.312472] [<ffffffff817669b3>] mutex_lock+0x23/0x37
Apr 4 23:35:38 hostname kernel: [319114.312474] [<ffffffff811da631>] do_last+0x281/0x7d0
Apr 4 23:35:38 hostname kernel: [319114.312475] [<ffffffff811dac44>] path_openat+0xc4/0x4c0
Apr 4 23:35:38 hostname kernel: [319114.312477] [<ffffffff811855eb>] ? __handle_mm_fault+0x1db/0x360
Apr 4 23:35:38 hostname kernel: [319114.312478] [<ffffffff81185823>] ? handle_mm_fault+0xb3/0x160
Apr 4 23:35:38 hostname kernel: [319114.312480] [<ffffffff811dbed3>] do_filp_open+0x43/0xa0
Apr 4 23:35:38 hostname kernel: [319114.312483] [<ffffffff811e900e>] ? __alloc_fd+0xce/0x120
Apr 4 23:35:38 hostname kernel: [319114.312486] [<ffffffff811ca786>] do_sys_open+0x136/0x2a0
Apr 4 23:35:38 hostname kernel: [319114.312488] [<ffffffff811ca90e>] SyS_open+0x1e/0x20
Apr 4 23:35:38 hostname kernel: [319114.312491] [<ffffffff8177145d>] system_call_fastpath+0x1a/0x1f
Apr 4 23:35:38 hostname kernel: [319114.312496] INFO: task python2.7:6320 blocked for more than 300 seconds.
Apr 4 23:35:38 hostname kernel: [319114.312758] Tainted: P OX 3.13.0-71-generic #114~precise1-Ubuntu
Apr 4 23:35:38 hostname kernel: [319114.313031] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 4 23:35:38 hostname kernel: [319114.313319] python2.7 D ffffffff81811520 0 6320 6314 0x00000000
Apr 4 23:35:38 hostname kernel: [319114.313320] ffff880021ebdbe8 0000000000000086 0000000000000286 ffff880021ebdfd8
Apr 4 23:35:40 hostname kernel: [319114.313322] 0000000000013180 0000000000013180 ffffffff81c144a0 ffff880002393000
Apr 4 23:35:40 hostname kernel: [319114.313323] ffff880021ebdbc8 ffff8805ae5374a8 ffff8805ae5374ac 00000000ffffffff
Apr 4 15:00:41 hostname kernel: [191113.073832] INFO: task python2.7:8525 blocked for more than 300 seconds.
Apr 4 15:01:00 hostname kernel: [191113.073859] Tainted: P OX 3.13.0-71-generic #114~precise1-Ubuntu
Apr 4 15:01:15 hostname kernel: [191113.073882] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 4 15:01:15 hostname kernel: [191113.073906] python2.7 D 0000000000000000 0 8525 8517 0x00000000
Apr 4 15:01:15 hostname kernel: [191113.073909] ffff880212b3dbe8 0000000000000082 ffff880212b3dba8 ffff880212b3dfd8
Apr 4 15:01:15 hostname kernel: [191113.073911] 0000000000013180 0000000000013180 ffff88000e04e000 ffff8803251de000
Apr 4 15:01:15 hostname kernel: [191113.073913] ffff880212b3dbd8 ffff8802888190a8 ffff8802888190ac 00000000ffffffff
Apr 4 15:01:15 hostname kernel: [191113.073915] Call Trace:
Apr 4 15:01:15 hostname kernel: [191113.073921] [<ffffffff81764799>] schedule+0x29/0x70
Apr 4 15:01:15 hostname kernel: [191113.073923] [<ffffffff81764abe>] schedule_preempt_disabled+0xe/0x10
Apr 4 15:01:15 hostname kernel: [191113.073926] [<ffffffff817668f4>] __mutex_lock_slowpath+0x114/0x1b0
Apr 4 15:01:15 hostname kernel: [191113.073927] [<ffffffff817669b3>] mutex_lock+0x23/0x37
Apr 4 15:01:15 hostname kernel: [191113.073930] [<ffffffff811da631>] do_last+0x281/0x7d0
Apr 4 15:01:15 hostname kernel: [191113.073931] [<ffffffff811dac44>] path_openat+0xc4/0x4c0
Apr 4 15:01:15 hostname kernel: [191113.073934] [<ffffffff811855eb>] ? __handle_mm_fault+0x1db/0x360
Apr 4 15:01:15 hostname kernel: [191113.073935] [<ffffffff81185823>] ? handle_mm_fault+0xb3/0x160
Apr 6 19:56:45 hostname kernel: [450264.877269] Out of memory: Kill process 26196 (python2.7) score 14 or sacrifice child
Apr 6 19:56:45 hostname kernel: [450264.877307] Killed process 26196 (python2.7) total-vm:76966004kB, anon-rss:88036kB, file-rss:170036kB
Apr 6 20:12:01 hostname kernel: [451123.424257] INFO: task cron:32543 blocked for more than 300 seconds.
Apr 6 20:12:01 hostname kernel: [451123.424286] Tainted: P OX 3.13.0-71-generic #114~precise1-Ubuntu
Apr 6 20:12:01 hostname kernel: [451123.424312] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 6 20:12:01 hostname kernel: [451123.424339] cron D ffffffff81811520 0 32543 1398 0x00000000
Apr 6 20:12:01 hostname kernel: [451123.424343] ffff880050453be8 0000000000000086 ffff880050453bd8 ffff880050453fd8
Apr 6 20:12:01 hostname kernel: [451123.424346] 0000000000013180 0000000000013180 ffff880873f20000 ffff88086e8d6000
Apr 6 20:12:01 hostname kernel: [451123.424348] 0000000000000286 ffff88086dedbb00 ffff88086dedbb04 00000000ffffffff
Apr 6 20:12:01 hostname kernel: [451123.424350] Call Trace:
Apr 6 20:12:01 hostname kernel: [451123.424356] [<ffffffff81764799>] schedule+0x29/0x70
Apr 6 20:12:01 hostname kernel: [451123.424359] [<ffffffff81764abe>] schedule_preempt_disabled+0xe/0x10
Apr 6 20:12:01 hostname kernel: [451123.424362] [<ffffffff817668f4>] __mutex_lock_slowpath+0x114/0x1b0
Apr 6 20:12:01 hostname kernel: [451123.424364] [<ffffffff817669b3>] mutex_lock+0x23/0x37
Apr 6 20:12:01 hostname kernel: [451123.424366] [<ffffffff811da631>] do_last+0x281/0x7d0
Apr 6 20:12:01 hostname kernel: [451123.424368] [<ffffffff811dac44>] path_openat+0xc4/0x4c0
Apr 6 20:12:01 hostname kernel: [451123.424371] [<ffffffff811855eb>] ? __handle_mm_fault+0x1db/0x360
Apr 6 20:12:01 hostname kernel: [451123.424373] [<ffffffff81185823>] ? handle_mm_fault+0xb3/0x160
Apr 6 20:12:01 hostname kernel: [451123.424375] [<ffffffff811dbed3>] do_filp_open+0x43/0xa0
Apr 6 20:12:01 hostname kernel: [451123.424378] [<ffffffff811e900e>] ? __alloc_fd+0xce/0x120
Apr 6 20:12:31 hostname kernel: [451123.424381] [<ffffffff811ca786>] do_sys_open+0x136/0x2a0
Apr 6 20:12:31 hostname kernel: [451123.424383] [<ffffffff811ca90e>] SyS_open+0x1e/0x20
Apr 6 20:12:31 hostname kernel: [451123.424387] [<ffffffff8177145d>] system_call_fastpath+0x1a/0x1f
Could this be bad memory (RAM)?
Apr 5 19:58:53 hostname kernel: [462034.034881] apache2 invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
Apr 5 19:58:53 hostname kernel: [462034.034885] apache2 cpuset=/ mems_allowed=0
Apr 5 19:58:53 hostname kernel: [462034.034888] CPU: 6 PID: 19720 Comm: apache2 Tainted: P OX 3.13.0-71-generic #114~precise1-Ubuntu
Apr 5 19:58:53 hostname kernel: [462034.034889] Hardware name: Supermicro C7Z87-OCE/C7Z87-OCE, BIOS 2.2 01/30/2015
Apr 5 19:58:53 hostname kernel: [462034.034890] 0000000000000000 ffff88089e46b888 ffffffff8175bca1 0000000000000007
Apr 5 19:58:53 hostname kernel: [462034.034893] ffff880203b91800 ffff88089e46b8d8 ffffffff8175172b ffff880800000000
Apr 5 19:58:53 hostname kernel: [462034.034895] 000201da81381898 ffff88001e730000 ffff880003f28000 0000000000000000
Apr 5 19:58:53 hostname kernel: [462034.034897] Call Trace:
Apr 5 19:58:53 hostname kernel: [462034.034902] [<ffffffff8175bca1>] dump_stack+0x46/0x58
Apr 5 19:58:53 hostname kernel: [462034.034905] [<ffffffff8175172b>] dump_header+0x7e/0xbd
Apr 5 19:58:53 hostname kernel: [462034.034907] [<ffffffff817517c1>] oom_kill_process.part.5+0x57/0x2d7
Apr 5 19:58:53 hostname kernel: [462034.034910] [<ffffffff8115cb27>] oom_kill_process+0x47/0x50
Apr 5 19:58:53 hostname kernel: [462034.034912] [<ffffffff8115ce65>] out_of_memory+0x145/0x1d0
Apr 5 19:58:53 hostname kernel: [462034.034915] [<ffffffff81162e17>] __alloc_pages_nodemask+0xab7/0xbb0
Apr 5 19:58:53 hostname kernel: [462034.034919] [<ffffffff811a4102>] alloc_pages_current+0xb2/0x170
Apr 5 19:58:53 hostname kernel: [462034.034921] [<ffffffff811591c7>] __page_cache_alloc+0xb7/0xd0
Apr 5 19:58:53 hostname kernel: [462034.034923] [<ffffffff8115afbd>] filemap_fault+0x28d/0x440
Apr 5 19:58:53 hostname kernel: [462034.034926] [<ffffffff811811ef>] __do_fault+0x6f/0x530
Apr 5 19:58:53 hostname kernel: [462034.034928] [<ffffffff81185046>] handle_pte_fault+0x96/0x230
Apr 5 19:58:53 hostname kernel: [462034.034930] [<ffffffff81764799>] ? schedule+0x29/0x70
Apr 5 19:58:53 hostname kernel: [462034.034932] [<ffffffff811855eb>] __handle_mm_fault+0x1db/0x360
Apr 5 19:58:53 hostname kernel: [462034.034934] [<ffffffff81185823>] handle_mm_fault+0xb3/0x160
Apr 5 19:58:53 hostname kernel: [462034.034937] [<ffffffff8176c720>] __do_page_fault+0x1b0/0x580
Apr 5 19:58:53 hostname kernel: [462034.034940] [<ffffffff8101ce89>] ? read_tsc+0x9/0x20
Apr 5 19:58:53 hostname kernel: [462034.034943] [<ffffffff810d329c>] ? ktime_get_ts+0x4c/0xe0
Apr 5 19:58:53 hostname kernel: [462034.034946] [<ffffffff811deb4d>] ? poll_select_copy_remaining+0xed/0x140
Apr 5 19:58:53 hostname kernel: [462034.034948] [<ffffffff8176cb0a>] do_page_fault+0x1a/0x70
Apr 5 19:58:53 hostname kernel: [462034.034950] [<ffffffff81768b28>] page_fault+0x28/0x30
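To get an overview of traces like the ones above, it can help to count which tasks the hung-task detector and the OOM killer report. The following is only a minimal sketch: the function name `summarize` and both regular expressions are my own, written against the message formats visible in the excerpts above, and other kernel versions may word these lines differently.

```python
import re
from collections import Counter

# Header line printed by the hung-task detector, e.g.
# "INFO: task python2.7:5769 blocked for more than 300 seconds."
HUNG_RE = re.compile(
    r"INFO: task (?P<name>.+?):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)

# Line printed when the OOM killer has killed a process, e.g.
# "Killed process 26196 (python2.7) total-vm:..."
OOM_RE = re.compile(r"Killed process (?P<pid>\d+) \((?P<name>[^)]+)\)")


def summarize(lines):
    """Count hung-task reports and OOM kills per task name.

    `lines` is any iterable of log lines (a list or an open file).
    Returns a pair of Counters: (hung_tasks, oom_kills).
    """
    hung = Counter()
    oom = Counter()
    for line in lines:
        match = HUNG_RE.search(line)
        if match:
            hung[match.group("name")] += 1
            continue
        match = OOM_RE.search(line)
        if match:
            oom[match.group("name")] += 1
    return hung, oom
```

Assuming the kernel messages are written to a file such as `/var/log/kern.log`, the function could be called as `summarize(open("/var/log/kern.log"))`. The counts only show which tasks are affected and how often; they do not identify the lock the tasks are waiting on.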
Answer 1
Since I have to answer this question in order to close it: as suggested in Michael Hampton's comment, updating the kernel (to .85) resolved the problem.
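For anyone checking a fleet of hosts, a rough sketch like the one below can flag machines that are still on an older 3.13.0 build. It assumes that ".85" means ABI number 85 (a release string such as `3.13.0-85-generic`) and that release strings follow the `3.13.0-NN-flavour` pattern seen in the logs; the names `FIXED_ABI`, `abi_number` and `needs_update` are made up for illustration.

```python
import re

# Assumption: ".85" in the answer refers to the 3.13.0-85 kernel build.
FIXED_ABI = 85


def abi_number(release):
    """Return the ABI number of a 3.13.0 release string, or None.

    "3.13.0-71-generic" -> 71; anything that is not a 3.13.0 kernel -> None.
    """
    match = re.match(r"^3\.13\.0-(\d+)-", release)
    return int(match.group(1)) if match else None


def needs_update(release, fixed_abi=FIXED_ABI):
    """True if `release` is a 3.13.0 kernel older than the fixed build."""
    abi = abi_number(release)
    return abi is not None and abi < fixed_abi
```

On a running host the release string can be obtained with `platform.release()` from the standard library, which returns the same value as `uname -r`.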