16.04 with the 4.15.0-36-generic kernel: system freezes, multiple worker threads blocked/hung in the BTRFS module

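We are an Ubuntu shop running Xenial (16.04) with Linux kernel 4.15.0-36-generic. We recently migrated our application code from Ubuntu 14.04 with Linux kernel 4.4.0-23. In the 16.04 environment we are seeing system hangs (described below) that we never saw on 14.04 with the 4.4 kernel.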

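Our secondary storage (a 256GB Micron SSD) is attached as /dev/sdb and uses a BTRFS filesystem. During large file copy/modify operations the system hangs, and the following messages appear on the system console tty: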

> [147778.018904] INFO: task btrfs-transacti:440 blocked for more than 120 seconds.
> [147778.026992]       Tainted: P           OE    4.15.0-36-generic #1
> [147778.033914] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [147778.042918] INFO: task auditd:1138 blocked for more than 120 seconds.
> [147778.050218]       Tainted: P           OE    4.15.0-36-generic #1
> [147778.057116] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [147778.066125] INFO: task cron:24933 blocked for more than 120 seconds.
> [147778.073317]       Tainted: P           OE    4.15.0-36-generic #1
> [147778.080225] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [147778.089664] INFO: task cron:24934 blocked for more than 120 seconds.
> [147778.096866]       Tainted: P           OE    4.15.0-36-generic #1
> [147778.103764] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [147778.113159] INFO: task cron:24935 blocked for more than 120 seconds.
> [147778.120360]       Tainted: P           OE    4.15.0-36-generic #1
> [147778.127267] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [147898.850483] INFO: task btrfs-transacti:440 blocked for more than 120 seconds.
> [147898.858569]       Tainted: P           OE    4.15.0-36-generic #1
> [147898.865490] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
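
For anyone triaging a longer console capture, here is a minimal sketch of how the hung-task reports above could be tallied per task. It is not part of our tooling; the function name `summarize_hung_tasks` and the abbreviated `SAMPLE` text are illustrative assumptions, and the pattern assumes task names contain no whitespace.

```python
import re
from collections import Counter

# Matches the hung-task detector's summary line, for example:
# "[147778.066125] INFO: task cron:24933 blocked for more than 120 seconds."
# Assumption: the task name contains no whitespace.
HUNG_TASK_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+INFO: task (?P<name>\S+?):(?P<pid>\d+) "
    r"blocked for more than\s+(?P<secs>\d+)\s+seconds"
)


def summarize_hung_tasks(log_text):
    """Return a Counter mapping (task name, pid) to the number of reports."""
    # Collapse all whitespace so a message wrapped across lines still matches.
    flat = " ".join(log_text.split())
    counts = Counter()
    for match in HUNG_TASK_RE.finditer(flat):
        counts[(match.group("name"), int(match.group("pid")))] += 1
    return counts


# Abbreviated copy of the console messages quoted above.
SAMPLE = """\
[147778.018904] INFO: task btrfs-transacti:440 blocked for more than 120 seconds.
[147778.042918] INFO: task auditd:1138 blocked for more than 120 seconds.
[147778.066125] INFO: task cron:24933 blocked for more than 120 seconds.
[147898.850483] INFO: task btrfs-transacti:440 blocked for more than 120 seconds.
"""

if __name__ == "__main__":
    for (name, pid), reports in summarize_hung_tasks(SAMPLE).most_common():
        print(f"{name} (pid {pid}): reported {reports} time(s)")
```

A task that is reported again roughly 120 seconds later, as btrfs-transacti:440 is here, has stayed blocked across detector passes rather than merely being slow once.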

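This sometimes happens during shutdown (reboot) or during system boot. We do have several file operations running in the background at those times. We were able to collect "dmesg" output from an ssh session that was still alive: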

    [ 1034.371130] perf: interrupt took too long (14517 > 14503), lowering kernel.perf_event_max_sample_rate to 13750
    [ 1157.254266] BTRFS info (device sdb3): disk space caching is enabled
    [ 1157.290278] BTRFS info (device sdb3): disk space caching is enabled
    [ 1157.312982] BTRFS info (device sdb3): disk space caching is enabled
    [ 1157.332657] BTRFS info (device sdb3): disk space caching is enabled
    [ 1248.012861] ata1.00: Enabling discard_zeroes_data
    [ 1248.020748]  sda: sda1
                    sda1: <solaris: [s0] sda5 [s2] sda6 [s8] sda7 >
    [ 1257.996027] BTRFS info (device sdb3): qgroup scan completed (inconsistency flag cleared)
    [ 1451.046237] INFO: task kworker/u24:0:5 blocked for more than 120 seconds.
    [ 1451.054217]       Tainted: P           OE    4.15.0-36-generic #1
    [ 1451.061152] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [ 1451.069963] kworker/u24:0   D    0     5      2 0x80000000
    [ 1451.070199] Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
    [ 1451.070257] Call Trace:
    [ 1451.070312]  __schedule+0x3d6/0x8b0
    [ 1451.070332]  ? blk_rq_map_sg+0x13c/0x550
    [ 1451.070336]  schedule+0x36/0x80
    [ 1451.070366]  btrfs_tree_lock+0xef/0x210 [btrfs]
    [ 1451.070381]  ? wait_woken+0x80/0x80
    [ 1451.070400]  btrfs_lock_root_node+0x34/0x50 [btrfs]
    [ 1451.070417]  btrfs_search_slot+0x914/0x9d0 [btrfs]
    [ 1451.070421]  ? update_load_avg+0x5e0/0x700
    [ 1451.070442]  btrfs_lookup_file_extent+0x49/0x60 [btrfs]
    [ 1451.070465]  __btrfs_drop_extents+0x19e/0xe60 [btrfs]
    [ 1451.070488]  ? __set_extent_bit+0x466/0x5a0 [btrfs]
    [ 1451.070510]  ? __set_extent_bit+0x466/0x5a0 [btrfs]
    [ 1451.070513]  ? _cond_resched+0x1a/0x50
    [ 1451.070515]  ? _cond_resched+0x1a/0x50
    [ 1451.070535]  insert_reserved_file_extent.constprop.68+0x90/0x2d0 [btrfs]
    [ 1451.070557]  ? start_transaction+0x9b/0x440 [btrfs]
    [ 1451.070578]  btrfs_finish_ordered_io+0x300/0x720 [btrfs]
    [ 1451.070601]  ? end_compressed_bio_write+0x102/0x140 [btrfs]
    [ 1451.070622]  finish_ordered_fn+0x15/0x20 [btrfs]
    [ 1451.070644]  normal_work_helper+0xcb/0x320 [btrfs]
    [ 1451.070667]  btrfs_endio_write_helper+0x12/0x20 [btrfs]
    [ 1451.070672]  process_one_work+0x14d/0x410
    [ 1451.070675]  worker_thread+0x4b/0x460
    [ 1451.070679]  kthread+0x105/0x140
    [ 1451.070680]  ? process_one_work+0x410/0x410
    [ 1451.070683]  ? kthread_destroy_worker+0x50/0x50
    [ 1451.070686]  ret_from_fork+0x35/0x40
    [ 1451.070733] INFO: task kworker/u24:1:101 blocked for more than 120 seconds.
    [ 1451.078559]       Tainted: P           OE    4.15.0-36-generic #1
    [ 1451.085426] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [ 1451.094228] kworker/u24:1   D    0   101      2 0x80000000
    [ 1451.094275] Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
    [ 1451.094278] Call Trace:
    [ 1451.094282]  __schedule+0x3d6/0x8b0
    [ 1451.094285]  schedule+0x36/0x80
    [ 1451.094307]  btrfs_tree_lock+0xef/0x210 [btrfs]
    [ 1451.094310]  ? wait_woken+0x80/0x80
    [ 1451.094327]  btrfs_lock_root_node+0x34/0x50 [btrfs]
    [ 1451.094343]  btrfs_search_slot+0x914/0x9d0 [btrfs]
    [ 1451.094363]  btrfs_lookup_file_extent+0x49/0x60 [btrfs]
    [ 1451.094384]  __btrfs_drop_extents+0x19e/0xe60 [btrfs]
    [ 1451.094407]  ? __set_extent_bit+0x466/0x5a0 [btrfs]
    [ 1451.094428]  ? __set_extent_bit+0x466/0x5a0 [btrfs]
    [ 1451.094431]  ? _cond_resched+0x1a/0x50
    [ 1451.094434]  ? _cond_resched+0x1a/0x50
    [ 1451.094455]  insert_reserved_file_extent.constprop.68+0x90/0x2d0 [btrfs]
    [ 1451.094475]  ? start_transaction+0x9b/0x440 [btrfs]
    [ 1451.094495]  btrfs_finish_ordered_io+0x300/0x720 [btrfs]
    [ 1451.094499]  ? __switch_to_asm+0x40/0x70
    [ 1451.094503]  ? __switch_to_asm+0x34/0x70
    [ 1451.094505]  ? __switch_to_asm+0x40/0x70
    [ 1451.094524]  finish_ordered_fn+0x15/0x20 [btrfs]
    [ 1451.094546]  normal_work_helper+0xcb/0x320 [btrfs]
    [ 1451.094569]  btrfs_endio_write_helper+0x12/0x20 [btrfs]
    [ 1451.094571]  process_one_work+0x14d/0x410
    [ 1451.094575]  worker_thread+0x4b/0x460
    [ 1451.094578]  kthread+0x105/0x140
    [ 1451.094580]  ? process_one_work+0x410/0x410
    [ 1451.094582]  ? kthread_destroy_worker+0x50/0x50
    [ 1451.094585]  ret_from_fork+0x35/0x40
    [ 1451.094603] INFO: task kworker/u24:2:271 blocked for more than 120 seconds.
    [ 1451.102453]       Tainted: P           OE    4.15.0-36-generic #1
    [ 1451.109330] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [ 1451.118138] kworker/u24:2   D    0   271      2 0x80000000
    [ 1451.118165] Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
    [ 1451.118173] Call Trace:
    [ 1451.118180]  __schedule+0x3d6/0x8b0
    [ 1451.118183]  schedule+0x36/0x80
    [ 1451.118205]  btrfs_tree_lock+0xef/0x210 [btrfs]
    [ 1451.118215]  ? wait_woken+0x80/0x80
    [ 1451.118231]  btrfs_lock_root_node+0x34/0x50 [btrfs]
    [ 1451.118252]  btrfs_search_slot+0x914/0x9d0 [btrfs]
    [ 1451.118273]  btrfs_lookup_file_extent+0x49/0x60 [btrfs]
    [ 1451.118296]  __btrfs_drop_extents+0x19e/0xe60 [btrfs]
    [ 1451.118319]  ? __set_extent_bit+0x466/0x5a0 [btrfs]
    [ 1451.118342]  ? __set_extent_bit+0x466/0x5a0 [btrfs]
    [ 1451.118346]  ? _cond_resched+0x1a/0x50
    [ 1451.118368]  insert_reserved_file_extent.constprop.68+0x90/0x2d0 [btrfs]
    [ 1451.118389]  ? start_transaction+0x9b/0x440 [btrfs]
    [ 1451.118412]  btrfs_finish_ordered_io+0x300/0x720 [btrfs]
    [ 1451.118435]  finish_ordered_fn+0x15/0x20 [btrfs]
    [ 1451.118459]  normal_work_helper+0xcb/0x320 [btrfs]
    [ 1451.118484]  btrfs_endio_write_helper+0x12/0x20 [btrfs]
    [ 1451.118487]  process_one_work+0x14d/0x410
    [ 1451.118490]  worker_thread+0x4b/0x460
    [ 1451.118493]  kthread+0x105/0x140
    [ 1451.118495]  ? process_one_work+0x410/0x410
    [ 1451.118498]  ? kthread_destroy_worker+0x50/0x50
    [ 1451.118501]  ret_from_fork+0x35/0x40
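
All three kworker traces above pass through the same frames down to btrfs_tree_lock. As a rough sketch of how that comparison could be automated on a larger dump, the script below extracts the reliable frames per blocked task and reports the innermost frames they share. The helper names (`traces_by_task`, `common_blocking_path`) and the shortened `SAMPLE` are illustrative assumptions, not an existing tool.

```python
import re

# Summary line that opens each hung-task report.
TASK_RE = re.compile(r"INFO: task (\S+?):(\d+) blocked")
# A stack frame such as "btrfs_tree_lock+0xef/0x210 [btrfs]". Frames that the
# kernel prefixes with "?" are unreliable guesses and are skipped below.
FRAME_RE = re.compile(r"\]\s+(\?\s+)?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+")


def traces_by_task(dmesg_text):
    """Map 'name:pid' of each blocked task to its reliable frames, innermost first."""
    traces = {}
    current = None
    for line in dmesg_text.splitlines():
        task = TASK_RE.search(line)
        if task:
            current = f"{task.group(1)}:{task.group(2)}"
            traces[current] = []
            continue
        frame = FRAME_RE.search(line)
        if frame and current is not None and frame.group(1) is None:
            traces[current].append(frame.group(2))
    return traces


def common_blocking_path(traces):
    """Return the frames shared by every task, counted from the innermost frame."""
    shared = []
    for frames in zip(*traces.values()):
        if len(set(frames)) != 1:
            break
        shared.append(frames[0])
    return shared


# Shortened excerpt of the dmesg output above (two tasks, first few frames).
SAMPLE = """\
[ 1451.046237] INFO: task kworker/u24:0:5 blocked for more than 120 seconds.
[ 1451.070257] Call Trace:
[ 1451.070312]  __schedule+0x3d6/0x8b0
[ 1451.070332]  ? blk_rq_map_sg+0x13c/0x550
[ 1451.070336]  schedule+0x36/0x80
[ 1451.070366]  btrfs_tree_lock+0xef/0x210 [btrfs]
[ 1451.070400]  btrfs_lock_root_node+0x34/0x50 [btrfs]
[ 1451.070733] INFO: task kworker/u24:1:101 blocked for more than 120 seconds.
[ 1451.094278] Call Trace:
[ 1451.094282]  __schedule+0x3d6/0x8b0
[ 1451.094285]  schedule+0x36/0x80
[ 1451.094307]  btrfs_tree_lock+0xef/0x210 [btrfs]
[ 1451.094327]  btrfs_lock_root_node+0x34/0x50 [btrfs]
"""

if __name__ == "__main__":
    print(" <- ".join(common_blocking_path(traces_by_task(SAMPLE))))
```

A shared path only shows where the workers are waiting; it does not identify which task holds the tree lock, so it narrows the search without confirming a deadlock.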

Below are our /etc/fstab entries, showing the various BTRFS mount options that we have already tried one at a time:

    UUID=3a899529-c5e4-42ae-a4ea-7b645f9bcb59 /               btrfs defaults,autodefrag,thread_pool=4,compress=lzo,noatime,subvol=@/netvisor-1 0       1
    UUID=3a899529-c5e4-42ae-a4ea-7b645f9bcb59 /home           btrfs defaults,autodefrag,thread_pool=4,compress=lzo,noatime,subvol=@home 0       2
    UUID=3a899529-c5e4-42ae-a4ea-7b645f9bcb59 /var/nvOS/log   btrfs defaults,autodefrag,thread_pool=4,compress=lzo,noatime,subvol=@var_nvOS_log 0       2
    UUID=3a899529-c5e4-42ae-a4ea-7b645f9bcb59 /.rootbe        btrfs defaults,autodefrag,thread_pool=4,compress=lzo,noatime,subvol=@ 0       1
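
To make it easier to see which options every subvolume shares and which differ, here is a small sketch that parses fstab-style text. The helper names (`parse_fstab`, `btrfs_options`) are hypothetical, and `SAMPLE` copies two of the entries above; in practice the text would come from /etc/fstab.

```python
def parse_fstab(text):
    """Yield one dict per fstab entry, skipping blank lines and comments."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 4:
            continue  # not a complete entry
        spec, mountpoint, fstype, options = fields[:4]
        yield {
            "spec": spec,
            "mountpoint": mountpoint,
            "fstype": fstype,
            "options": options.split(","),
        }


def btrfs_options(text):
    """Map each btrfs mount point to its list of mount options."""
    return {
        entry["mountpoint"]: entry["options"]
        for entry in parse_fstab(text)
        if entry["fstype"] == "btrfs"
    }


# Two of the entries listed above.
SAMPLE = """\
UUID=3a899529-c5e4-42ae-a4ea-7b645f9bcb59 /     btrfs defaults,autodefrag,thread_pool=4,compress=lzo,noatime,subvol=@/netvisor-1 0 1
UUID=3a899529-c5e4-42ae-a4ea-7b645f9bcb59 /home btrfs defaults,autodefrag,thread_pool=4,compress=lzo,noatime,subvol=@home 0 2
"""

if __name__ == "__main__":
    per_mount = btrfs_options(SAMPLE)
    shared = set.intersection(*(set(opts) for opts in per_mount.values()))
    print("shared:", ", ".join(sorted(shared)))
    for mountpoint, opts in per_mount.items():
        print(mountpoint, "only:", ", ".join(o for o in opts if o not in shared))
```

For these entries, only the subvol= option differs between mount points.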

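Does this indicate that we are hitting a known BTRFS filesystem deadlock that may have been fixed in a later kernel release? If so, how can we get those fixes into our current kernel? Upgrading the kernel is not an option for us at the moment. Are there any other BTRFS mount options or other system tunables we could try to work around this hang? Thanks in advance.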
