Debian/在加密 LVM 上测试 Linux - 文件系统密集型操作导致系统冻结并崩溃系统日志服务

Debian/在加密 LVM 上测试 Linux - 文件系统密集型操作导致系统冻结并崩溃系统日志服务

我在多台机器上使用 Debian/testing Linux。出于安全原因,我总是将其安装在加密的 LVM 上。通常我使用大小为 1TB 到 3TB 的 ext4 文件系统。

不幸的是,它会产生一些非常严重的副作用。

当我执行文件系统密集型操作时(例如,并行编译几个学生的 Buildroot 项目、以 tar.xz 格式存档 20 GB 的数据,或安装 Xilinx Vivado 等软件,该软件会将约 130 GB 的数据写入文件系统),系统会定期冻结约 2 分钟。当我运行 dmesg 时,我收到以下烦人的消息:

[ 8648.672075] systemd[1]: systemd-journald.service: Processes still around after final SIGKILL. Entering failed mode.
[ 8648.672083] systemd[1]: systemd-journald.service: Failed with result 'timeout'.
[ 8648.672140] systemd[1]: systemd-journald.service: Unit process 7708 (systemd-journal) remains running after unit stopped.
[ 8648.672299] systemd[1]: Failed to start Journal Service.
[ 8648.672679] systemd[1]: systemd-journald.service: Scheduled restart job, restart counter is at 14.
[ 8648.672911] systemd[1]: Stopped Journal Service.
[ 8648.672980] systemd[1]: systemd-journald.service: Found left-over process 7708 (systemd-journal) in control group while starting unit. Ignoring.
[ 8648.672983] systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
[ 8648.673699] systemd[1]: Starting Journal Service...
[ 8738.922289] systemd[1]: systemd-journald.service: start operation timed out. Terminating.
[ 8828.923063] systemd[1]: systemd-journald.service: State 'stop-sigterm' timed out. Killing.
[ 8828.923105] systemd[1]: systemd-journald.service: Killing process 7854 (systemd-journal) with signal SIGKILL.
[ 8828.923141] systemd[1]: systemd-journald.service: Killing process 7708 (systemd-journal) with signal SIGKILL.
[ 8919.173428] systemd[1]: systemd-journald.service: Processes still around after SIGKILL. Ignoring.
[ 9009.423787] systemd[1]: systemd-journald.service: State 'final-sigterm' timed out. Killing.
[ 9009.423831] systemd[1]: systemd-journald.service: Killing process 7854 (systemd-journal) with signal SIGKILL.
[ 9099.674142] systemd[1]: systemd-journald.service: Processes still around after final SIGKILL. Entering failed mode.
[ 9099.674173] systemd[1]: systemd-journald.service: Failed with result 'timeout'.
[ 9099.674241] systemd[1]: systemd-journald.service: Unit process 7854 (systemd-journal) remains running after unit stopped.
[ 9099.674477] systemd[1]: Failed to start Journal Service.
[ 9099.674924] systemd[1]: systemd-journald.service: Scheduled restart job, restart counter is at 15.
[ 9099.675102] systemd[1]: Stopped Journal Service.
[ 9099.675185] systemd[1]: systemd-journald.service: Found left-over process 7854 (systemd-journal) in control group while starting unit. Ignoring.
[ 9099.675209] systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
[ 9099.675958] systemd[1]: Starting Journal Service...

有趣的是,该问题在具有更大 RAM 的机器上更频繁地发生(我使用具有以下 RAM 容量的机器:16GB(Intel i7 CPU)、32GB(Intel i7 CPU)或 64GB(Intel Xeon E-2176M CPU))。

我发现了类似的旧问题报告,例如:

基于上述情况,我已检查我是否使用了截止期限调度程序:

# cat /sys/block/sda/queue/scheduler
[mq-deadline] none

我也使用相当新的内核:

# uname -a
Linux WZabHP 5.10.0-8-amd64 #1 SMP Debian 5.10.46-2 (2021-07-20) x86_64 GNU/Linux

我甚至尝试在其中一台机器(32 GB,i7 CPU)上切换到 RT 内核,但无济于事。更糟糕的是 - 冻结导致文件系统损坏(这种影响重复了两次,之后我停止尝试使用 RT 内核)。

上述问题可能由什么原因造成?我应该检查/调整哪些设置来解决这个问题?

更新

我发现另一组帖子表明虚拟内存的参数可能与所描述的问题有关:

不过,调整vm.dirty_background_ratio和 似乎vm.dirty_ratio需要分析磁盘的速度。

更新 2 在发生死亡问题之前journald,内核日志中会出现以下内容(这是第一个与超时相关的错误):

Jul 26 20:44:38 WZabHP kernel: [  484.449978] INFO: task StreamTrans #6:2335 blocked for more than 120 seconds.
Jul 26 20:44:38 WZabHP kernel: [  484.449981]       Not tainted 5.10.0-8-amd64 #1 Debian 5.10.46-2
Jul 26 20:44:38 WZabHP kernel: [  484.449981] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 26 20:44:38 WZabHP kernel: [  484.449982] task:StreamTrans #6  state:D stack:    0 pid: 2335 ppid:  1945 flags:0x00004000
Jul 26 20:44:38 WZabHP kernel: [  484.449985] Call Trace:
Jul 26 20:44:38 WZabHP kernel: [  484.449989]  __schedule+0x282/0x870
Jul 26 20:44:38 WZabHP kernel: [  484.449991]  schedule+0x46/0xb0
Jul 26 20:44:38 WZabHP kernel: [  484.449997]  jbd2_log_wait_commit+0xac/0x120 [jbd2]
Jul 26 20:44:38 WZabHP kernel: [  484.450000]  ? add_wait_queue_exclusive+0x70/0x70
Jul 26 20:44:38 WZabHP kernel: [  484.450013]  ext4_sync_file+0xd4/0x350 [ext4]
Jul 26 20:44:38 WZabHP kernel: [  484.450016]  __x64_sys_fsync+0x34/0x60
Jul 26 20:44:38 WZabHP kernel: [  484.450017]  do_syscall_64+0x33/0x80
Jul 26 20:44:38 WZabHP kernel: [  484.450019]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jul 26 20:44:38 WZabHP kernel: [  484.450021] RIP: 0033:0x7fd2cf3385eb
Jul 26 20:44:38 WZabHP kernel: [  484.450022] RSP: 002b:00007fd2b8d528c0 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
Jul 26 20:44:38 WZabHP kernel: [  484.450023] RAX: ffffffffffffffda RBX: 00007fd2af8539d0 RCX: 00007fd2cf3385eb
Jul 26 20:44:38 WZabHP kernel: [  484.450024] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 000000000000003f
Jul 26 20:44:38 WZabHP kernel: [  484.450025] RBP: 00007fd2af506ba0 R08: 0000000000000000 R09: 00007fd2cbbe4f28
Jul 26 20:44:38 WZabHP kernel: [  484.450025] R10: 000200050000002a R11: 0000000000000293 R12: 0000000000000000
Jul 26 20:44:38 WZabHP kernel: [  484.450026] R13: 00000000000000c4 R14: 00007fd2caa6a6f1 R15: 00000000000000c4
Jul 26 20:44:38 WZabHP kernel: [  484.450030] INFO: task mozStorage #2:2430 blocked for more than 120 seconds.
Jul 26 20:44:38 WZabHP kernel: [  484.450031]       Not tainted 5.10.0-8-amd64 #1 Debian 5.10.46-2
Jul 26 20:44:38 WZabHP kernel: [  484.450031] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 26 20:44:38 WZabHP kernel: [  484.450032] task:mozStorage #2   state:D stack:    0 pid: 2430 ppid:  1945 flags:0x00004000
Jul 26 20:44:38 WZabHP kernel: [  484.450033] Call Trace:
Jul 26 20:44:38 WZabHP kernel: [  484.450034]  __schedule+0x282/0x870
Jul 26 20:44:38 WZabHP kernel: [  484.450036]  schedule+0x46/0xb0
Jul 26 20:44:38 WZabHP kernel: [  484.450039]  jbd2_log_wait_commit+0xac/0x120 [jbd2]
Jul 26 20:44:38 WZabHP kernel: [  484.450040]  ? add_wait_queue_exclusive+0x70/0x70
Jul 26 20:44:38 WZabHP kernel: [  484.450047]  ext4_sync_file+0xd4/0x350 [ext4]
Jul 26 20:44:38 WZabHP kernel: [  484.450049]  __x64_sys_fsync+0x34/0x60
Jul 26 20:44:38 WZabHP kernel: [  484.450050]  do_syscall_64+0x33/0x80
Jul 26 20:44:38 WZabHP kernel: [  484.450051]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jul 26 20:44:38 WZabHP kernel: [  484.450052] RIP: 0033:0x7fd2cf3385eb
Jul 26 20:44:38 WZabHP kernel: [  484.450053] RSP: 002b:00007fd2b334c300 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
Jul 26 20:44:38 WZabHP kernel: [  484.450054] RAX: ffffffffffffffda RBX: 00007fd2b34a6858 RCX: 00007fd2cf3385eb
Jul 26 20:44:38 WZabHP kernel: [  484.450055] RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000000073
Jul 26 20:44:38 WZabHP kernel: [  484.450055] RBP: 0000000000000002 R08: 0000000000000000 R09: 0000000000000000
Jul 26 20:44:38 WZabHP kernel: [  484.450056] R10: 00007ffd1278a080 R11: 0000000000000293 R12: 00000000000001f7
Jul 26 20:44:38 WZabHP kernel: [  484.450057] R13: 00007fd2aeffbd40 R14: 00007fd2b34a67a0 R15: 000000000000004e
Jul 26 20:48:40 WZabHP kernel: [  726.109957] INFO: task jbd2/dm-2-8:591 blocked for more than 120 seconds.
Jul 26 20:48:40 WZabHP kernel: [  726.109959]       Not tainted 5.10.0-8-amd64 #1 Debian 5.10.46-2
Jul 26 20:48:40 WZabHP kernel: [  726.109960] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 26 20:48:40 WZabHP kernel: [  726.109961] task:jbd2/dm-2-8     state:D stack:    0 pid:  591 ppid:     2 flags:0x00004000
Jul 26 20:48:40 WZabHP kernel: [  726.109963] Call Trace:
Jul 26 20:48:40 WZabHP kernel: [  726.109967]  __schedule+0x282/0x870
Jul 26 20:48:40 WZabHP kernel: [  726.109969]  ? out_of_line_wait_on_bit_lock+0xb0/0xb0
Jul 26 20:48:40 WZabHP kernel: [  726.109970]  schedule+0x46/0xb0
Jul 26 20:48:40 WZabHP kernel: [  726.109971]  io_schedule+0x42/0x70
Jul 26 20:48:40 WZabHP kernel: [  726.109973]  bit_wait_io+0xd/0x50
Jul 26 20:48:40 WZabHP kernel: [  726.109974]  __wait_on_bit+0x2a/0x90
Jul 26 20:48:40 WZabHP kernel: [  726.109975]  out_of_line_wait_on_bit+0x92/0xb0
Jul 26 20:48:40 WZabHP kernel: [  726.109977]  ? var_wake_function+0x20/0x20
Jul 26 20:48:40 WZabHP kernel: [  726.109982]  jbd2_journal_commit_transaction+0x16b3/0x1ad0 [jbd2]
Jul 26 20:48:40 WZabHP kernel: [  726.109987]  kjournald2+0xab/0x270 [jbd2]
Jul 26 20:48:40 WZabHP kernel: [  726.109988]  ? add_wait_queue_exclusive+0x70/0x70
Jul 26 20:48:40 WZabHP kernel: [  726.109991]  ? load_superblock.part.0+0xb0/0xb0 [jbd2]
Jul 26 20:48:40 WZabHP kernel: [  726.109993]  kthread+0x11b/0x140
Jul 26 20:48:40 WZabHP kernel: [  726.109994]  ? __kthread_bind_mask+0x60/0x60
Jul 26 20:48:40 WZabHP kernel: [  726.109996]  ret_from_fork+0x22/0x30
Jul 26 20:48:40 WZabHP kernel: [  726.109999] INFO: task journal-offline:2579 blocked for more than 120 seconds.
Jul 26 20:48:40 WZabHP kernel: [  726.109999]       Not tainted 5.10.0-8-amd64 #1 Debian 5.10.46-2
Jul 26 20:48:40 WZabHP kernel: [  726.110000] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 26 20:48:40 WZabHP kernel: [  726.110001] task:journal-offline state:D stack:    0 pid: 2579 ppid:     1 flags:0x00004324
Jul 26 20:48:40 WZabHP kernel: [  726.110002] Call Trace:
Jul 26 20:48:40 WZabHP kernel: [  726.110003]  __schedule+0x282/0x870
Jul 26 20:48:40 WZabHP kernel: [  726.110004]  schedule+0x46/0xb0
Jul 26 20:48:40 WZabHP kernel: [  726.110007]  jbd2_log_wait_commit+0xac/0x120 [jbd2]
Jul 26 20:48:40 WZabHP kernel: [  726.110009]  ? add_wait_queue_exclusive+0x70/0x70
Jul 26 20:48:40 WZabHP kernel: [  726.110021]  ext4_sync_file+0xd4/0x350 [ext4]
Jul 26 20:48:40 WZabHP kernel: [  726.110024]  __x64_sys_fsync+0x34/0x60
Jul 26 20:48:40 WZabHP kernel: [  726.110025]  do_syscall_64+0x33/0x80
Jul 26 20:48:40 WZabHP kernel: [  726.110027]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jul 26 20:48:40 WZabHP kernel: [  726.110029] RIP: 0033:0x7f68c72ebabb
Jul 26 20:48:40 WZabHP kernel: [  726.110030] RSP: 002b:00007f68be229cf0 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
Jul 26 20:48:40 WZabHP kernel: [  726.110031] RAX: ffffffffffffffda RBX: 000056427bbd8870 RCX: 00007f68c72ebabb
Jul 26 20:48:40 WZabHP kernel: [  726.110032] RDX: 00007f68c5a2b000 RSI: 00007f68c7649414 RDI: 000000000000001a
Jul 26 20:48:40 WZabHP kernel: [  726.110032] RBP: 00007f68c764bd30 R08: 0000000000000000 R09: 00007f68be22a700
Jul 26 20:48:40 WZabHP kernel: [  726.110033] R10: 0000000000000014 R11: 0000000000000293 R12: 0000000000000002
Jul 26 20:48:40 WZabHP kernel: [  726.110034] R13: 00007ffe691a51df R14: 00007f68be229e00 R15: 000056427bbf9920
Jul 26 20:48:40 WZabHP kernel: [  726.110035] INFO: task journal-offline:2580 blocked for more than 120 seconds.
Jul 26 20:48:40 WZabHP kernel: [  726.110036]       Not tainted 5.10.0-8-amd64 #1 Debian 5.10.46-2
Jul 26 20:48:40 WZabHP kernel: [  726.110036] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 26 20:48:40 WZabHP kernel: [  726.110037] task:journal-offline state:D stack:    0 pid: 2580 ppid:     1 flags:0x00004324
Jul 26 20:48:40 WZabHP kernel: [  726.110038] Call Trace:
Jul 26 20:48:40 WZabHP kernel: [  726.110039]  __schedule+0x282/0x870
Jul 26 20:48:40 WZabHP kernel: [  726.110040]  schedule+0x46/0xb0
Jul 26 20:48:40 WZabHP kernel: [  726.110043]  jbd2_log_wait_commit+0xac/0x120 [jbd2]
Jul 26 20:48:40 WZabHP kernel: [  726.110045]  ? add_wait_queue_exclusive+0x70/0x70
Jul 26 20:48:40 WZabHP kernel: [  726.110051]  ext4_sync_file+0xd4/0x350 [ext4]
Jul 26 20:48:40 WZabHP kernel: [  726.110053]  __x64_sys_fsync+0x34/0x60
Jul 26 20:48:40 WZabHP kernel: [  726.110054]  do_syscall_64+0x33/0x80
Jul 26 20:48:40 WZabHP kernel: [  726.110056]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jul 26 20:48:40 WZabHP kernel: [  726.110056] RIP: 0033:0x7f68c72ebabb
Jul 26 20:48:40 WZabHP kernel: [  726.110057] RSP: 002b:00007f68bda28cf0 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
Jul 26 20:48:40 WZabHP kernel: [  726.110058] RAX: ffffffffffffffda RBX: 000056427bc095d0 RCX: 00007f68c72ebabb
Jul 26 20:48:40 WZabHP kernel: [  726.110059] RDX: 00007f68c622b000 RSI: 00007f68c7649414 RDI: 0000000000000026
Jul 26 20:48:40 WZabHP kernel: [  726.110059] RBP: 00007f68c764bd30 R08: 0000000000000000 R09: 00007f68bda29700
Jul 26 20:48:40 WZabHP kernel: [  726.110060] R10: 0000000000000014 R11: 0000000000000293 R12: 0000000000000002
Jul 26 20:48:40 WZabHP kernel: [  726.110061] R13: 00007ffe691a51df R14: 00007f68bda28e00 R15: 0000000000802000
Jul 26 20:48:40 WZabHP kernel: [  726.110064] INFO: task NetworkManager:1090 blocked for more than 120 seconds.
Jul 26 20:48:40 WZabHP kernel: [  726.110065]       Not tainted 5.10.0-8-amd64 #1 Debian 5.10.46-2
Jul 26 20:48:40 WZabHP kernel: [  726.110065] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 26 20:48:40 WZabHP kernel: [  726.110066] task:NetworkManager  state:D stack:    0 pid: 1090 ppid:     1 flags:0x00004000
Jul 26 20:48:40 WZabHP kernel: [  726.110067] Call Trace:
Jul 26 20:48:40 WZabHP kernel: [  726.110068]  __schedule+0x282/0x870
Jul 26 20:48:40 WZabHP kernel: [  726.110069]  schedule+0x46/0xb0
Jul 26 20:48:40 WZabHP kernel: [  726.110072]  jbd2_log_wait_commit+0xac/0x120 [jbd2]
Jul 26 20:48:40 WZabHP kernel: [  726.110073]  ? add_wait_queue_exclusive+0x70/0x70
Jul 26 20:48:40 WZabHP kernel: [  726.110080]  ext4_sync_file+0xd4/0x350 [ext4]
Jul 26 20:48:40 WZabHP kernel: [  726.110081]  __x64_sys_fsync+0x34/0x60
Jul 26 20:48:40 WZabHP kernel: [  726.110083]  do_syscall_64+0x33/0x80
Jul 26 20:48:40 WZabHP kernel: [  726.110084]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jul 26 20:48:40 WZabHP kernel: [  726.110085] RIP: 0033:0x7fd15da1c5eb
Jul 26 20:48:40 WZabHP kernel: [  726.110085] RSP: 002b:00007ffedee571c0 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
Jul 26 20:48:40 WZabHP kernel: [  726.110086] RAX: ffffffffffffffda RBX: 00007fd15cd06928 RCX: 00007fd15da1c5eb
Jul 26 20:48:40 WZabHP kernel: [  726.110087] RDX: 0000000000000184 RSI: 000055c28ee17560 RDI: 000000000000001a
Jul 26 20:48:40 WZabHP kernel: [  726.110088] RBP: 000000000000001a R08: 0000000000000000 R09: 00007ffedee572d0
Jul 26 20:48:40 WZabHP kernel: [  726.110088] R10: 0000000000000184 R11: 0000000000000293 R12: 00007ffedee572d0
Jul 26 20:48:40 WZabHP kernel: [  726.110089] R13: 000055c28ed78fd0 R14: 000055c28ee176e4 R15: 0000000000000000
Jul 26 20:48:40 WZabHP kernel: [  726.110104] INFO: task Permission:2299 blocked for more than 120 seconds.
Jul 26 20:48:40 WZabHP kernel: [  726.110105]       Not tainted 5.10.0-8-amd64 #1 Debian 5.10.46-2
Jul 26 20:48:40 WZabHP kernel: [  726.110105] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 26 20:48:40 WZabHP kernel: [  726.110106] task:Permission      state:D stack:    0 pid: 2299 ppid:  1945 flags:0x00004000
Jul 26 20:48:40 WZabHP kernel: [  726.110107] Call Trace:
Jul 26 20:48:40 WZabHP kernel: [  726.110108]  __schedule+0x282/0x870
Jul 26 20:48:40 WZabHP kernel: [  726.110109]  schedule+0x46/0xb0
Jul 26 20:48:40 WZabHP kernel: [  726.110112]  jbd2_log_wait_commit+0xac/0x120 [jbd2]
Jul 26 20:48:40 WZabHP kernel: [  726.110113]  ? add_wait_queue_exclusive+0x70/0x70
Jul 26 20:48:40 WZabHP kernel: [  726.110120]  ext4_sync_file+0xd4/0x350 [ext4]
Jul 26 20:48:40 WZabHP kernel: [  726.110122]  __x64_sys_fsync+0x34/0x60
Jul 26 20:48:40 WZabHP kernel: [  726.110123]  do_syscall_64+0x33/0x80
Jul 26 20:48:40 WZabHP kernel: [  726.110124]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jul 26 20:48:40 WZabHP kernel: [  726.110125] RIP: 0033:0x7fd2cf3385eb
Jul 26 20:48:40 WZabHP kernel: [  726.110125] RSP: 002b:00007fd2beffe3e0 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
Jul 26 20:48:40 WZabHP kernel: [  726.110126] RAX: ffffffffffffffda RBX: 00007fd2bf0582e0 RCX: 00007fd2cf3385eb
Jul 26 20:48:40 WZabHP kernel: [  726.110127] RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000000089
Jul 26 20:48:40 WZabHP kernel: [  726.110128] RBP: 0000000000000002 R08: 0000000000000000 R09: 00007fd2cec47af0
Jul 26 20:48:40 WZabHP kernel: [  726.110128] R10: 00007ffd1278a080 R11: 0000000000000293 R12: 00000000000001f5
Jul 26 20:48:40 WZabHP kernel: [  726.110129] R13: 0000000000000000 R14: 0000000000010400 R15: 000000000000003f
Jul 26 20:48:40 WZabHP kernel: [  726.110131] INFO: task StreamTrans #3:2319 blocked for more than 120 seconds.
Jul 26 20:48:40 WZabHP kernel: [  726.110132]       Not tainted 5.10.0-8-amd64 #1 Debian 5.10.46-2
Jul 26 20:48:40 WZabHP kernel: [  726.110132] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 26 20:48:40 WZabHP kernel: [  726.110133] task:StreamTrans #3  state:D stack:    0 pid: 2319 ppid:  1945 flags:0x00004000
Jul 26 20:48:40 WZabHP kernel: [  726.110134] Call Trace:
Jul 26 20:48:40 WZabHP kernel: [  726.110135]  __schedule+0x282/0x870
Jul 26 20:48:40 WZabHP kernel: [  726.110136]  schedule+0x46/0xb0
Jul 26 20:48:40 WZabHP kernel: [  726.110139]  jbd2_log_wait_commit+0xac/0x120 [jbd2]
Jul 26 20:48:40 WZabHP kernel: [  726.110140]  ? add_wait_queue_exclusive+0x70/0x70
Jul 26 20:48:40 WZabHP kernel: [  726.110146]  ext4_sync_file+0xd4/0x350 [ext4]
Jul 26 20:48:40 WZabHP kernel: [  726.110148]  __x64_sys_fsync+0x34/0x60
Jul 26 20:48:40 WZabHP kernel: [  726.110149]  do_syscall_64+0x33/0x80
Jul 26 20:48:40 WZabHP kernel: [  726.110150]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jul 26 20:48:40 WZabHP kernel: [  726.110151] RIP: 0033:0x7fd2cf3385eb
Jul 26 20:48:40 WZabHP kernel: [  726.110152] RSP: 002b:00007fd2bba8e8c0 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
Jul 26 20:48:40 WZabHP kernel: [  726.110153] RAX: ffffffffffffffda RBX: 00007fd2af8539d0 RCX: 00007fd2cf3385eb
Jul 26 20:48:40 WZabHP kernel: [  726.110153] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 000000000000003f
Jul 26 20:48:40 WZabHP kernel: [  726.110154] RBP: 00007fd2aecd16d0 R08: 0000000000000000 R09: 00007fd2cbbe4f28
Jul 26 20:48:40 WZabHP kernel: [  726.110155] R10: 0002000500000049 R11: 0000000000000293 R12: 0000000000000000
Jul 26 20:48:40 WZabHP kernel: [  726.110155] R13: 00000000000000c4 R14: 00007fd2caa6a6f1 R15: 00000000000000c4
Jul 26 20:48:40 WZabHP kernel: [  726.110157] INFO: task QuotaManager IO:2333 blocked for more than 120 seconds.
Jul 26 20:48:40 WZabHP kernel: [  726.110158]       Not tainted 5.10.0-8-amd64 #1 Debian 5.10.46-2
Jul 26 20:48:40 WZabHP kernel: [  726.110158] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 26 20:48:40 WZabHP kernel: [  726.110159] task:QuotaManager IO state:D stack:    0 pid: 2333 ppid:  1945 flags:0x00004000
Jul 26 20:48:40 WZabHP kernel: [  726.110160] Call Trace:
Jul 26 20:48:40 WZabHP kernel: [  726.110161]  __schedule+0x282/0x870
Jul 26 20:48:40 WZabHP kernel: [  726.110162]  schedule+0x46/0xb0
Jul 26 20:48:40 WZabHP kernel: [  726.110165]  jbd2_log_wait_commit+0xac/0x120 [jbd2]
Jul 26 20:48:40 WZabHP kernel: [  726.110166]  ? add_wait_queue_exclusive+0x70/0x70
Jul 26 20:48:40 WZabHP kernel: [  726.110172]  ext4_sync_file+0xd4/0x350 [ext4]
Jul 26 20:48:40 WZabHP kernel: [  726.110173]  __x64_sys_fsync+0x34/0x60
Jul 26 20:48:40 WZabHP kernel: [  726.110175]  do_syscall_64+0x33/0x80
Jul 26 20:48:40 WZabHP kernel: [  726.110176]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jul 26 20:48:40 WZabHP kernel: [  726.110177] RIP: 0033:0x7fd2cf3385eb
Jul 26 20:48:40 WZabHP kernel: [  726.110177] RSP: 002b:00007fd2b8d93020 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
Jul 26 20:48:40 WZabHP kernel: [  726.110178] RAX: ffffffffffffffda RBX: 00007fd2ba6f52e0 RCX: 00007fd2cf3385eb
Jul 26 20:48:40 WZabHP kernel: [  726.110179] RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000000047
Jul 26 20:48:40 WZabHP kernel: [  726.110179] RBP: 0000000000000002 R08: 0000000000000000 R09: 00007fd2b8c1b910
Jul 26 20:48:40 WZabHP kernel: [  726.110180] R10: 00007ffd1278a080 R11: 0000000000000293 R12: 00000000000001f5
Jul 26 20:48:40 WZabHP kernel: [  726.110180] R13: 0000000000000000 R14: 0000000000000a00 R15: 000000000000003b
Jul 26 20:48:40 WZabHP kernel: [  726.110184] INFO: task mozStorage #2:2430 blocked for more than 120 seconds.
Jul 26 20:48:40 WZabHP kernel: [  726.110185]       Not tainted 5.10.0-8-amd64 #1 Debian 5.10.46-2
Jul 26 20:48:40 WZabHP kernel: [  726.110185] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 26 20:48:40 WZabHP kernel: [  726.110186] task:mozStorage #2   state:D stack:    0 pid: 2430 ppid:  1945 flags:0x00000000
Jul 26 20:48:40 WZabHP kernel: [  726.110187] Call Trace:
Jul 26 20:48:40 WZabHP kernel: [  726.110188]  __schedule+0x282/0x870
Jul 26 20:48:40 WZabHP kernel: [  726.110189]  schedule+0x46/0xb0
Jul 26 20:48:40 WZabHP kernel: [  726.110192]  jbd2_log_wait_commit+0xac/0x120 [jbd2]
Jul 26 20:48:40 WZabHP kernel: [  726.110193]  ? add_wait_queue_exclusive+0x70/0x70
Jul 26 20:48:40 WZabHP kernel: [  726.110196]  __jbd2_journal_force_commit+0x5d/0xb0 [jbd2]
Jul 26 20:48:40 WZabHP kernel: [  726.110199]  jbd2_journal_force_commit+0x1d/0x30 [jbd2]
Jul 26 20:48:40 WZabHP kernel: [  726.110205]  ext4_sync_file+0x2c4/0x350 [ext4]
Jul 26 20:48:40 WZabHP kernel: [  726.110206]  __x64_sys_fsync+0x34/0x60
Jul 26 20:48:40 WZabHP kernel: [  726.110207]  do_syscall_64+0x33/0x80
Jul 26 20:48:40 WZabHP kernel: [  726.110209]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jul 26 20:48:40 WZabHP kernel: [  726.110209] RIP: 0033:0x7fd2cf3385eb
Jul 26 20:48:40 WZabHP kernel: [  726.110210] RSP: 002b:00007fd2b334c300 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
Jul 26 20:48:40 WZabHP kernel: [  726.110211] RAX: ffffffffffffffda RBX: 00007fd2b34a6858 RCX: 00007fd2cf3385eb
Jul 26 20:48:40 WZabHP kernel: [  726.110212] RDX: 00000000000a0000 RSI: 00007fd2b334c0f0 RDI: 0000000000000072
Jul 26 20:48:40 WZabHP kernel: [  726.110212] RBP: 0000000000000002 R08: 0000000000000000 R09: 000000000000003e
Jul 26 20:48:40 WZabHP kernel: [  726.110213] R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000000000
Jul 26 20:48:40 WZabHP kernel: [  726.110213] R13: 00007fd2aeffbd40 R14: 00007fd2b34a67a0 R15: 000000000000004e

因此,问题与文件系统的低同步有关。磁盘吞吐量足够。但是,问题出现在带有加密卷的安装中。CPU 性能(尤其是在 Xeon 机器的情况下)足以足够快地执行加密。

加密所需的另一种稀缺资源是真正的随机数据。事实上,我现在可以访问的系统似乎没有安装 rng-tools5 或 rng-tools-debian。这可能是问题的根源。

答案1

我发现了类似的问题https://askubuntu.com/questions/1406444/what-is-causing-my-system-to-stall-freeze-corrupt-data-when-using-lvm-luks

这里描述的修改之一已经消除了大规模写入时系统冻结的情况。

cryptsetup  --perf-no_write_workqueue refresh name_of_the_mapping

(就我而言是cryptsetup --perf-no_write_workqueue refresh sda6_crypt:)

可以通过执行以下操作永久设置该选项:

cryptsetup  --perf-no_write_workqueue --persistent refresh name_of_the_mapping

我已将问题报告为错误https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1016474,但这是否是cryptsetup包中的一个错误是值得怀疑的。

因此完整的解决方案似乎是:

  1. 安装haveged软件包
  2. 使用--perf-no_write_workqueue加密 LVM 映射中的选项。

答案2

问题似乎与系统中熵的缺乏有关(如评论中所述)。这解释了为什么问题会出现在加密 LVM 的系统中。写入数据的加密需要真正的随机字节。

我已经安装了rng-tools5haveged(按照@michalng 的建议,谢谢!),问题似乎有所缓解。但是,还需要进一步彻底测试才能确认这确实是该问题的正确解决方案。

相关内容