We have 3 servers in a gluster pool. Each machine has the following specs (OVH's Advance STOR-2 Gen 2):
- AMD Ryzen 7 Pro 3700 - 8c/16t - 3.6 GHz/4.4 GHz
- 6 x 14 TB disks (WD DC HC530, CMR) - Western Digital documentation
- 2 additional system drives
- 64 GB ECC 2933 MHz
The specs below are from one machine, but the others should be similar if not identical.
System:
- Ubuntu 22.04.1 LTS
- 5.15.0-69-generic
zfs version:
- zfs-2.1.5-1ubuntu6~22.04.1
- zfs-kmod-2.1.5-1ubuntu6~22.04.1
Controller:
2b:00.0 Mass storage controller [0180]: Broadcom / LSI SAS3008 PCI-Express Fusion-MPT SAS-3 [1000:0097] (rev 02)
	Subsystem: Broadcom / LSI SAS3008 PCI-Express Fusion-MPT SAS-3 [1000:1000]
	Kernel driver in use: mpt3sas
	Kernel modules: mpt3sas
I don't know how the controller is configured (whether it's an HBA), but:
- no hardware RAID is available on this kind of machine
- I have access to each drive's SMART data (so if it's not an HBA, it's at least JBOD)
We had a drive fail and replaced it. That's when we noticed a problem:
#> zpool status; echo; date
pool: storage
state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Tue Apr 18 20:44:53 2023
17.4T scanned at 27.9M/s, 17.4T issued at 27.9M/s, 17.7T total
3.31T resilvered, 98.12% done, 03:29:17 to go
config:
  NAME                                      STATE     READ WRITE CKSUM
  storage                                   DEGRADED     0     0     0
    raidz2-0                                DEGRADED     0     0     0
      wwn-0x5000cca2ad235164                ONLINE       0     0     0
      wwn-0x5000cca28f4ec59c                ONLINE       0     0     0
      wwn-0x5000cca2ad29cc1c                ONLINE       0     0     0
      wwn-0x5000cca2a31743d4                ONLINE       0     0     0
      wwn-0x5000cca2a40f9b00                ONLINE       0     0     0
      replacing-5                           DEGRADED     0     0     0
        9949261471066455025                 UNAVAIL      0     0     0  was /dev/disk/by-id/wwn-0x5000cca2ad2eba3c-part1
        scsi-SWDC_WUH721414AL5201_9LKLGWSG  ONLINE       0     0     0  (resilvering)
errors: No known data errors
Wed Apr 26 10:44:02 UTC 2023
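A quick sanity check on those numbers: at the reported scan rate, the whole pool takes over a week to resilver end to end, which is consistent with it having started on Apr 18 and still running on Apr 26. A throwaway sketch with the figures copied from the status output above:

```python
# Figures from the zpool status above: 17.7T total to scan at 27.9M/s.
total_tib = 17.7
rate_mib_s = 27.9
days = total_tib * 1024 * 1024 / rate_mib_s / 86400
print(f"{days:.1f} days")  # ≈ 7.7 days end to end at this rate
```

So the scan rate itself, not the amount of data, is what makes this take so long.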
Forgot to show you the pool usage:
zpool list
NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
storage 76.4T 18.3T 58.1T - - 15% 23% 1.00x ONLINE -
Fri Apr 28 14:32:58 UTC 2023
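As a side note on that output: for raidz vdevs, zpool list's SIZE column is raw capacity, parity included, so six 14 TB (decimal) disks land right on the 76.4T shown. A quick check:

```python
# 6 x 14 TB (decimal) disks, expressed in TiB as zpool prints it.
raw_tib = 6 * 14e12 / 2**40
print(f"{raw_tib:.1f} TiB")  # 76.4 TiB, matching the SIZE column
```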
Although gluster is running alongside, I don't think it generates much throughput.
Next, the extremely high latency. In dmesg I see outputs like these:
[Tue Apr 25 10:30:31 2023] INFO: task txg_sync:1985 blocked for more than 120 seconds.
[Tue Apr 25 10:30:31 2023] Tainted: P O 5.15.0-69-generic #76-Ubuntu
[Tue Apr 25 10:30:31 2023] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Tue Apr 25 10:30:31 2023] task:txg_sync state:D stack: 0 pid: 1985 ppid: 2 flags:0x00004000
[Tue Apr 25 10:30:31 2023] Call Trace:
[Tue Apr 25 10:30:31 2023] <TASK>
[Tue Apr 25 10:30:31 2023] __schedule+0x24e/0x590
[Tue Apr 25 10:30:31 2023] schedule+0x69/0x110
[Tue Apr 25 10:30:31 2023] cv_wait_common+0xf8/0x130 [spl]
[Tue Apr 25 10:30:31 2023] ? wait_woken+0x70/0x70
[Tue Apr 25 10:30:31 2023] __cv_wait+0x15/0x20 [spl]
[Tue Apr 25 10:30:31 2023] arc_read+0x1e1/0x15c0 [zfs]
[Tue Apr 25 10:30:31 2023] ? arc_evict_cb_check+0x20/0x20 [zfs]
[Tue Apr 25 10:30:31 2023] dsl_scan_visitbp+0x4f5/0xcf0 [zfs]
[Tue Apr 25 10:30:31 2023] dsl_scan_visitbp+0x333/0xcf0 [zfs]
[Tue Apr 25 10:30:31 2023] dsl_scan_visitbp+0x333/0xcf0 [zfs]
[Tue Apr 25 10:30:31 2023] dsl_scan_visitbp+0x333/0xcf0 [zfs]
[Tue Apr 25 10:30:31 2023] dsl_scan_visitbp+0x333/0xcf0 [zfs]
[Tue Apr 25 10:30:31 2023] dsl_scan_visitbp+0x333/0xcf0 [zfs]
[Tue Apr 25 10:30:31 2023] dsl_scan_visitbp+0x813/0xcf0 [zfs]
[Tue Apr 25 10:30:31 2023] dsl_scan_visit_rootbp+0xe8/0x160 [zfs]
[Tue Apr 25 10:30:31 2023] dsl_scan_visitds+0x15d/0x4b0 [zfs]
[Tue Apr 25 10:30:31 2023] ? __kmalloc_node+0x166/0x3a0
[Tue Apr 25 10:30:31 2023] ? do_raw_spin_unlock+0x9/0x10 [spl]
[Tue Apr 25 10:30:31 2023] ? __raw_spin_unlock+0x9/0x10 [spl]
[Tue Apr 25 10:30:31 2023] ? __list_add+0x17/0x40 [spl]
[Tue Apr 25 10:30:31 2023] ? do_raw_spin_unlock+0x9/0x10 [spl]
[Tue Apr 25 10:30:31 2023] ? __raw_spin_unlock+0x9/0x10 [spl]
[Tue Apr 25 10:30:31 2023] ? tsd_hash_add+0x145/0x180 [spl]
[Tue Apr 25 10:30:31 2023] ? tsd_set+0x98/0xd0 [spl]
[Tue Apr 25 10:30:31 2023] dsl_scan_visit+0x1ae/0x2c0 [zfs]
[Tue Apr 25 10:30:31 2023] dsl_scan_sync+0x412/0x910 [zfs]
[Tue Apr 25 10:30:31 2023] spa_sync_iterate_to_convergence+0x124/0x1f0 [zfs]
[Tue Apr 25 10:30:31 2023] spa_sync+0x2dc/0x5b0 [zfs]
[Tue Apr 25 10:30:31 2023] txg_sync_thread+0x266/0x2f0 [zfs]
[Tue Apr 25 10:30:31 2023] ? txg_dispatch_callbacks+0x100/0x100 [zfs]
[Tue Apr 25 10:30:31 2023] thread_generic_wrapper+0x64/0x80 [spl]
[Tue Apr 25 10:30:31 2023] ? __thread_exit+0x20/0x20 [spl]
[Tue Apr 25 10:30:31 2023] kthread+0x12a/0x150
[Tue Apr 25 10:30:31 2023] ? set_kthread_struct+0x50/0x50
[Tue Apr 25 10:30:31 2023] ret_from_fork+0x22/0x30
[Tue Apr 25 10:30:31 2023] </TASK>
Not very frequent, but still too many (uptime: 11:10:30 up 7 days, 15:27):
[Thu Apr 20 07:47:49 2023] INFO: task txg_sync:1985 blocked for more than 120 seconds.
[Thu Apr 20 09:08:22 2023] INFO: task txg_sync:1985 blocked for more than 120 seconds.
[Thu Apr 20 09:38:35 2023] INFO: task txg_sync:1985 blocked for more than 120 seconds.
[Thu Apr 20 10:16:51 2023] INFO: task txg_sync:1985 blocked for more than 120 seconds.
[Thu Apr 20 10:26:55 2023] INFO: task txg_sync:1985 blocked for more than 120 seconds.
[Fri Apr 21 07:57:48 2023] INFO: task txg_sync:1985 blocked for more than 120 seconds.
[Fri Apr 21 08:58:13 2023] INFO: task txg_sync:1985 blocked for more than 120 seconds.
[Fri Apr 21 09:32:27 2023] INFO: task txg_sync:1985 blocked for more than 120 seconds.
[Fri Apr 21 10:00:39 2023] INFO: task txg_sync:1985 blocked for more than 120 seconds.
[Tue Apr 25 10:30:31 2023] INFO: task txg_sync:1985 blocked for more than 120 seconds.
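For what it's worth, all ten warnings fall between roughly 07:45 and 10:30, which makes me suspect they correlate with some morning workload rather than being spread evenly. Tallying them per day with a throwaway sketch (timestamps copied verbatim from the lines above):

```python
# Tally the hung-task warnings above per day.
from collections import Counter

stamps = """\
Thu Apr 20 07:47:49
Thu Apr 20 09:08:22
Thu Apr 20 09:38:35
Thu Apr 20 10:16:51
Thu Apr 20 10:26:55
Fri Apr 21 07:57:48
Fri Apr 21 08:58:13
Fri Apr 21 09:32:27
Fri Apr 21 10:00:39
Tue Apr 25 10:30:31
""".splitlines()
per_day = Counter(" ".join(s.split()[1:3]) for s in stamps)
print(dict(per_day))  # {'Apr 20': 5, 'Apr 21': 4, 'Apr 25': 1}
```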
The disks are fairly young, nothing noteworthy here:
smart_WUH721414AL5201_81G8L1SV.log : power on : 259 days, 21:21:00
smart_WUH721414AL5201_9LHD9Z7G.log : power on : 197 days, 19:08:00
smart_WUH721414AL5201_9LKLGWSG.log : power on : 7 days, 17:25:00
smart_WUH721414AL5201_QBGDTMXT.log : power on : 255 days, 21:44:00
smart_WUH721414AL5201_Y6GME43C.log : power on : 346 days, 22:59:00
smart_WUH721414AL5201_Y6GRZLKC.log : power on : 197 days, 12:56:00
iostat isn't alarming (although it's an averaged output; iowait doesn't go above 10, 15% at most):
avg-cpu: %user %nice %system %iowait %steal %idle
0.13 0.00 1.49 6.65 0.00 91.73
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s wrqm/s %wrqm w_await wareq-sz d/s dkB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util
sda 369.80 6115.05 0.04 0.01 5.25 16.54 10.81 328.36 0.01 0.06 0.77 30.38 0.00 0.00 0.00 0.00 0.00 0.00 0.19 7.87 1.95 80.10
sdb 412.40 6215.40 0.02 0.01 5.43 15.07 10.69 328.41 0.01 0.06 0.78 30.72 0.00 0.00 0.00 0.00 0.00 0.00 0.19 10.06 2.25 88.23
sdc 395.09 6004.60 0.02 0.00 5.49 15.20 10.72 328.56 0.01 0.07 0.78 30.66 0.00 0.00 0.00 0.00 0.00 0.00 0.19 8.66 2.18 85.42
sdd 412.57 6229.91 0.02 0.01 5.57 15.10 10.34 328.57 0.01 0.05 0.84 31.77 0.00 0.00 0.00 0.00 0.00 0.00 0.19 14.77 2.31 90.34
sde 374.34 6150.81 0.03 0.01 5.33 16.43 10.74 328.43 0.01 0.06 0.78 30.58 0.00 0.00 0.00 0.00 0.00 0.00 0.19 8.47 2.01 81.84
sdf 25.72 113.11 0.00 0.00 2.82 4.40 219.12 5713.02 0.09 0.04 1.25 26.07 0.00 0.00 0.00 0.00 0.00 0.00 0.18 49.05 0.36 27.09
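This is, I think, why iostat looks deceptively fine: the five original disks are each doing ~400 reads/s at a ~15 KiB average request size, which is only ~6 MiB/s per spindle. They look IOPS-bound on small random reads (the scan workload), not bandwidth-bound, and %util sitting at 80-90 backs that up. Using the sdb row as an example:

```python
# sdb row from the iostat output above: r/s and rareq-sz (KiB).
r_s, rareq_kib = 412.40, 15.07
read_mib_s = r_s * rareq_kib / 1024
print(f"{read_mib_s:.1f} MiB/s")  # ~6.1 MiB/s of small random reads per disk
```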
But the zpool latency is (over 1 second on all disks):
zpool iostat -w
storage total_wait disk_wait syncq_wait asyncq_wait
latency read write read write read write read write scrub trim
---------- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
1ns 0 0 0 0 0 0 0 0 0 0
[...] # all zeros
127ns 0 0 0 0 0 0 0 0 0 0
255ns 0 0 0 0 1.17K 0 12 2.43K 1.61M 0
511ns 0 0 0 0 616K 484K 19.4K 1.39M 121M 0
1us 0 0 0 0 4.65M 895K 205K 10.4M 126M 0
2us 0 0 0 0 11.9M 49.4K 209K 2.74M 9.56M 0
4us 0 0 0 0 7.17M 4.99K 42.5K 168K 2.76M 0
8us 0 0 0 0 53.0K 1.37K 657 163K 3.96M 0
16us 72 0 315 0 11.7K 374 533 172K 2.84M 0
32us 61.3M 4 65.2M 36 768 8 201 269K 4.36M 0
65us 60.8M 96 60.6M 566 236 0 343 426K 6.04M 0
131us 79.4M 271 81.6M 879 80 0 734 824K 7.92M 0
262us 35.0M 443K 175M 464K 116 0 5.64K 2.11M 11.2M 0
524us 36.0M 8.21M 186M 50.5M 44 0 5.09K 5.27M 5.60M 0
1ms 17.5M 9.73M 58.8M 59.5M 89 0 2.63K 5.75M 3.15M 0
2ms 5.30M 13.9M 39.0M 41.3M 114 0 2.31K 9.08M 3.73M 0
4ms 6.29M 15.3M 97.1M 15.8M 176 0 3.48K 11.8M 6.05M 0
8ms 13.5M 12.3M 201M 2.59M 277 0 6.76K 9.97M 9.68M 0
16ms 26.7M 8.84M 198M 779K 334 0 8.13K 8.10M 14.9M 0
33ms 36.1M 9.82M 75.3M 275K 218 0 6.30K 9.17M 23.4M 0
67ms 41.8M 10.8M 12.5M 48.9K 215 0 2.79K 10.5M 37.1M 0
134ms 59.3M 9.52M 1.92M 9.46K 213 0 680 9.33M 57.2M 0
268ms 88.0M 7.43M 543K 893 272 0 121 7.35M 86.7M 0
536ms 132M 7.58M 389K 140 521 0 19 7.55M 131M 0
1s 190M 9.42M 21.4K 59 795 0 8 9.41M 189M 0
2s 205M 12.8M 2.09K 16 1.26K 0 0 12.8M 204M 0
4s 110M 17.9M 565 8 1.36K 0 0 17.9M 109M 0
8s 33.2M 15.0M 0 0 955 0 0 15.0M 33.0M 0
17s 11.3M 2.38M 0 0 269 0 0 2.38M 11.3M 0
34s 3.40M 18.4K 0 0 81 0 0 18.4K 3.40M 0
68s 392K 0 0 0 30 0 0 0 391K 0
137s 31.4K 0 0 0 13 0 0 0 31.4K 0
--------------------------------------------------------------------------------
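Putting a number on that histogram: summing the total_wait read buckets from 1s upward against the grand total suggests roughly 44% of reads waited a second or more. A throwaway sketch (bucket counts copied from the output above; I'm assuming the K/M suffixes are 1024-based as zfs usually prints them, though the ratio barely changes either way):

```python
# Share of reads with total_wait >= 1s, from the zpool iostat -w buckets above.
def n(s):
    # Expand a zpool-style suffixed count ("61.3M", "392K", "72") to a float.
    mult = {"K": 1024, "M": 1024**2}
    return float(s[:-1]) * mult[s[-1]] if s[-1] in mult else float(s)

all_buckets = [n(x) for x in (
    "72 61.3M 60.8M 79.4M 35.0M 36.0M 17.5M 5.30M 6.29M 13.5M 26.7M "
    "36.1M 41.8M 59.3M 88.0M 132M 190M 205M 110M 33.2M 11.3M 3.40M "
    "392K 31.4K").split()]
slow = sum(all_buckets[-8:])  # the 1s, 2s, 4s, 8s, 17s, 34s, 68s, 137s buckets
print(f"{slow / sum(all_buckets):.0%} of reads waited >= 1 s")
```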
The 1-second latency rows:
zpool iostat -vw | awk '/^1s/'
1s 191M 9.42M 21.5K 59 795 0 8 9.41M 190M 0
1s 191M 9.42M 21.5K 59 795 0 8 9.41M 190M 0
1s 31.6M 0 3.72K 0 117 0 0 0 31.4M 0
1s 41.1M 0 4.06K 0 198 0 0 0 41.0M 0
1s 41.8M 0 4.31K 0 171 0 0 0 41.7M 0
1s 40.3M 15 4.14K 0 204 0 0 15 40.2M 0
1s 35.8M 2 3.98K 0 105 0 0 1 35.6M 0
1s 1.91K 9.42M 1.29K 59 0 0 8 9.41M 554 0
1s 0 0 0 0 0 0 0 0 0 0
1s 1.91K 9.42M 1.29K 59 0 0 8 9.41M 554 0
Sorry for dumping so much information; I tried to keep it concise.
To be clear, I'm trying to find the cause of this crawling speed, but everything I look at seems fine:
- the smartctl reports aren't worrying (discussed with OVH)
- I did find it worrying that ECC is kicking in, but I doubt all 18 drives are bad:
smart_WUH721414AL5201_81G8L1SV.log - total errors corrected -> 193
smart_WUH721414AL5201_9LHD9Z7G.log - total errors corrected -> 4
smart_WUH721414AL5201_QBGDTMXT.log - total errors corrected -> 6
smart_WUH721414AL5201_Y6GME43C.log - total errors corrected -> 13
Anyway, I'm puzzled by this. If you know where to look, or need more information, just ask!
Thanks for your time.
EDIT:
- added the zpool list output
- the resilver finally finished:
Fri Apr 28 03:33:32 2023