Slow sequential speed on 9x7-drive raidz2 (ZFS ZoL 0.8.1)

2024-6-1

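I'm running a large ZFS pool on Ubuntu 18.04, built for sequential reads and writes at 256K+ request sizes and served over iSCSI (for backups). Given the need for high throughput and space efficiency, and less need for random small-block performance, I went with striped raidz2 rather than striped mirrors.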

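However, 256K sequential read performance is far lower than I expected (100 - 200MBps, with peaks up to 600MBps). When the zvols hit ~99% iowait in iostat, the backing devices typically sit between 10% and 40% iowait. To me that suggests the bottleneck is something I'm missing in the configuration: it shouldn't be the backplane or the CPUs in this system, and a sequential workload shouldn't work the ARC too hard.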

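I've played with a lot of module parameters (current config below) and read hundreds of articles, issues on the OpenZFS GitHub, and so on. Tuning prefetch and aggregation got me to this performance level. By default, sequential reads ran at about ~50MBps because ZFS was sending tiny requests (~16K) to the disks. With aggregation and prefetch working OK (I think), disk reads are much larger, averaging around ~64K in iostat.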

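The network path (LIO iSCSI target with cxgbit offload, plus the Windows Chelsio iSCSI initiator) works well outside of the ZFS zvols: a directly mapped Optane returns nearly full line rate on the NICs (~3.5GBps read and write).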

Am I expecting too much? I know ZFS prioritizes safety over performance, but I'd expect a 7x9 raidz2 to deliver better sequential reads than a single 9-drive mdadm raid6.

System specs and logs/config files:

Chassis: Supermicro 6047R-E1R72L
HBAs: 3x 2308 IT mode (24x 6Gbps SAS channels to backplanes)
CPU: 2x E5-2667v2 (8 cores @ 3.3GHz base each)
RAM: 128GB, 104GB dedicated to ARC
HDDs: 65x HGST 10TB HC510 SAS (9x 7-wide raidz2 + 2 spares)
SSDs: 2x Intel Optane 900P (partitioned for mirrored special and log vdevs)
NIC: Chelsio 40Gbps (same as on initiator, both using hw offloaded iSCSI)
OS: Ubuntu 18.04 LTS (using latest non-HWE kernel that allows ZFS SIMD)
ZFS: 0.8.1 via PPA
Initiator: Chelsio iSCSI initiator on Windows Server 2019

Pool config:

ashift=12
recordsize=128K (blocks on zvols are 64K, below)
compression=lz4
xattr=sa
redundant_metadata=most
atime=off
primarycache=all

ZVol config:

sparse
volblocksize=64K (matches OS allocation unit on top of iSCSI)

Pool layout:

7x 9-wide raidz2
mirrored 200GB optane special vdev (SPA metadata allocation classes)
mirrored 50GB optane log vdev
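
As a rough illustration of what this layout means for a 64K volblocksize, the sketch below estimates how much space a raidz vdev allocates for one block. It follows the usual raidz sizing rule (data sectors, plus parity sectors per stripe row, rounded up to a multiple of parity + 1). The function name `raidz_asize` is made up for this example, and it assumes incompressible data, whereas lz4 would shrink the physical block size. Since the post describes the layout both as 9x 7-wide and as 7x 9-wide, both widths are shown.

```python
def raidz_asize(psize, width, nparity, ashift=12):
    """Approximate bytes allocated on one raidz vdev for a block of psize bytes."""
    sector = 1 << ashift                      # 4K sectors for ashift=12
    data = -(-psize // sector)                # data sectors, rounded up
    rows = -(-data // (width - nparity))      # stripe rows needed for the data
    total = data + nparity * rows             # each row carries nparity parity sectors
    pad = nparity + 1
    total = -(-total // pad) * pad            # round up to a multiple of nparity + 1
    return total * sector


if __name__ == "__main__":
    block = 64 * 1024                         # volblocksize from the zvol config
    for width in (9, 7):
        alloc = raidz_asize(block, width, nparity=2)
        print(f"{width}-wide raidz2: 64K block allocates {alloc // 1024}K "
              f"({block / alloc:.0%} efficient, ideal {(width - 2) / width:.0%})")
```

Under these assumptions both widths allocate 96K for each 64K block, so the zvol's space efficiency lands below the ideal data-to-total disk ratio of either layout.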

/etc/modprobe.d/zfs.conf:

# 52 - 104GB ARC, this system does nothing else
options zfs zfs_arc_min=55834574848
options zfs zfs_arc_max=111669149696

# allow for more dirty async data
options zfs zfs_dirty_data_max_percent=25
options zfs zfs_dirty_data_max=34359738368

# txg timeout given we have plenty of Optane ZIL
options zfs zfs_txg_timeout=5

# tune prefetch (have played with this 1000x different ways, no major improvement except max_streams to 2048, which helped, I think)
options zfs zfs_prefetch_disable=0
options zfs zfetch_max_distance=134217728
options zfs zfetch_max_streams=2048
options zfs zfetch_min_sec_reap=3
options zfs zfs_arc_min_prefetch_ms=250
options zfs zfs_arc_min_prescient_prefetch_ms=250
options zfs zfetch_array_rd_sz=16777216

# tune coalescing (same-ish, increasing the read gap limit helped throughput in conjunction with low async read max_active, as it caused much bigger reads to be sent to the backing devices)
options zfs zfs_vdev_aggregation_limit=16777216
options zfs zfs_vdev_read_gap_limit=1048576
options zfs zfs_vdev_write_gap_limit=262144

# ZIO scheduler in priority order 
options zfs zfs_vdev_sync_read_min_active=1
options zfs zfs_vdev_sync_read_max_active=10
options zfs zfs_vdev_sync_write_min_active=1
options zfs zfs_vdev_sync_write_max_active=10
options zfs zfs_vdev_async_read_min_active=1
options zfs zfs_vdev_async_read_max_active=2
options zfs zfs_vdev_async_write_min_active=1
options zfs zfs_vdev_async_write_max_active=4

# zvol threads
options zfs zvol_threads=32
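
Raw byte counts like these are easy to mistype, so a quick sanity check can help. The sketch below parses a few `options zfs name=value` lines copied from the config above and prints them in binary units. The helper names `parse_options` and `human` are invented for this example and are not part of any ZFS tooling.

```python
import re

# A few lines copied from the zfs.conf above.
CONF = """\
options zfs zfs_arc_min=55834574848
options zfs zfs_arc_max=111669149696
options zfs zfs_dirty_data_max=34359738368
options zfs zfetch_max_distance=134217728
options zfs zfs_vdev_aggregation_limit=16777216
options zfs zfs_vdev_read_gap_limit=1048576
"""


def parse_options(text):
    """Return {parameter: int value} for 'options zfs name=value' lines."""
    params = {}
    for line in text.splitlines():
        m = re.match(r"\s*options\s+zfs\s+(\w+)=(\d+)\s*$", line)
        if m:
            params[m.group(1)] = int(m.group(2))
    return params


def human(n):
    """Format a byte count using binary units, capped at GiB."""
    for unit in ("B", "KiB", "MiB", "GiB"):
        if n < 1024 or unit == "GiB":
            return f"{n:g} {unit}"
        n /= 1024


if __name__ == "__main__":
    for name, value in parse_options(CONF).items():
        print(f"{name:32s} {human(value)}")
```

Run against these lines, the ARC bounds come out as 52 GiB and 104 GiB, which matches the comment in the config, and the dirty data cap comes out as 32 GiB.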

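I've been banging my head against the wall on this. My users are pushing me to go all-Windows with Storage Spaces, but I've used parity Storage Spaces (and even Storage Spaces Direct with mirrors), and that isn't pretty either. I'm tempted to go straight to mdadm raid60 under iSCSI, but I'd love it if someone could point out something boneheaded I'm missing that would unlock performance while keeping ZFS's bitrot protection :)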

Answer 1

Good question.

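I think your sparse zvol block size should be 128k.
Your ZIO scheduler settings should all be higher, e.g. a minimum of 10 and a maximum of 64.
zfs_txg_timeout should be much longer. I use 15 or 30 seconds on my systems.
I think the multiple RAIDZ3 vdevs (or was that a typo?) are overkill and play a big part in the performance. Can you benchmark with RAIDZ2?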
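
To make the block size suggestion concrete, here is a small sketch of how one block is split across the data disks of a raidz2 vdev, which bounds the size of each per-disk read before any aggregation. The function name `data_column_sizes` is invented for this example. It assumes ashift=12 and incompressible data, and it ignores parity columns since a healthy read does not need them.

```python
def data_column_sizes(block_bytes, width, nparity=2, ashift=12):
    """Bytes of one block that land on each data disk of a raidz vdev."""
    sector = 1 << ashift
    sectors = -(-block_bytes // sector)        # data sectors, rounded up
    data_disks = width - nparity
    q, r = divmod(sectors, data_disks)         # the first r columns get one extra sector
    return [(q + 1) * sector if i < r else q * sector
            for i in range(data_disks)]


if __name__ == "__main__":
    for width in (9, 7):
        for kib in (64, 128):
            cols = data_column_sizes(kib * 1024, width)
            sizes = ", ".join(f"{c // 1024}K" for c in cols)
            print(f"{width}-wide raidz2, {kib}K block: {sizes}")
```

With a 64K block each disk serves only 8K to 16K per block, which is in the same range as the ~16K requests described in the question. Doubling the block size to 128K roughly doubles the per-disk chunk, although whether that translates into better throughput on this system would still need benchmarking.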

Edit: Install Netdata on the system to monitor utilization and ZFS stats.

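Edit 2: This is for a Veeam repository. Veeam supports Linux as a target and works well with ZFS. Would you consider benchmarking that with your data? zvols aren't an ideal fit for what you're doing, unless the NIC's offload is a critical part of the solution.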
