mdadm RAID 10 far layout: non-uniform read speed

2024-5-15

Why does the sequential read speed of RAID 10 only scale up with a very large bs (>100 MB, the dd parameter)? The array:

/dev/md127:
           Version : 1.2
        Raid Level : raid10
        Array Size : 46875009024 (44703.49 GiB 48000.01 GB)
     Used Dev Size : 11718752256 (11175.87 GiB 12000.00 GB)
      Raid Devices : 8
     Total Devices : 8
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

             State : clean
    Active Devices : 8
   Working Devices : 8
    Failed Devices : 0
     Spare Devices : 0

            Layout : far=2
        Chunk Size : 1024K

Consistency Policy : bitmap

I expected a sequential read speed of at least 100 MB/s * n_drives = 800+ MB/s. But:

dd if=/dev/md127 of=/dev/null bs=10240k count=1000 iflag=direct
1000+0 records in
1000+0 records out
10485760000 bytes (10 GB, 9.8 GiB) copied, 14.2918 s, 734 MB/s

iostat -zxs 1
Device             tps      kB/s    rqm/s   await  areq-sz  aqu-sz  %util
md127          2880.00 737280.00     0.00    0.00   256.00    0.00   0.00
sda             360.00  92160.00     0.00    5.21   256.00    1.24  70.80
sdb             360.00  92160.00     0.00    5.05   256.00    1.14  74.80
sdc             367.00  93952.00     0.00    5.25   256.00    1.26  76.80
sdd             368.00  94208.00     0.00    6.46   256.00    1.70  80.80
sde             360.00  92160.00     0.00    5.53   256.00    1.31  75.60
sdf             362.00  92672.00     0.00    6.15   256.00    1.54  72.40
sdg             364.00  93184.00     0.00    5.18   256.00    1.24  73.20
sdh             364.00  93184.00     0.00    5.73   256.00    1.40  70.40

If I test a single drive:

dd if=/dev/sda of=/dev/null bs=1024k count=1000 iflag=direct
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 4.25743 s, 246 MB/s

iostat -xs /dev/sda 1
Device             tps      kB/s    rqm/s   await  areq-sz  aqu-sz  %util
sda             868.00 222208.00     0.00    2.79   256.00    0.70 100.00

Only when I set bs very large do I get a read speed close to single-drive speed * n_drives:

dd if=/dev/md127 of=/dev/null bs=1024000k count=30 iflag=direct
30+0 records in
30+0 records out
31457280000 bytes (31 GB, 29 GiB) copied, 16.2737 s, 1.9 GB/s

iostat -dxs 1
Device             tps      kB/s    rqm/s   await  areq-sz  aqu-sz  %util
md127         10115.00 2341348.00     0.00    0.00   231.47    0.00   0.00
sda            1077.00 259848.00   187.00  153.82   241.27  163.51  95.20
sdb            1077.00 260612.00   197.00  162.94   241.98  173.33  99.20
sdc            1083.00 262412.00   197.00  160.82   242.30  171.96  98.40
sdd            1067.00 258568.00   195.00  170.78   242.33  180.09 100.00
sde            1086.00 262416.00   195.00  159.38   241.64  170.90  98.40
sdf            1077.00 260360.00   189.00  155.88   241.75  165.71  96.40
sdg            1073.00 259076.00   197.00  160.96   241.45  170.56  98.00
sdh            1085.00 260872.00   191.00  163.61   240.44  175.34  99.60
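
As a rough cross-check of the figures above, the single-drive result can be scaled to an ideal aggregate and compared with the two array runs. This is only a back-of-the-envelope sketch: the function names are made up for illustration, it assumes all eight drives stream at the 246 MB/s measured on /dev/sda, and 1.9 GB/s is taken as roughly 1900 MB/s.

```shell
#!/usr/bin/env bash
# Ideal aggregate throughput if every member drive streamed at the
# single-drive rate. Inputs are whole numbers in MB/s.
ideal_aggregate() {
    local per_drive_mbs=$1 n_drives=$2
    echo $(( per_drive_mbs * n_drives ))
}

# Percentage of the ideal that a measured rate reaches (integer, rounded down).
percent_of_ideal() {
    local measured_mbs=$1 ideal_mbs=$2
    echo $(( measured_mbs * 100 / ideal_mbs ))
}

ideal=$(ideal_aggregate 246 8)    # 246 MB/s comes from the /dev/sda test
echo "ideal: ${ideal} MB/s"
echo "bs=10240k reaches $(percent_of_ideal 734 "$ideal")% of ideal"
# 1900 MB/s is an approximation of the reported 1.9 GB/s
echo "bs=1024000k reaches $(percent_of_ideal 1900 "$ideal")% of ideal"
```

Under those assumptions the small-bs run sits well below the ideal while the huge-bs run comes close to it, which matches the %util columns in the two iostat samples.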

My workload consists mainly of sequential reads, but I am not sure the application will issue such huge read IOs.

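Since the tests run directly on the block device, the problem is not in the FS. areq-sz is nearly the same in all cases, but %util is lower for the RAID. Does that mean the RAID has a problem generating requests (and if so, why is aqu-sz so huge for the RAID)? How can I find the cause?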

DISTRIB_DESCRIPTION="Ubuntu 20.04.2 LTS"
Linux 5.4.0-67-generic #75-Ubuntu SMP Fri Feb 19 18:03:38 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

cat /sys/block/md127/queue/scheduler
none

Edit: I was wrong about ">100MB" in my first line. I ran some experiments with different bs values (to check the aligned-read variant):

dd if=/dev/md127 of=/dev/null bs=8M count=3000 iflag=direct
25165824000 bytes (25 GB, 23 GiB) copied, 33.1877 s, 758 MB/s

dd if=/dev/md127 of=/dev/null bs=16M count=300 iflag=direct
5033164800 bytes (5.0 GB, 4.7 GiB) copied, 4.71832 s, 1.1 GB/s

dd if=/dev/md127 of=/dev/null bs=18M count=1000 iflag=direct
18874368000 bytes (19 GB, 18 GiB) copied, 18.4601 s, 1.0 GB/s

dd if=/dev/md127 of=/dev/null bs=20M count=500 iflag=direct
10485760000 bytes (10 GB, 9.8 GiB) copied, 10.0867 s, 1.0 GB/s

dd if=/dev/md127 of=/dev/null bs=32M count=300 iflag=direct
10066329600 bytes (10 GB, 9.4 GiB) copied, 7.29756 s, 1.4 GB/s

dd if=/dev/md127 of=/dev/null bs=128M count=100 iflag=direct
13421772800 bytes (13 GB, 12 GiB) copied, 8.27345 s, 1.6 GB/s

dd if=/dev/md127 of=/dev/null bs=256M count=100 iflag=direct
26843545600 bytes (27 GB, 25 GiB) copied, 15.5701 s, 1.7 GB/s

dd if=/dev/md127 of=/dev/null bs=512M count=100 iflag=direct
53687091200 bytes (54 GB, 50 GiB) copied, 28.9437 s, 1.9 GB/s

dd if=/dev/md127 of=/dev/null bs=1G count=32 iflag=direct
34359738368 bytes (34 GB, 32 GiB) copied, 18.36 s, 1.9 GB/s

The speed improves noticeably from bs=32M, although it is still hard to understand why it keeps growing with bs (for example from 256M to 512M).
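
A sweep like the one above can be generated rather than typed by hand. The sketch below only prints the dd commands so they can be reviewed before being run as root. The function name, the DEV default and the TOTAL_MIB default are illustrative assumptions, not values from the post; each run is sized to read about the same total so the timings are comparable.

```shell
#!/usr/bin/env bash
# Print (do not run) one dd command per block size, each reading roughly
# the same total amount of data.
# DEV and TOTAL_MIB are illustrative defaults, not values from the post.
DEV=${DEV:-/dev/md127}
TOTAL_MIB=${TOTAL_MIB:-16384}

sweep_commands() {
    local dev=$1 total_mib=$2 bs_mib count
    shift 2
    for bs_mib in "$@"; do
        count=$(( total_mib / bs_mib ))    # whole blocks only
        [ "$count" -ge 1 ] || continue     # skip bs larger than the total
        echo "dd if=$dev of=/dev/null bs=${bs_mib}M count=$count iflag=direct"
    done
}

# Block sizes in MiB; 8 is one stripe under the chunk-times-drives model
sweep_commands "$DEV" "$TOTAL_MIB" 8 16 32 128 512 1024
```

Keeping the total fixed avoids comparing a 5 GB run against a 50 GB run, where drive zoning (outer versus inner tracks) could also affect the result.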

Answer 1

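Your chunk size is 1M, so with 8 drives your stripe size is 8M. If you issue a 10M direct IO read, it needs one stripe plus a fraction of another, so two stripes have to be read to complete it. If you are going to use direct IO, you will want to keep bs an exact multiple of the stripe size. Normal applications also do not use direct IO but go through the cache, so you may need to make sure the read-ahead value is large enough ( /sys/block/mdxxx/queue ). Using a smaller chunk size when creating the array is probably a good idea. I thought the default these days was 512k, but it used to be 64k, and I do not know why it was increased.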
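
The stripe arithmetic in this answer can be sketched in a few lines of shell. This follows the answer's model (stripe = chunk size * number of drives) and is an illustration, not a description of how md schedules reads internally. The function names are hypothetical and all sizes are in KiB.

```shell
#!/usr/bin/env bash
# Stripe model from the answer: stripe = chunk size * number of drives.
# All sizes are in KiB.
stripe_kib() {
    local chunk_kib=$1 n_drives=$2
    echo $(( chunk_kib * n_drives ))
}

# Number of stripes a read of bs_kib touches when it starts at offset_kib.
stripes_touched() {
    local offset_kib=$1 bs_kib=$2 stripe=$3
    local first=$(( offset_kib / stripe ))
    local last=$(( (offset_kib + bs_kib - 1) / stripe ))
    echo $(( last - first + 1 ))
}

# Succeeds (exit status 0) when bs is a whole multiple of the stripe size.
is_stripe_aligned() {
    local bs_kib=$1 stripe=$2
    [ $(( bs_kib % stripe )) -eq 0 ]
}

stripe=$(stripe_kib 1024 8)    # 1024K chunk, 8 drives
echo "stripe: ${stripe} KiB"
echo "10M read at offset 0 touches $(stripes_touched 0 10240 "$stripe") stripes"
for bs in 8192 10240 16384 18432; do
    if is_stripe_aligned "$bs" "$stripe"; then
        echo "bs=${bs}K aligned"
    else
        echo "bs=${bs}K not aligned"
    fi
done
```

For buffered reads, the same stripe figure can guide the read-ahead setting: the current value, in KiB, is in the read_ahead_kb file under the queue directory the answer mentions.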
