Hardware

2024-5-31

I had trouble getting acceptable read/write performance out of RAID5 + crypt + ext4, and finally narrowed it down to the following question:

Hardware

Disks: 4x WD RED 3 TB WDC WD30EFRX-68EUZN0 as /dev/sd[efgh]
sde and sdf are connected via controller A using a 3 Gbps SATA link (although 6 Gbps would be available)
sdg and sdh are connected via controller B using a 6 Gbps SATA link

Single-disk performance

Four write tests per disk (everything as I expected)

# dd if=/dev/zero of=/dev/sd[efgh] bs=2G count=1 oflag=dsync
sde: 2147479552 bytes (2.1 GB) copied, xxx s, [127, 123, 132, 127] MB/s
sdf: 2147479552 bytes (2.1 GB) copied, xxx s, [131, 130, 118, 137] MB/s
sdg: 2147479552 bytes (2.1 GB) copied, xxx s, [145, 145, 145, 144] MB/s
sdh: 2147479552 bytes (2.1 GB) copied, xxx s, [126, 132, 132, 132] MB/s
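The `sd[efgh]` in the command above is shorthand for running the same test on each disk in turn. A minimal sketch of that loop follows; `write_test` is a hypothetical helper name, not something from the original post, and GNU dd is assumed. Writing to a raw disk destroys its contents, so the device invocation is left commented out.

```shell
#!/bin/sh
# write_test BS TARGET...
# Run one synchronous write of size BS against each target in turn.
# WARNING: pointing this at a raw disk such as /dev/sde overwrites its data.
write_test() {
    bs=$1
    shift
    for target in "$@"; do
        printf '%s: ' "$target"
        # dd reports its statistics on stderr; keep only the summary line
        dd if=/dev/zero of="$target" bs="$bs" count=1 oflag=dsync 2>&1 | tail -n 1
    done
}

# Destructive example matching the test above (run as root):
# write_test 2G /dev/sde /dev/sdf /dev/sdg /dev/sdh
```

Trying it against a scratch file first, with a small block size, is a safe way to check the output format before touching real disks.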

Read tests with hdparm and dd (everything as I expected)

# hdparm -tT /dev/sd[efgh]
# echo 3 | tee /proc/sys/vm/drop_caches; dd of=/dev/null if=/dev/sd[efgh] bs=2G count=1 iflag=fullblock

(sde)
Timing cached reads:   xxx MB in  2.00 seconds = [13983.68, 14136.87] MB/sec
Timing buffered disk reads: xxx MB in  3.00 seconds = [143.16, 143.14] MB/sec
2147483648 bytes (2.1 GB) copied, xxx s, [140, 141] MB/s

(sdf)
Timing cached reads:   xxx MB in  2.00 seconds = [14025.80, 13995.14] MB/sec
Timing buffered disk reads: xxx MB in  3.00 seconds = [140.31, 140.61] MB/sec
2147483648 bytes (2.1 GB) copied, xxx s, [145, 141] MB/s

(sdg)
Timing cached reads:   xxx MB in  2.00 seconds = [14005.61, 13801.93] MB/sec
Timing buffered disk reads: xxx MB in  3.00 seconds = [153.11, 151.73] MB/sec
2147483648 bytes (2.1 GB) copied, xxx s, [154, 155] MB/s

(sdh)
Timing cached reads:   xxx MB in  2.00 seconds = [13816.84, 14335.93] MB/sec
Timing buffered disk reads: xxx MB in  3.00 seconds = [142.50, 142.12] MB/sec
2147483648 bytes (2.1 GB) copied, xxx s, [140, 140] MB/s

Partitions on sd[efgh]

4x 32 GiB for testing

# gdisk -l /dev/sd[efgh]
GPT fdisk (gdisk) version 0.8.10

Partition table scan:
  MBR: protective
  BSD: not present
  APM: not present
  GPT: present

Found valid GPT with protective MBR; using GPT.
Disk /dev/sde: 5860533168 sectors, 2.7 TiB
Logical sector size: 512 bytes
Disk identifier (GUID): xxx
Partition table holds up to 128 entries
First usable sector is 34, last usable sector is 5860533134
Partitions will be aligned on 2048-sector boundaries
Total free space is 5793424237 sectors (2.7 TiB)

Number  Start (sector)    End (sector)  Size       Code  Name
   1            2048        67110911   32.0 GiB    FD00  Linux RAID

The array

# mdadm --create --verbose /dev/md0 --level=5 --raid-devices=4 --chunk=256K /dev/sd[efgh]1
(some tests later ...)
# mdadm --grow --verbose /dev/md0 --layout=right-asymmetric
# mdadm --detail /dev/md0
/dev/md0:
    Version : 1.2
  Creation Time : Sat Dec 10 03:07:56 2016
     Raid Level : raid5
     Array Size : 100561920 (95.90 GiB 102.98 GB)
  Used Dev Size : 33520640 (31.97 GiB 34.33 GB)
   Raid Devices : 4
  Total Devices : 4
    Persistence : Superblock is persistent

    Update Time : Sat Dec 10 23:56:53 2016
          State : clean
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0

         Layout : right-asymmetric
     Chunk Size : 256K

           Name : vm:0  (local to host vm)
           UUID : 80d0f886:dc380755:5387f78c:1fac60da
         Events : 158

    Number   Major   Minor   RaidDevice State
       0       8       65        0      active sync   /dev/sde1
       1       8       81        1      active sync   /dev/sdf1
       2       8       97        2      active sync   /dev/sdg1
       4       8      113        3      active sync   /dev/sdh1

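Current situation

I would expect the array's sequential read and write performance to be somewhere around 350 - 400 MB/s. Reading or writing the entire volume does in fact yield results squarely in this range: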

# echo 3 | tee /proc/sys/vm/drop_caches; dd of=/dev/null if=/dev/md0 bs=256K
102975406080 bytes (103 GB) copied, 261.373 s, 394 MB/s

# dd if=/dev/zero of=/dev/md0 bs=256K conv=fdatasync
102975406080 bytes (103 GB) copied, 275.562 s, 374 MB/s
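The 350 - 400 MB/s expectation follows from RAID5 arithmetic: with 4 member disks, each stripe carries 3 data chunks and 1 parity chunk, so sequential throughput is roughly three times a single disk's rate. The sketch below works that out; the 130 MB/s per-disk figure is an assumed round number taken from the single-disk measurements, and `raid5_estimate` is a hypothetical helper name.

```shell
#!/bin/sh
# raid5_estimate N_DISKS PER_DISK_MBPS CHUNK_KIB
# RAID5 spends one chunk per stripe on parity, so N-1 disks carry data.
raid5_estimate() {
    data_disks=$(( $1 - 1 ))
    echo "data disks: $data_disks"
    echo "full stripe: $(( data_disks * $3 )) KiB"
    echo "expected sequential rate: ~$(( data_disks * $2 )) MB/s"
}

# 4 disks, ~130 MB/s each (assumed), 256 KiB chunks
raid5_estimate 4 130 256
# data disks: 3
# full stripe: 768 KiB
# expected sequential rate: ~390 MB/s
```

The same factor of three shows up in the mdadm output: the Array Size of 100561920 is exactly 3 x the Used Dev Size of 33520640.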

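However, write performance depends heavily on the amount of data written. As expected, the transfer rate rises with the amount of data, but it collapses on reaching 2 GiB and recovers only slowly as the size increases further: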

# dd if=/dev/zero of=/dev/md0 bs=256K conv=fdatasync count=x
count=1: 262144 bytes (262 kB) copied, xxx s, [3.6, 7.6, 8.9, 8.9] MB/s
count=2: 524288 bytes (524 kB) copied, xxx s, [3.1, 17.7, 15.3, 15.7] MB/s
count=4: 1048576 bytes (1.0 MB) copied, xxx s, [13.2, 23.9, 26.9, 25.4] MB/s
count=8: 2097152 bytes (2.1 MB) copied, xxx s, [24.3, 46.7, 45.9, 42.8] MB/s
count=16: 4194304 bytes (4.2 MB) copied, xxx s, [5.1, 77.3, 42.6, 73.2, 79.8] MB/s
count=32: 8388608 bytes (8.4 MB) copied, xxx s, [68.6, 101, 99.7, 101] MB/s
count=64: 16777216 bytes (17 MB) copied, xxx s, [52.5, 136, 159, 159] MB/s
count=128: 33554432 bytes (34 MB) copied, xxx s, [38.5, 175, 185, 189, 176] MB/s
count=256: 67108864 bytes (67 MB) copied, xxx s, [53.5, 244, 229, 238] MB/s
count=512: 134217728 bytes (134 MB) copied, xxx s, [111, 288, 292, 288] MB/s
count=1K: 268435456 bytes (268 MB) copied, xxx s, [171, 328, 319, 322] MB/s
count=2K: 536870912 bytes (537 MB) copied, xxx s, [228, 337, 330, 334] MB/s
count=4K: 1073741824 bytes (1.1 GB) copied, xxx s, [338, 348, 348, 343] MB/s <-- ok!
count=8K: 2147483648 bytes (2.1 GB) copied, xxx s, [168, 147, 138, 139] MB/s <-- bad!
count=16K: 4294967296 bytes (4.3 GB) copied, xxx s, [155, 160, 178, 144] MB/s
count=32K: 8589934592 bytes (8.6 GB) copied, xxx s, [256, 238, 264, 246] MB/s
count=64K: 17179869184 bytes (17 GB) copied, xxx s, [298, 285] MB/s
count=128K: 34359738368 bytes (34 GB) copied, xxx s, [347, 336] MB/s
count=256K: 68719476736 bytes (69 GB) copied, xxx s, [363, 356] MB/s <-- getting better

(Below 2 GiB, the first measurement of each series seems to indicate that some read caching is involved.)
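A series like the one above can be reproduced with a small loop over `count` values. This is a sketch only: `sweep_writes` is a hypothetical helper name, GNU dd is assumed, and the array invocation is commented out because it overwrites the device.

```shell
#!/bin/sh
# sweep_writes TARGET COUNT...
# Repeat the 256 KiB-block write test for each COUNT and print one
# summary line per run, labelled with the count that produced it.
sweep_writes() {
    target=$1
    shift
    for count in "$@"; do
        printf 'count=%s: ' "$count"
        # conv=fdatasync: dd issues one fdatasync before it exits
        dd if=/dev/zero of="$target" bs=256K count="$count" conv=fdatasync 2>&1 | tail -n 1
    done
}

# Destructive example against the array (run as root):
# sweep_writes /dev/md0 1 2 4 8 16 32 64 128 256 512 1K 2K 4K 8K 16K
```

Running each count several times, as in the table, helps separate first-run effects from steady-state behavior.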

When transferring 2 GiB or more, I noticed something odd in iotop:

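Phase 1: At the beginning, both "Total Disk Write" and "Actual Disk Write" are at roughly 400 MB/s. The IO value of dd is around 85 %, while it is 0 % for everything else. This phase lasts longer for larger transfers.
Phase 2: A few seconds (~16 s) before the transfer completes, a kworker jumps in and steals 30 - 50 percentage points of IO from dd. The split fluctuates between 30:50 % and 50:30 %. At the same time, "Total Disk Write" drops to 0 B/s and "Actual Disk Write" jumps around between 20 and 70 MB/s. This phase seems to last a constant amount of time.
Phase 3: In the last 3 seconds, "Actual Disk Write" jumps to > 400 MB/s while "Total Disk Write" stays at 0 B/s. Both dd and the kworker show an IO value of 0 %.
Phase 4: The IO value of dd jumps to 5 % for one second. At the same moment the transfer completes.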

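Questions

What is that mysterious phase 2? Both processes seem to be fighting over IO.

Who is transferring the data to the hardware in phase 3?

Most importantly: how can I minimize this odd effect and get the full 400 MB/s the array appears able to deliver? (Or am I even asking an XY question?)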

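Bonus

Getting to the current state involved a long process of trial and error. I switched the scheduler from cfq to noop and reduced the RAID chunk size from 512k to 256k, which improved results slightly. Changing to --layout=right-asymmetric did not change anything. Temporarily disabling the disks' write cache made performance worse.

The crypt layer mentioned in the first sentence is completely absent for now and will be reintroduced later.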

# uname -a
Linux vm 3.16.0-4-amd64 #1 SMP Debian 3.16.36-1+deb8u2 (2016-10-19) x86_64 GNU/Linux

Answer 1

What you are seeing is an artifact of the dd command line, specifically of the conv=fdatasync option. From the man page:

Each CONV symbol may be:
...
fdatasync: physically write output file data before finishing
...

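conv=fdatasync basically instructs dd to issue a single, final fdatasync system call before returning. While dd is running, however, its writes are cached. Your I/O phases can be explained as follows: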

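dd quickly writes into the page cache without actually touching the disks
the page cache becomes almost full, and a kernel kworker starts flushing it to disk. During the flush, dd is briefly paused (causing high iowait); once some page cache has been freed, dd can resume
the difference between "Total Disk Write" and "Actual Disk Write" in iotop depends on how the page cache is being filled and flushed
the cycle repeats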
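The fill-and-flush cycle described above can be watched directly through the `Dirty` and `Writeback` counters in /proc/meminfo. The sketch below is one way to do that; `dirty_summary` is a hypothetical helper name. The point at which dd starts to block depends on `vm.dirty_ratio` and the amount of RAM, which the question does not state, so no specific threshold is claimed here.

```shell
#!/bin/sh
# dirty_summary: read /proc/meminfo-style text on stdin and print the
# Dirty and Writeback counters converted from kB to MiB.
dirty_summary() {
    awk '$1 == "Dirty:" || $1 == "Writeback:" { printf "%s %d MiB\n", $1, $2 / 1024 }'
}

# Sample once per second while dd runs in another terminal:
# while sleep 1; do dirty_summary < /proc/meminfo; done

# Thresholds (percent of reclaimable memory) at which background
# flushing starts and at which writers are made to wait:
# sysctl vm.dirty_background_ratio vm.dirty_ratio
```

While dd is filling the cache, `Dirty` should climb; once flushing takes over, `Writeback` rises and `Dirty` falls.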

In short, there is no problem here. If you want to observe uncached behavior, replace conv=fdatasync with oflag=direct: with this flag, you bypass the page cache entirely.

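To observe cached but synchronized behavior, replace conv=fdatasync with oflag=sync: with this flag, dd opens the output with O_SYNC, so every block is synced to disk as it is written.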
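To compare the three modes side by side, the same write can be repeated with each option. This is a sketch assuming GNU dd; `compare_modes` is a hypothetical helper name, and the array invocation is commented out because it overwrites the device. Some filesystems (tmpfs, for example) reject O_DIRECT, in which case the `oflag=direct` line shows dd's error message instead of a rate.

```shell
#!/bin/sh
# compare_modes TARGET COUNT
# Write COUNT blocks of 256 KiB three times:
#   conv=fdatasync  cached writes, one fdatasync at the end
#   oflag=direct    O_DIRECT, bypasses the page cache
#   oflag=sync      O_SYNC, every block write is synchronous
compare_modes() {
    target=$1
    count=$2
    for opt in conv=fdatasync oflag=direct oflag=sync; do
        printf '%s: ' "$opt"
        dd if=/dev/zero of="$target" bs=256K count="$count" "$opt" 2>&1 | tail -n 1
    done
}

# Destructive example against the array (run as root):
# compare_modes /dev/md0 8K
```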

Further optimization can be obtained by fine-tuning the I/O stack (i.e. I/O scheduler, merge behavior, stripe cache, etc.), but that is another question.
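As a starting point for that kind of tuning, the relevant knobs live under sysfs. The sketch below only reads them and estimates the memory cost of a larger stripe cache, using the formula from the kernel's md documentation (page size x number of disks x stripe_cache_size). The value 4096 is an illustrative assumption, not a recommendation from the answer, and `stripe_cache_bytes` is a hypothetical helper name.

```shell
#!/bin/sh
# stripe_cache_bytes PAGE_SIZE NR_DISKS STRIPE_CACHE_SIZE
# Memory consumed by the md RAID5 stripe cache, in bytes.
stripe_cache_bytes() {
    echo $(( $1 * $2 * $3 ))
}

# Inspect the current settings (read-only):
# cat /sys/block/md0/md/stripe_cache_size
# cat /sys/block/sde/queue/scheduler

# Cost of a hypothetical stripe_cache_size of 4096 on this 4-disk array,
# assuming 4 KiB pages:
stripe_cache_bytes 4096 4 4096
# 67108864 (64 MiB)
```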
