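I have an NVIDIA Jetson TX2 system with an NVMe drive attached at PCIe gen2 x4 speed. The system runs a Linux 4.9 kernel, and I am having trouble with write performance. My goal is to write sequential data to the disk at around 260 MB/s. I believe the hardware should support this. Benchmarking with fio: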
fio --name=seq_write --filename=testfile --size=10G --bs=2m --rw=write --time_based --runtime=20 --direct=1 --ioengine=libaio --iodepth=1 --output=seq-write-direct.out
I can see that the write speed is about 870 MB/s when the buffer cache is bypassed (--direct=1).
seq_write: (g=0): rw=write, bs=(R) 2048KiB-2048KiB, (W) 2048KiB-2048KiB, (T) 2048KiB-2048KiB, ioengine=libaio, iodepth=1
fio-3.1
Starting 1 process
seq_write: (groupid=0, jobs=1): err= 0: pid=3639: Tue Mar 9 17:28:02 2021
write: IOPS=414, BW=829MiB/s (870MB/s)(16.2GiB/20001msec)
slat (usec): min=319, max=3021, avg=1187.86, stdev=754.71
clat (usec): min=570, max=3826, avg=1077.90, stdev=139.89
lat (usec): min=1624, max=4674, avg=2266.94, stdev=645.00
clat percentiles (usec):
| 1.00th=[ 709], 5.00th=[ 799], 10.00th=[ 807], 20.00th=[ 1012],
| 30.00th=[ 1074], 40.00th=[ 1106], 50.00th=[ 1123], 60.00th=[ 1123],
| 70.00th=[ 1123], 80.00th=[ 1156], 90.00th=[ 1172], 95.00th=[ 1287],
| 99.00th=[ 1303], 99.50th=[ 1303], 99.90th=[ 1352], 99.95th=[ 1434],
| 99.99th=[ 3818]
bw ( KiB/s): min=614400, max=1093632, per=100.00%, avg=889105.49, stdev=204457.74, samples=37
iops : min= 300, max= 534, avg=434.08, stdev=99.89, samples=37
lat (usec) : 750=1.75%, 1000=15.58%
lat (msec) : 2=82.63%, 4=0.05%
cpu : usr=6.88%, sys=47.06%, ctx=9040, majf=0, minf=22
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwt: total=0,8295,0, short=0,0,0, dropped=0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: bw=829MiB/s (870MB/s), 829MiB/s-829MiB/s (870MB/s-870MB/s), io=16.2GiB (17.4GB), run=20001-20001msec
Disk stats (read/write):
nvme0n1: ios=0/16381, merge=0/9, ticks=0/23476, in_queue=23460, util=73.80%
With direct I/O turned off (--direct=0), this looks very different.
fio --name=seq_write --filename=testfile --size=10G --bs=2m --rw=write --time_based --runtime=20 --direct=0 --ioengine=libaio --iodepth=1 --output=seq-write.out
seq_write: (g=0): rw=write, bs=(R) 2048KiB-2048KiB, (W) 2048KiB-2048KiB, (T) 2048KiB-2048KiB, ioengine=libaio, iodepth=1
fio-3.1
Starting 1 process
seq_write: (groupid=0, jobs=1): err= 0: pid=1461: Tue Mar 9 17:25:12 2021
write: IOPS=130, BW=260MiB/s (273MB/s)(5206MiB/20003msec)
slat (msec): min=6, max=108, avg= 7.19, stdev= 2.06
clat (usec): min=14, max=346, avg=20.14, stdev= 7.23
lat (msec): min=6, max=108, avg= 7.22, stdev= 2.06
clat percentiles (usec):
| 1.00th=[ 16], 5.00th=[ 19], 10.00th=[ 19], 20.00th=[ 20],
| 30.00th=[ 20], 40.00th=[ 20], 50.00th=[ 21], 60.00th=[ 21],
| 70.00th=[ 21], 80.00th=[ 21], 90.00th=[ 21], 95.00th=[ 22],
| 99.00th=[ 24], 99.50th=[ 37], 99.90th=[ 65], 99.95th=[ 122],
| 99.99th=[ 347]
bw ( KiB/s): min=176128, max=294912, per=100.00%, avg=280496.89, stdev=21426.33, samples=38
iops : min= 86, max= 144, avg=136.95, stdev=10.47, samples=38
lat (usec) : 20=50.33%, 50=49.29%, 100=0.31%, 250=0.04%, 500=0.04%
cpu : usr=1.90%, sys=96.59%, ctx=810, majf=0, minf=20
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwt: total=0,2603,0, short=0,0,0, dropped=0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: bw=260MiB/s (273MB/s), 260MiB/s-260MiB/s (273MB/s-273MB/s), io=5206MiB (5459MB), run=20003-20003msec
Disk stats (read/write):
nvme0n1: ios=0/4558, merge=0/6, ticks=0/286996, in_queue=286996, util=15.04%
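Performance drops to about 260 MiB/s (273 MB/s). The drop appears to be CPU-bound (sys=96.59% in the buffered run). I would like to know whether anything can be done on the kernel side to improve write performance. I asked a similar question on the NVIDIA forums about a performance drop between the TX1 and the TX2.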
Answer 1
(Thanks for including the relevant fio output in the question and formatting it; that really helps!)
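When you are not using direct I/O, Linux AIO (which fio's libaio ioengine uses) may block (see point 1). Since your iodepth is only 1, you would be better off with a synchronous I/O engine (e.g. pvsync2), because you are paying for asynchronous machinery that you have chosen not to use (can you elaborate on why?). The statistics you posted seem to be sounding a warning: can you see how the different kinds of latency differ between the two jobs above? In the buffered run almost all of the latency is submission latency (avg slat 7.19 ms against avg clat 20.14 usec), whereas in the direct run slat and clat are comparable (about 1.19 ms and 1.08 ms).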
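As a rough sketch of the suggestion above, the buffered job could be rerun with a synchronous engine. The command below is the question's second job with --ioengine changed to pvsync2 and --iodepth dropped. The RUN_FIO guard and the output filename seq-write-pvsync2.out are assumptions added for illustration, so the snippet only prints the command unless you opt in. Whether pvsync2 is available in your fio build is worth checking first (fio --enghelp lists the engines).

```shell
# Question's buffered job with libaio swapped for the synchronous pvsync2 engine.
# RUN_FIO and the output filename are illustrative assumptions, not from the question.
fio_cmd="fio --name=seq_write --filename=testfile --size=10G --bs=2m --rw=write --time_based --runtime=20 --direct=0 --ioengine=pvsync2 --output=seq-write-pvsync2.out"

if [ "${RUN_FIO:-0}" = "1" ]; then
    # Actually run the benchmark (writes up to 10G to ./testfile).
    $fio_cmd
else
    # Dry run: only show the command that would be executed.
    echo "$fio_cmd"
fi
```

Set RUN_FIO=1 in the environment to execute it. If the result is close to the libaio buffered run, the asynchronous engine was not buying anything at iodepth=1.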
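Also, when fio "finishes" in the second run, you cannot know how much I/O is still sitting only in the Linux kernel's cache, so it is unclear what you are comparing. Both jobs go through a filesystem, which may further complicate what you are measuring. A casual answerer cannot tell how much RAM your machine has, nor do we know at what point the kernel has to start flushing I/O to make room for more buffered I/O...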
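The unknowns mentioned above (RAM size, data still waiting in the cache, and the point at which the kernel starts flushing) can be read from /proc on Linux. This is a minimal sketch; the function name show_writeback_state is made up for illustration, and the snippet prints only what is readable on the machine it runs on.

```shell
# show_writeback_state is a hypothetical helper name used only for this sketch.
show_writeback_state() {
    # Total RAM plus data that is dirty (not yet written) or currently being written back.
    if [ -r /proc/meminfo ]; then
        grep -E '^(MemTotal|Dirty|Writeback):' /proc/meminfo || true
    fi

    # Thresholds that decide when the kernel starts flushing buffered writes.
    for knob in dirty_background_ratio dirty_ratio dirty_background_bytes dirty_bytes; do
        f="/proc/sys/vm/$knob"
        if [ -r "$f" ]; then
            echo "vm.$knob = $(cat "$f")"
        fi
    done
}

show_writeback_state
```

Running it just before the buffered job and again right after fio exits shows whether a large Dirty value remains, in which case part of the reported bandwidth is data that has not reached the disk yet. fio's end_fsync=1 option, which syncs the file when the write phase completes, should also help make the buffered figure more comparable to the direct one.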
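TL;DR: as a passer-by I would need to see more to get the full picture, so I can't say anything concrete. However, what you have posted makes your benchmarking methodology look suspect. Perhaps you are benchmarking something different from what you expect?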