My server has a number of NVMe disks. I am testing disk performance with fio, using:
fio --name=asdf --rw=randwrite --direct=1 --ioengine=libaio --bs=16k --numjobs=8 --size=10G --runtime=60 --group_reporting
For a single disk, LUKS has no significant impact on performance.
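For reference, the single-disk LUKS run was set up roughly like this (device, mapper name and mount point here are examples, not the exact ones I used):

# LUKS container on one NVMe disk, XFS on top, fio run against the mounted filesystem
sudo cryptsetup luksFormat /dev/nvme6n1
sudo cryptsetup open /dev/nvme6n1 single_crypt
sudo mkfs.xfs /dev/mapper/single_crypt
sudo mkdir -p /mnt/single
sudo mount /dev/mapper/single_crypt /mnt/single
cd /mnt/single
sudo fio --name=asdf --rw=randwrite --direct=1 --ioengine=libaio --bs=16k --numjobs=8 --size=10G --runtime=60 --group_reporting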
I then tried a 6-disk mdadm raid10 with an XFS filesystem. It performed well. But when I create a LUKS container on top of the mdadm device, performance is terrible.
To recap:
- 6-disk mdadm RAID10 + XFS = 116% of baseline, i.e. 16% better write throughput and IOPS than a single disk + XFS
- 6-disk mdadm RAID10 + LUKS + XFS = 33% of baseline, i.e. 67% worse write throughput and IOPS than a single disk + XFS
In all other scenarios I observed no performance difference between LUKS and non-LUKS, including LVM spanning, striping and mirroring. In other words, a 6-disk mdadm RAID10 (which, as I understand it, stripes across three 2-disk mirrors) with a LUKS container and an XFS or ext4 filesystem performs worse in every respect than:
- a single disk, with or without LUKS
- 2 LUKS disks mirrored with LVM (2 LUKS containers)
- 2 LUKS disks spanned with LVM (2 LUKS containers)
I want a single LUKS container sitting on top of the mdadm RAID10. It is the easiest configuration to reason about, and it is the one recommended by many people on ServerFault, reddit, etc. I don't see how putting LUKS on each disk first and then assembling them into an array would be any better, though I have not tested that. Most people seem to recommend the ordering mdadm => LUKS => LVM => filesystem, as sketched below.
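A minimal sketch of that ordering, with illustrative device and volume names (not my exact commands):

# 1. RAID10 across the six NVMe disks
sudo mdadm --create /dev/md0 --level=10 --raid-devices=6 /dev/nvme[0-5]n1
# 2. A single LUKS container on top of the array
sudo cryptsetup luksFormat /dev/md0
sudo cryptsetup open /dev/md0 md0_crypt
# 3. LVM inside the LUKS container
sudo pvcreate /dev/mapper/md0_crypt
sudo vgcreate vg_data /dev/mapper/md0_crypt
sudo lvcreate -l 100%FREE -n lv_data vg_data
# 4. Filesystem on the logical volume
sudo mkfs.xfs /dev/vg_data/lv_data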
A lot of the advice I see online is about somehow aligning the RAID array's stripe size with something else (LUKS? the filesystem?), but the configuration knobs being recommended no longer seem to exist. For example, on Ubuntu 18.04 I have no stripe_cache_size option to set.
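What I can still inspect for alignment looks roughly like this (as far as I can tell, stripe_cache_size is only exposed for RAID 4/5/6 arrays, which may be why it is missing for my RAID10):

# RAID chunk size; stripe_cache_size does not appear for RAID10 on my system
cat /sys/block/md0/md/chunk_size
ls /sys/block/md0/md/ | grep stripe
# LUKS payload offset, to compare against the chunk size
sudo cryptsetup luksDump /dev/md0 | grep -i offset
# XFS stripe geometry can also be set explicitly (su = chunk size, sw = data-bearing disks)
sudo mkfs.xfs -f -d su=512k,sw=3 /dev/mapper/md0_crypt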
The only thing that has made a difference for me are the instructions on this page. I do have the same CPU, a variant of AMD EPYC.
Is there something fundamentally broken about mdadm + LUKS + filesystem (XFS) on Ubuntu 18.04 with 6 NVMe drives? If so, I'd love to understand what it is. If not, what causes such a large performance gap between non-LUKS and LUKS? I watched CPU and memory while the tests were running and neither was saturated.
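For what it's worth, "not saturated" above is based on overall CPU and memory; something along these lines would also show whether individual cores or the dm-crypt kernel threads are the bottleneck while fio runs (this is just how I would look, not output from the runs above):

# per-core CPU utilization while the fio job is running
mpstat -P ALL 1
# per-thread view of dm-crypt's kernel workers (kcryptd / dmcrypt_write)
top -H -b -n 1 | grep -E 'kcryptd|dmcrypt'
# per-device latency and utilization for md0, dm-0 and the NVMe drives
iostat -x 1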
Curiosity: with a 75/25 R/W mix, mdadm + LUKS + XFS actually outperforms mdadm + XFS. Does that make sense? I would have expected LUKS to always be a bit worse than no LUKS, especially with libaio and direct=1...
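For reference, the mixed workload was along these lines (same parameters as the job at the top, with fio's rwmixread controlling the 75/25 split):

sudo fio --name=mixed --rw=randrw --rwmixread=75 --direct=1 --ioengine=libaio --bs=16k --numjobs=8 --size=10G --runtime=60 --group_reporting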
EDIT 1
@Michael Hampton
processor : 0
vendor_id : AuthenticAMD
cpu family : 23
model : 49
model name : AMD EPYC 7452 32-Core Processor
stepping : 0
microcode : 0x8301034
cpu MHz : 1499.977
cache size : 512 KB
physical id : 0
siblings : 64
core id : 0
cpu cores : 32
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 16
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht s
yscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid extd_apicid aperfmper
f pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extap
ic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_
llc mwaitx cpb cat_l3 cdp_l3 hw_pstate sme ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed ad
x smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero ir
perf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold
avic v_vmsave_vmload vgif umip rdpid overflow_recov succor smca
bugs : sysret_ss_attrs spectre_v1 spectre_v2 spec_store_bypass
bogomips : 4699.84
TLB size : 3072 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 43 bits physical, 48 bits virtual
power management: ts ttp tm hwpstate cpb eff_freq_ro [13] [14]
...and so on, up to processor 63.
What hardware? Well, nvme list:
sudo nvme list
Node SN Model Namespace Usage Format FW Rev
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1 BTLJ0086052F2P0BGN INTEL SSDPE2KX020T8 1 2.00 TB / 2.00 TB 512 B + 0 B VDV10152
/dev/nvme1n1 BTLJ007503YS2P0BGN INTEL SSDPE2KX020T8 1 2.00 TB / 2.00 TB 512 B + 0 B VDV10152
/dev/nvme2n1 BTLJ008609DJ2P0BGN INTEL SSDPE2KX020T8 1 2.00 TB / 2.00 TB 512 B + 0 B VDV10152
/dev/nvme3n1 BTLJ008609KE2P0BGN INTEL SSDPE2KX020T8 1 2.00 TB / 2.00 TB 512 B + 0 B VDV10152
/dev/nvme4n1 BTLJ00860AB92P0BGN INTEL SSDPE2KX020T8 1 2.00 TB / 2.00 TB 512 B + 0 B VDV10152
/dev/nvme5n1 BTLJ007302142P0BGN INTEL SSDPE2KX020T8 1 2.00 TB / 2.00 TB 512 B + 0 B VDV10152
/dev/nvme6n1 BTLJ008609VC2P0BGN INTEL SSDPE2KX020T8 1 2.00 TB / 2.00 TB 512 B + 0 B VDV10152
/dev/nvme7n1 BTLJ0072065K2P0BGN INTEL SSDPE2KX020T8 1 2.00 TB / 2.00 TB 512 B + 0 B VDV10152
What Linux distribution? Ubuntu 18.04 bionic
What kernel? uname -r gives 4.15.0-121-generic
@anx
numactl --hardware gives
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
node 0 size: 1019928 MB
node 0 free: 1015402 MB
node distances:
node 0
0: 10
cryptsetup benchmark gives
# Tests are approximate using memory only (no storage IO).
PBKDF2-sha1 1288176 iterations per second for 256-bit key
PBKDF2-sha256 1466539 iterations per second for 256-bit key
PBKDF2-sha512 1246820 iterations per second for 256-bit key
PBKDF2-ripemd160 916587 iterations per second for 256-bit key
PBKDF2-whirlpool 698119 iterations per second for 256-bit key
argon2i 6 iterations, 1048576 memory, 4 parallel threads (CPUs) for 256-bit key (requested 2000 ms time)
argon2id 6 iterations, 1048576 memory, 4 parallel threads (CPUs) for 256-bit key (requested 2000 ms time)
# Algorithm | Key | Encryption | Decryption
aes-cbc 128b 1011.5 MiB/s 3428.1 MiB/s
serpent-cbc 128b 90.2 MiB/s 581.3 MiB/s
twofish-cbc 128b 174.3 MiB/s 340.6 MiB/s
aes-cbc 256b 777.0 MiB/s 2861.3 MiB/s
serpent-cbc 256b 93.6 MiB/s 581.9 MiB/s
twofish-cbc 256b 179.1 MiB/s 340.6 MiB/s
aes-xts 256b 1630.3 MiB/s 1641.3 MiB/s
serpent-xts 256b 579.2 MiB/s 571.9 MiB/s
twofish-xts 256b 336.2 MiB/s 335.8 MiB/s
aes-xts 512b 1438.0 MiB/s 1438.3 MiB/s
serpent-xts 512b 583.3 MiB/s 571.6 MiB/s
twofish-xts 512b 336.9 MiB/s 335.7 MiB/s
Disk nameplate RIO? Not sure what you mean, but I assume you mean the disk hardware:
Intel SSDPE2KX020T8 - rated at 2000 MB/s random write
@shodanshok
My RAID array was rebuilding, and it did something odd: when I rebooted, it went from /dev/md0 to /dev/md127 and lost the first device.
So I zeroed out the first 1G of each of the 6 disks with dd and then rebuilt the array.
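The wipe was roughly this (I did not keep the exact command, so treat it as a sketch):

# overwrite the first 1 GiB of each member disk to clear the old superblocks
for dev in /dev/nvme{0..5}n1; do
    sudo dd if=/dev/zero of="$dev" bs=1M count=1024
done

Then the rebuild: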
mdadm --create --verbose /dev/md0 --level=10 --raid-devices=6 /dev/nvme[0-5]n1
mdadm: layout defaults to n2
mdadm: layout defaults to n2
mdadm: chunk size defaults to 512K
mdadm: size set to 1953382400K
mdadm: automatically enabling write-intent bitmap on large array
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md0 started.
Now mdadm -D /dev/md0 says
/dev/md0:
Version : 1.2
Creation Time : Tue Oct 20 07:27:19 2020
Raid Level : raid10
Array Size : 5860147200 (5588.67 GiB 6000.79 GB)
Used Dev Size : 1953382400 (1862.89 GiB 2000.26 GB)
Raid Devices : 6
Total Devices : 6
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Tue Oct 20 07:27:50 2020
State : clean, resyncing
Active Devices : 6
Working Devices : 6
Failed Devices : 0
Spare Devices : 0
Layout : near=2
Chunk Size : 512K
Consistency Policy : bitmap
Resync Status : 0% complete
Name : large20q3-co-120:0 (local to host large20q3-co-120)
UUID : 6d422227:dbfac37a:484c8c59:7ce5cf6e
Events : 6
Number Major Minor RaidDevice State
0 259 1 0 active sync set-A /dev/nvme0n1
1 259 0 1 active sync set-B /dev/nvme1n1
2 259 3 2 active sync set-A /dev/nvme2n1
3 259 5 3 active sync set-B /dev/nvme3n1
4 259 7 4 active sync set-A /dev/nvme4n1
5 259 9 5 active sync set-B /dev/nvme5n1
@Mike Andrews
The rebuild has completed.
EDIT 2
So after the rebuild, I created the LUKS container on the array and an XFS filesystem on top of it.
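Roughly like this (the mapper name and mount point are mine, shown just to make the layering explicit):

# LUKS directly on the md array, XFS inside the container
sudo cryptsetup luksFormat /dev/md0
sudo cryptsetup open /dev/md0 md0_crypt
sudo mkfs.xfs /dev/mapper/md0_crypt
sudo mkdir -p /mnt/test
sudo mount /dev/mapper/md0_crypt /mnt/test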
Then I tried fio without specifying the ioengine, and with numjobs increased to 128:
fio --name=randwrite --rw=randwrite --direct=1 --bs=16k --numjobs=128 --size=10G --runtime=60 --group_reporting
randw: (g=0): rw=randwrite, bs=(R) 16.0KiB-16.0KiB, (W) 16.0KiB-16.0KiB, (T) 16.0KiB-16.0KiB, ioengine=psync, iodepth=1
...
fio-3.1
Starting 128 processes
randw: Laying out IO file (1 file / 10240MiB)
Jobs: 128 (f=128): [w(128)][100.0%][r=0KiB/s,w=1432MiB/s][r=0,w=91.6k IOPS][eta 00m:00s]
randw: (groupid=0, jobs=128): err= 0: pid=17759: Wed Oct 21 04:02:36 2020
write: IOPS=103k, BW=1615MiB/s (1693MB/s)(94.9GiB/60148msec)
clat (usec): min=96, max=6186.3k, avg=1231.81, stdev=10343.03
lat (usec): min=97, max=6186.3k, avg=1232.92, stdev=10343.03
clat percentiles (usec):
| 1.00th=[ 898], 5.00th=[ 930], 10.00th=[ 955], 20.00th=[ 971],
| 30.00th=[ 996], 40.00th=[ 1012], 50.00th=[ 1020], 60.00th=[ 1037],
| 70.00th=[ 1057], 80.00th=[ 1090], 90.00th=[ 1827], 95.00th=[ 2024],
| 99.00th=[ 2147], 99.50th=[ 2245], 99.90th=[ 9634], 99.95th=[ 16188],
| 99.99th=[274727]
bw ( KiB/s): min= 32, max=16738, per=0.80%, avg=13266.43, stdev=3544.46, samples=15038
iops : min= 2, max= 1046, avg=828.56, stdev=221.45, samples=15038
lat (usec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.02%, 1000=34.71%
lat (msec) : 2=59.09%, 4=6.03%, 10=0.05%, 20=0.05%, 50=0.01%
lat (msec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
lat (msec) : 2000=0.01%, >=2000=0.01%
cpu : usr=0.31%, sys=2.33%, ctx=6292644, majf=0, minf=1308
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwt: total=0,6216684,0, short=0,0,0, dropped=0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: bw=1615MiB/s (1693MB/s), 1615MiB/s-1615MiB/s (1693MB/s-1693MB/s), io=94.9GiB (102GB), run=60148-60148msec
Disk stats (read/write):
dm-0: ios=3/6532991, merge=0/0, ticks=0/7302772, in_queue=7333424, util=98.56%, aggrios=3/6836535, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
md0: ios=3/6836535, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=0/2127924, aggrmerge=0/51503, aggrticks=0/102167, aggrin_queue=21846, aggrutil=32.64%
nvme0n1: ios=0/2131196, merge=0/51420, ticks=0/110120, in_queue=25668, util=29.16%
nvme3n1: ios=0/2127405, merge=0/51396, ticks=0/96844, in_queue=19064, util=22.12%
nvme2n1: ios=1/2127405, merge=0/51396, ticks=0/102132, in_queue=22128, util=25.15%
nvme5n1: ios=2/2125172, merge=0/51693, ticks=0/92864, in_queue=17464, util=20.39%
nvme1n1: ios=0/2131196, merge=0/51420, ticks=0/116220, in_queue=28492, util=32.64%
nvme4n1: ios=0/2125172, merge=0/51693, ticks=0/94824, in_queue=18264, util=20.72%
Then I unmounted it, removed the LUKS container... and tried mkfs.xfs -f /dev/md0 directly against /dev/md0. It hung for a while... but eventually it finished.
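The teardown was roughly this (again, the mapper name and mount point are from my setup):

# remove the LUKS layer and put XFS directly on the array
sudo umount /mnt/test
sudo cryptsetup close md0_crypt
sudo mkfs.xfs -f /dev/md0   # this is the step that hung for a while
sudo mount /dev/md0 /mnt/test

Then I ran the same fio test against the plain md0 + XFS: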
Jobs: 128 (f=128): [w(128)][100.0%][r=0KiB/s,w=2473MiB/s][r=0,w=158k IOPS][eta 00m:00s]
randw: (groupid=0, jobs=128): err= 0: pid=13910: Wed Oct 21 07:48:59 2020
write: IOPS=276k, BW=4314MiB/s (4523MB/s)(253GiB/60003msec)
clat (usec): min=23, max=853750, avg=460.62, stdev=2832.50
lat (usec): min=24, max=853751, avg=461.24, stdev=2832.50
clat percentiles (usec):
| 1.00th=[ 42], 5.00th=[ 48], 10.00th=[ 53], 20.00th=[ 61],
| 30.00th=[ 68], 40.00th=[ 77], 50.00th=[ 88], 60.00th=[ 102],
| 70.00th=[ 131], 80.00th=[ 693], 90.00th=[ 1762], 95.00th=[ 2180],
| 99.00th=[ 2671], 99.50th=[ 2868], 99.90th=[ 4817], 99.95th=[ 6980],
| 99.99th=[21890]
bw ( KiB/s): min= 1094, max=48449, per=0.78%, avg=34643.43, stdev=7669.85, samples=15360
iops : min= 68, max= 3028, avg=2164.78, stdev=479.37, samples=15360
lat (usec) : 50=7.27%, 100=51.59%, 250=16.09%, 500=3.16%, 750=2.39%
lat (usec) : 1000=2.11%
lat (msec) : 2=10.08%, 4=7.16%, 10=0.12%, 20=0.03%, 50=0.01%
lat (msec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
cpu : usr=0.66%, sys=10.31%, ctx=17040235, majf=0, minf=1605
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwt: total=0,16565027,0, short=0,0,0, dropped=0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: bw=4314MiB/s (4523MB/s), 4314MiB/s-4314MiB/s (4523MB/s-4523MB/s), io=253GiB (271GB), run=60003-60003msec
Disk stats (read/write):
md0: ios=1/16941906, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=0/5682739, aggrmerge=0/2473, aggrticks=0/1218564, aggrin_queue=1186133, aggrutil=74.38%
nvme0n1: ios=0/5685248, merge=0/2539, ticks=0/853448, in_queue=796840, util=66.08%
nvme3n1: ios=0/5681945, merge=0/2474, ticks=0/1807992, in_queue=1812712, util=74.38%
nvme2n1: ios=1/5681946, merge=0/2476, ticks=0/772512, in_queue=718264, util=63.36%
nvme5n1: ios=0/5681023, merge=0/2406, ticks=0/1339628, in_queue=1300048, util=70.97%
nvme1n1: ios=0/5685248, merge=0/2539, ticks=0/1361944, in_queue=1329024, util=70.38%
nvme4n1: ios=0/5681029, merge=0/2406, ticks=0/1175864, in_queue=1159912, util=66.80%