I need a storage server for a scientific experiment with at least 400 TB of total disk space and a write speed (ideally) well above 2 GB/s; the experiment will transfer files of roughly 10 GB each over a network share. I decided on a Dell PowerEdge R740xd2 with 26 drives of 20 TB each in a RAID0 configuration.
The RAID controller is a Broadcom / LSI MegaRAID SAS-3 3108 [Invader] (rev 02).
The drives are DELLEMC Exos X20 - 20TB 512e RPM 7.2K - SAS 12 Gbps. According to the spec sheet, the Exos X20 is capable of 272 MB/s (up to 285 MB/s). For example, the Exos X20 benchmark on hardwareluxx.de reports a write speed of 265.4 MB/s and sequential writes of as much as 281.1 MB/s.
So in principle these 26 drives in RAID0 should deliver roughly 26 × 272 MB/s ≈ 7 GB/s of write throughput.
In the hardware RAID configuration I set the stripe size to 1 MB, the highest possible value, and created an LVM volume with an EXT4 filesystem on top.
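For reference, the layout was created along these lines (a reconstruction, not the exact commands; /dev/sdb is inferred from the fio disk stats below, the 20 TiB logical volume size from the tune2fs output, and the mount point is a placeholder):
# pvcreate /dev/sdb
# vgcreate data-vg /dev/sdb
# lvcreate -L 20T -n data data-vg
# mkfs.ext4 /dev/mapper/data--vg-data
# mount /dev/mapper/data--vg-data /data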
A read speed check with hdparm was already rather disappointing (2.6 GB/s):
# hdparm -Tt /dev/sda
/dev/sda:
Timing cached reads: 19172 MB in 2.00 seconds = 9595.81 MB/sec
SG_IO: bad/missing sense data, sb[]: 70 00 05 00 00 00 00 0d 00 00 00 00 20 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Timing buffered disk reads: 6966 MB in 3.00 seconds = 2321.66 MB/sec
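(hdparm is a fairly crude benchmark, as the SG_IO warning above hints; a direct sequential read with fio against the virtual disk would be a more reliable cross-check. A sketch, with the device name assumed to match the hdparm test:)
# fio --name=seqread --ioengine=libaio --direct=1 --bs=1M --iodepth=64 --size=10G --readwrite=read --filename=/dev/sda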
A sequential write test with fio is ridiculously slow (400 MiB/s):
# fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=1M --iodepth=64 --size=10G --readwrite=write
test: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=64
fio-3.16
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][w=382MiB/s][w=382 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=4485: Fri Jan 13 11:02:30 2023
write: IOPS=401, BW=402MiB/s (421MB/s)(10.0GiB/25485msec); 0 zone resets
bw ( KiB/s): min=272384, max=485376, per=99.80%, avg=410624.00, stdev=32440.49, samples=50
iops : min= 266, max= 474, avg=401.00, stdev=31.68, samples=50
cpu : usr=1.95%, sys=2.67%, ctx=963, majf=0, minf=9
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.2%, 32=0.3%, >=64=99.4%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued rwts: total=0,10240,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=64
Run status group 0 (all jobs):
WRITE: bw=402MiB/s (421MB/s), 402MiB/s-402MiB/s (421MB/s-421MB/s), io=10.0GiB (10.7GB), run=25485-25485msec
Disk stats (read/write):
dm-0: ios=0/10370, merge=0/0, ticks=0/1605452, in_queue=1605452, util=99.30%, aggrios=0/10349, aggrmerge=0/21, aggrticks=0/1601612, aggrin_queue=1581044, aggrutil=99.23%
sdb: ios=0/10349, merge=0/21, ticks=0/1601612, in_queue=1581044, util=99.23%
I wonder what is going wrong here; most likely I am overlooking something obvious. Any ideas? I have not yet tuned the EXT4 stride and stripe width, but I would assume the default settings should already deliver acceptable performance.
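If that tuning does turn out to matter, the numbers for this geometry would presumably be: stride = 1 MiB stripe / 4 KiB block = 256 blocks, and stripe width = 256 × 26 drives = 6656 blocks (my arithmetic, untested so far):
# tune2fs -E stride=256,stripe_width=6656 /dev/mapper/data--vg-data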
Some more information about the system:
RAID controller:
18:00.0 RAID bus controller: Broadcom / LSI MegaRAID SAS-3 3108 [Invader] (rev 02)
DeviceName: Integrated RAID
Subsystem: Dell PERC H730P Mini
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0
Interrupt: pin A routed to IRQ 34
NUMA node: 0
Region 0: I/O ports at 4000 [size=256]
Region 1: Memory at 9d900000 (64-bit, non-prefetchable) [size=64K]
Region 3: Memory at 9d800000 (64-bit, non-prefetchable) [size=1M]
Expansion ROM at <ignored> [disabled]
Capabilities: [50] Power Management version 3
Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [68] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 4096 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 0.000W
DevCtl: CorrErr- NonFatalErr+ FatalErr+ UnsupReq+
RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
MaxPayload 256 bytes, MaxReadReq 512 bytes
DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr- TransPend-
LnkCap: Port #0, Speed 8GT/s, Width x8, ASPM L0s, Exit Latency L0s <2us
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 8GT/s (ok), Width x8 (ok)
TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range BC, TimeoutDis+, NROPrPrP-, LTR-
10BitTagComp-, 10BitTagReq-, OBFF Not Supported, ExtFmt-, EETLPPrefix-
EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
FRS-, TPHComp-, ExtTPHComp-
AtomicOpsCap: 32bit- 64bit- 128bitCAS-
DevCtl2: Completion Timeout: 65ms to 210ms, TimeoutDis-, LTR-, OBFF Disabled
AtomicOpsCtl: ReqEn-
LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+, EqualizationPhase1+
EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
Capabilities: [a8] MSI: Enable- Count=1/1 Maskable+ 64bit+
Address: 0000000000000000 Data: 0000
Masking: 00000000 Pending: 00000000
Capabilities: [c0] MSI-X: Enable+ Count=97 Masked-
Vector table: BAR=1 offset=0000e000
PBA: BAR=1 offset=0000f000
Capabilities: [100 v2] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt+ RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP+ FCP+ CmpltTO+ CmpltAbrt+ UnxCmplt- RxOF+ MalfTLP+ ECRC+ UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
CEMsk: RxErr+ BadTLP+ BadDLLP+ Rollover+ Timeout+ AdvNonFatalErr+
AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
HeaderLog: 04000001 1710000f 18080000 b9620497
Capabilities: [1e0 v1] Secondary PCI Express
LnkCtl3: LnkEquIntrruptEn-, PerformEqu-
LaneErrStat: 0
Capabilities: [1c0 v1] Power Budgeting <?>
Capabilities: [148 v1] Alternative Routing-ID Interpretation (ARI)
ARICap: MFVC- ACS-, Next Function: 0
ARICtl: MFVC- ACS-, Function Group: 0
Kernel driver in use: megaraid_sas
Kernel modules: megaraid_sas
Latency below 300 µs (measured with ioping) is fine:
# ioping -c 10 .
4 KiB <<< . (ext4 /dev/dm-0): request=1 time=575.2 us (warmup)
4 KiB <<< . (ext4 /dev/dm-0): request=2 time=235.1 us
4 KiB <<< . (ext4 /dev/dm-0): request=3 time=257.0 us
4 KiB <<< . (ext4 /dev/dm-0): request=4 time=269.3 us
4 KiB <<< . (ext4 /dev/dm-0): request=5 time=288.5 us
4 KiB <<< . (ext4 /dev/dm-0): request=6 time=284.8 us
4 KiB <<< . (ext4 /dev/dm-0): request=7 time=272.8 us
Here is the output of tune2fs -l:
# tune2fs -l /dev/mapper/data--vg-data
tune2fs 1.45.5 (07-Jan-2020)
Filesystem volume name: <none>
Last mounted on: <not available>
Filesystem UUID: 57c19b70-c1f3-4af9-85ec-ae3ac191c7a7
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal ext_attr dir_index filetype extent 64bit flex_bg sparse_super large_file huge_file dir_nlink extra_isize metadata_csum
Filesystem flags: signed_directory_hash
Default mount options: user_xattr acl
Filesystem state: clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 335544320
Block count: 5368709120
Reserved block count: 268435456
Free blocks: 5347083745
Free inodes: 335544309
First block: 0
Block size: 4096
Fragment size: 4096
Group descriptor size: 64
Blocks per group: 32768
Fragments per group: 32768
Inodes per group: 2048
Inode blocks per group: 128
Flex block group size: 16
Filesystem created: Fri Jan 13 10:51:35 2023
Last mount time: Fri Jan 13 10:51:37 2023
Last write time: Fri Jan 13 10:51:37 2023
Mount count: 1
Maximum mount count: -1
Last checked: Fri Jan 13 10:51:35 2023
Check interval: 0 (<none>)
Lifetime writes: 41 MB
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 256
Required extra isize: 32
Desired extra isize: 32
Journal inode: 8
Default directory hash: half_md4
Directory Hash Seed: 285a1147-264c-4e28-87b8-22e27d407d98
Journal backup: inode blocks
Checksum type: crc32c
Checksum: 0x41f0d8c7
Answer 1
I think you are being limited by the backplane itself.
I have a 4U storage server, also with an LSI MegaRAID SAS-3 3108 with 4 GB of cache, configured with different RAID levels on different volumes. After running many tests I found that the maximum write speed was also about 2.6 GB/s, exactly the same as yours!
In the end I gave up and concluded that it is a hardware limitation.
Here is a brief summary of my test results:
RAID 6 configured as 4 volumes, each volume consisting of 8 Western Digital 18 TB drives. The maximum write speed to a single volume is 1.6 GB/s, limited by 6 data drives (a RAID6 of 8 drives has only 6 independent data drives) × 260 MB/s (single-drive speed). But writing to two volumes at the same time only reaches about 2.6 GB/s in total, and writing to three or four volumes together also totals 2.6 GB/s.
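For reference, a sketch of how such an aggregate can be measured in a single fio run writing to two volumes at once (the mount points are placeholders, adjust to your setup; options before the first --name apply to all jobs):
# fio --ioengine=libaio --direct=1 --bs=1M --iodepth=64 --size=10G --rw=write \
      --name=vol1 --filename=/mnt/vol1/test \
      --name=vol2 --filename=/mnt/vol2/test
If the total stays pinned at the same figure no matter how many volumes are written, a shared path (backplane, expander, or the controller itself) is the natural suspect.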