Slow write performance on hardware RAID0 with 26 drives

I need a storage server for a scientific experiment with at least 400TB of total disk space and a write speed (ideally) well above 2GB/s; it will serve files of about 10GB each over a network share. I settled on a Dell PowerEdge R740xd2 with 26 drives of 20TB each in a RAID0 configuration.

The RAID controller is a Broadcom / LSI MegaRAID SAS-3 3108 [Invader] (rev 02), and the drives are DELLEMC Exos X20 - 20TB 512e 7.2K RPM - SAS 12 Gbps. According to the spec sheet, the Exos X20 can sustain 272 MB/s (up to 285 MB/s).

For example, a benchmark on hardwareluxx.de reports a write speed of 265.4 MB/s and sequential writes of up to 281.1 MB/s (Exos X20 benchmarks).

So in principle, these 26 drives in RAID0 should deliver around 7GB/s of write throughput.
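As a quick sanity check on that figure, assuming RAID0 scales linearly across all 26 spindles (the ideal case):

```shell
# Idealized RAID0 scaling: 26 drives x 272 MB/s sustained each
echo "$((26 * 272)) MB/s"   # prints "7072 MB/s", i.e. ~7 GB/s aggregate
```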

In the hardware RAID configuration I set the stripe size to 1MB, the highest possible value, and created an LVM volume formatted with EXT4 on top.
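For reference, the provisioning was roughly the following. This is only a sketch: the device name /dev/sdb is an assumption taken from the disk stats in the fio output, and the VG/LV names "data-vg"/"data" are inferred from the /dev/mapper/data--vg-data path in the tune2fs output.

```shell
# Sketch of the provisioning steps (destructive; names assumed as noted above).
pvcreate /dev/sdb                      # make the RAID0 virtual disk an LVM PV
vgcreate data-vg /dev/sdb              # volume group "data-vg"
lvcreate -n data -l 100%FREE data-vg   # one LV spanning the whole VG
mkfs.ext4 /dev/data-vg/data            # default ext4, no stride tuning yet
```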

A read speed check with hdparm was already fairly disappointing (2.6GB/s):

# hdparm -Tt /dev/sda

/dev/sda:
 Timing cached reads:   19172 MB in  2.00 seconds = 9595.81 MB/sec
SG_IO: bad/missing sense data, sb[]:  70 00 05 00 00 00 00 0d 00 00 00 00 20 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 Timing buffered disk reads: 6966 MB in  3.00 seconds = 2321.66 MB/sec

A sequential write test with fio shows ridiculously low speeds (400MiB/s):

# fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=1M --iodepth=64 --size=10G --readwrite=write
test: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=64
fio-3.16
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][w=382MiB/s][w=382 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=4485: Fri Jan 13 11:02:30 2023
  write: IOPS=401, BW=402MiB/s (421MB/s)(10.0GiB/25485msec); 0 zone resets
   bw (  KiB/s): min=272384, max=485376, per=99.80%, avg=410624.00, stdev=32440.49, samples=50
   iops        : min=  266, max=  474, avg=401.00, stdev=31.68, samples=50
  cpu          : usr=1.95%, sys=2.67%, ctx=963, majf=0, minf=9
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.2%, 32=0.3%, >=64=99.4%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=0,10240,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
  WRITE: bw=402MiB/s (421MB/s), 402MiB/s-402MiB/s (421MB/s-421MB/s), io=10.0GiB (10.7GB), run=25485-25485msec

Disk stats (read/write):
    dm-0: ios=0/10370, merge=0/0, ticks=0/1605452, in_queue=1605452, util=99.30%, aggrios=0/10349, aggrmerge=0/21, aggrticks=0/1601612, aggrin_queue=1581044, aggrutil=99.23%
  sdb: ios=0/10349, merge=0/21, ticks=0/1601612, in_queue=1581044, util=99.23%
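One variable worth ruling out is single-submitter overhead. A variant of the same fio run spread over several parallel jobs (a diagnostic sketch only, parameters otherwise as above; fio creates its own per-job files) would look like:

```shell
# Same sequential-write workload, but with 4 parallel jobs and a single
# aggregated result, to rule out a single-submitter bottleneck.
fio --ioengine=libaio --direct=1 --name=test --bs=1M --iodepth=64 \
    --size=10G --readwrite=write --numjobs=4 --group_reporting
```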

I wonder what is going wrong here; most likely I am overlooking something obvious. Any ideas? I have not tuned the EXT4 stride and stripe-width yet, but I would expect the defaults to already deliver acceptable performance.
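For what it's worth, the ext4 alignment values for this geometry are easy to derive (assuming the 1 MiB hardware stripe, 4 KiB ext4 blocks, and 26 data disks described above; the commented mkfs invocation and LV path are a sketch, not what was actually run):

```shell
# ext4 RAID alignment, derived from the geometry above:
#   stride       = stripe size / block size = 1 MiB / 4 KiB
#   stripe-width = stride * number of data disks (26 in RAID0)
stride=$((1024 * 1024 / 4096))
stripe_width=$((stride * 26))
echo "stride=$stride stripe-width=$stripe_width"   # stride=256 stripe-width=6656
# mkfs.ext4 -E stride=256,stripe-width=6656 /dev/data-vg/data
```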

Some more information about the system:

RAID controller:

18:00.0 RAID bus controller: Broadcom / LSI MegaRAID SAS-3 3108 [Invader] (rev 02)
        DeviceName: Integrated RAID
        Subsystem: Dell PERC H730P Mini
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin A routed to IRQ 34
        NUMA node: 0
        Region 0: I/O ports at 4000 [size=256]
        Region 1: Memory at 9d900000 (64-bit, non-prefetchable) [size=64K]
        Region 3: Memory at 9d800000 (64-bit, non-prefetchable) [size=1M]
        Expansion ROM at <ignored> [disabled]
        Capabilities: [50] Power Management version 3
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [68] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 4096 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 0.000W
                DevCtl: CorrErr- NonFatalErr+ FatalErr+ UnsupReq+
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 256 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr- TransPend-
                LnkCap: Port #0, Speed 8GT/s, Width x8, ASPM L0s, Exit Latency L0s <2us
                        ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 8GT/s (ok), Width x8 (ok)
                        TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range BC, TimeoutDis+, NROPrPrP-, LTR-
                         10BitTagComp-, 10BitTagReq-, OBFF Not Supported, ExtFmt-, EETLPPrefix-
                         EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
                         FRS-, TPHComp-, ExtTPHComp-
                         AtomicOpsCap: 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 65ms to 210ms, TimeoutDis-, LTR-, OBFF Disabled
                         AtomicOpsCtl: ReqEn-
                LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+, EqualizationPhase1+
                         EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
        Capabilities: [a8] MSI: Enable- Count=1/1 Maskable+ 64bit+
                Address: 0000000000000000  Data: 0000
                Masking: 00000000  Pending: 00000000
        Capabilities: [c0] MSI-X: Enable+ Count=97 Masked-
                Vector table: BAR=1 offset=0000e000
                PBA: BAR=1 offset=0000f000
        Capabilities: [100 v2] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt+ RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP+ FCP+ CmpltTO+ CmpltAbrt+ UnxCmplt- RxOF+ MalfTLP+ ECRC+ UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
                CEMsk:  RxErr+ BadTLP+ BadDLLP+ Rollover+ Timeout+ AdvNonFatalErr+
                AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
                        MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
                HeaderLog: 04000001 1710000f 18080000 b9620497
        Capabilities: [1e0 v1] Secondary PCI Express
                LnkCtl3: LnkEquIntrruptEn-, PerformEqu-
                LaneErrStat: 0
        Capabilities: [1c0 v1] Power Budgeting <?>
        Capabilities: [148 v1] Alternative Routing-ID Interpretation (ARI)
                ARICap: MFVC- ACS-, Next Function: 0
                ARICtl: MFVC- ACS-, Function Group: 0
        Kernel driver in use: megaraid_sas
        Kernel modules: megaraid_sas

Latency below 300us (measured with ioping) looks fine:

# ioping -c 10 .
4 KiB <<< . (ext4 /dev/dm-0): request=1 time=575.2 us (warmup)
4 KiB <<< . (ext4 /dev/dm-0): request=2 time=235.1 us
4 KiB <<< . (ext4 /dev/dm-0): request=3 time=257.0 us
4 KiB <<< . (ext4 /dev/dm-0): request=4 time=269.3 us
4 KiB <<< . (ext4 /dev/dm-0): request=5 time=288.5 us
4 KiB <<< . (ext4 /dev/dm-0): request=6 time=284.8 us
4 KiB <<< . (ext4 /dev/dm-0): request=7 time=272.8 us

Here is the output of tune2fs -l:

# tune2fs -l /dev/mapper/data--vg-data
tune2fs 1.45.5 (07-Jan-2020)
Filesystem volume name:   <none>
Last mounted on:          <not available>
Filesystem UUID:          57c19b70-c1f3-4af9-85ec-ae3ac191c7a7
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr dir_index filetype extent 64bit flex_bg sparse_super large_file huge_file dir_nlink extra_isize metadata_csum
Filesystem flags:         signed_directory_hash
Default mount options:    user_xattr acl
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              335544320
Block count:              5368709120
Reserved block count:     268435456
Free blocks:              5347083745
Free inodes:              335544309
First block:              0
Block size:               4096
Fragment size:            4096
Group descriptor size:    64
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         2048
Inode blocks per group:   128
Flex block group size:    16
Filesystem created:       Fri Jan 13 10:51:35 2023
Last mount time:          Fri Jan 13 10:51:37 2023
Last write time:          Fri Jan 13 10:51:37 2023
Mount count:              1
Maximum mount count:      -1
Last checked:             Fri Jan 13 10:51:35 2023
Check interval:           0 (<none>)
Lifetime writes:          41 MB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               256
Required extra isize:     32
Desired extra isize:      32
Journal inode:            8
Default directory hash:   half_md4
Directory Hash Seed:      285a1147-264c-4e28-87b8-22e27d407d98
Journal backup:           inode blocks
Checksum type:            crc32c
Checksum:                 0x41f0d8c7

Answer 1

I think this is limited by the RAID controller/backplane itself.

I have a 4U storage server, also with an LSI MegaRAID SAS-3 3108 (4GB cache), configured with different RAID levels on different volumes. After many tests I found the maximum write speed was also around 2.6GB/s, exactly the same as yours!

In the end I gave up and concluded it is a hardware limitation.

Here is a brief summary of my test results:

RAID 6 configured as 4 volumes, each consisting of 8 Western Digital 18TB drives. The maximum write speed to a single volume is 1.6GB/s, which matches 6 drives (a RAID6 of 8 drives has only 6 data drives) times 260MB/s (single-drive speed). Writing to two volumes at the same time only reaches about 2.6GB/s in total, and writing to three or four volumes together also tops out at 2.6GB/s.
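The per-volume ceiling works out as follows (under the assumption of linear scaling across the 6 data disks):

```shell
# RAID6 of 8 drives => 6 data drives, ~260 MB/s per drive
echo "$((6 * 260)) MB/s"   # prints "1560 MB/s", i.e. ~1.6 GB/s per volume
```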
