Slow SSD performance - IBM X3650 M4 (7915)

I set up a test environment for development purposes. It consists of an IBM X3650 M4 (7915) server with:

  • 2 x Intel Xeon E5-2690 @ 2.90GHz
  • 96GB of 1333MHz ECC RAM
  • 2 x 146GB 15k rpm HDDs
  • 6 x 525GB SSDs (Crucial MX300)
  • embedded ServeRAID M5110e in JBOD mode, with no cache
  • Ubuntu Server 16.10
  • md software RAID on the HDDs (RAID0) and on the SSDs (RAID10)

I cannot completely bypass the RAID controller, since it is integrated on the motherboard and I have no dedicated HBA card (should I buy one?), but I did set it to JBOD mode.
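
For reference, here is a minimal sketch of how the SSD RAID10 array could be assembled with md on top of the JBOD disks; the device names /dev/sdc through /dev/sdh are an assumption for illustration, not taken from the outputs below:

# Hypothetical device names; adjust to match the actual JBOD layout.
mdadm --create /dev/md1 --level=10 --raid-devices=6 /dev/sd[c-h]

# Check the geometry and resync progress of the new array.
cat /proc/mdstat
mdadm --detail /dev/md1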

I ran several tests on these SSDs, both as single disks and in RAID10 and RAID0 configurations. I observed the expected behavior from the software RAID, but not from the single disks: the RAID scales as expected (fine by me), but the individual SSDs run at only half of their expected IOPS!

The tests were run with fio, using the configuration described by storagereview.com (link).
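
For context, a minimal fio invocation in the spirit of the 4k 100% random read test looks like the sketch below; the exact job options from the linked article may differ, and the target device and IO depth here are assumptions:

# 4k random reads, direct I/O, 60 seconds, queue depth 32 (illustrative values).
fio --name=4k-randread --filename=/dev/sdc --direct=1 \
    --rw=randread --bs=4k --ioengine=libaio --iodepth=32 \
    --runtime=60 --time_based --group_reporting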

Here is a summary graph averaging the runs over all 6 SSDs (1 x 60-second run per SSD):

[Graph: SSD IOPS vs. IO depth for the 4k 100% random read and the 8k 70% random read / 30% random write workloads]

According to various benchmarks (storagereview.com, tomshardware.com, etc.) and the official specs, these disks should reach twice the random read IOPS. For example:

  • For the 4k workload, Tom's Hardware measured a maximum of 92358 read IOPS at IO depth 32, while mine tops out at ~37400 IOPS (link).
  • storagereview.com runs a slightly different benchmark, but it also gives a completely different result: about 90k IOPS for 4k aligned reads (link).
  • Hardware.info reports the same results for the 1TB model (link).

I tweaked various /sys/block/sd* parameters for each /dev/sd*: scheduler, nr_requests, rotational, fifo_batch, and so on.
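
As an example, this is the kind of per-device tuning that was applied; the values shown below are illustrative, not necessarily the exact ones used:

# Illustrative settings for one SSD; repeat for each /dev/sd*.
echo deadline > /sys/block/sdc/queue/scheduler        # I/O scheduler
echo 0 > /sys/block/sdc/queue/rotational              # flag as non-rotational
echo 256 > /sys/block/sdc/queue/nr_requests           # request queue size
echo 1 > /sys/block/sdc/queue/iosched/fifo_batch      # deadline batch size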

What should I look for?

Update 1

I forgot to mention that the disks are over-provisioned by 25%, so the overall size reported in the outputs below is roughly 75% of 525GB. In any case, the IOPS never went above the 37k limit, either before or after over-provisioning.
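
For reference, one common way to over-provision like this is to shrink the drive's visible capacity via the Host Protected Area with hdparm; whether it was actually done this way is an assumption, and the sector count below is only an illustration matching the 769208076 sectors reported by the outputs:

# Illustrative only: cap the visible capacity at ~75% of the native size.
# The 'p' prefix makes the new max sector count persistent across power cycles.
hdparm -Np769208076 /dev/sdc
hdparm -N /dev/sdc    # verify current vs. native max sectors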

Output of hdparm -I /dev/sdc

/dev/sdc:

ATA device, with non-removable media
    Model Number:       Crucial_CT525MX300SSD1                  
    Serial Number:      163113837E16
    Firmware Revision:  M0CR031
    Transport:          Serial, ATA8-AST, SATA 1.0a, SATA II Extensions, SATA Rev 2.5, SATA Rev 2.6, SATA Rev 3.0
Standards:
    Used: unknown (minor revision code 0x006d) 
    Supported: 10 9 8 7 6 5 
    Likely used: 10
Configuration:
    Logical     max current
    cylinders   16383   16383
    heads       16  16
    sectors/track   63  63
    --
    CHS current addressable sectors:   16514064
    LBA    user addressable sectors:  268435455
    LBA48  user addressable sectors:  769208076
    Logical  Sector size:                   512 bytes
    Physical Sector size:                   512 bytes
    Logical Sector-0 offset:                  0 bytes
    device size with M = 1024*1024:      375589 MBytes
    device size with M = 1000*1000:      393834 MBytes (393 GB)
    cache/buffer size  = unknown
    Form Factor: 2.5 inch
    Nominal Media Rotation Rate: Solid State Device
Capabilities:
    LBA, IORDY(can be disabled)
    Queue depth: 32
    Standby timer values: spec'd by Standard, with device specific minimum
    R/W multiple sector transfer: Max = 16  Current = 16
    Advanced power management level: 254
    DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6 
         Cycle time: min=120ns recommended=120ns
    PIO: pio0 pio1 pio2 pio3 pio4 
         Cycle time: no flow control=120ns  IORDY flow control=120ns
Commands/features:
    Enabled Supported:
       *    SMART feature set
            Security Mode feature set
       *    Power Management feature set
       *    Write cache
       *    Look-ahead
       *    WRITE_BUFFER command
       *    READ_BUFFER command
       *    NOP cmd
       *    DOWNLOAD_MICROCODE
       *    Advanced Power Management feature set
       *    48-bit Address feature set
       *    Mandatory FLUSH_CACHE
       *    FLUSH_CACHE_EXT
       *    SMART error logging
       *    SMART self-test
       *    General Purpose Logging feature set
       *    WRITE_{DMA|MULTIPLE}_FUA_EXT
       *    64-bit World wide name
       *    IDLE_IMMEDIATE with UNLOAD
            Write-Read-Verify feature set
       *    WRITE_UNCORRECTABLE_EXT command
       *    {READ,WRITE}_DMA_EXT_GPL commands
       *    Segmented DOWNLOAD_MICROCODE
            unknown 119[8]
       *    Gen1 signaling speed (1.5Gb/s)
       *    Gen2 signaling speed (3.0Gb/s)
       *    Gen3 signaling speed (6.0Gb/s)
       *    Native Command Queueing (NCQ)
       *    Phy event counters
       *    NCQ priority information
       *    READ_LOG_DMA_EXT equivalent to READ_LOG_EXT
       *    DMA Setup Auto-Activate optimization
            Device-initiated interface power management
       *    Software settings preservation
            Device Sleep (DEVSLP)
       *    SMART Command Transport (SCT) feature set
       *    SCT Write Same (AC2)
       *    SCT Features Control (AC4)
       *    SCT Data Tables (AC5)
       *    reserved 69[3]
       *    reserved 69[4]
       *    reserved 69[7]
       *    DOWNLOAD MICROCODE DMA command
       *    WRITE BUFFER DMA command
       *    READ BUFFER DMA command
       *    Data Set Management TRIM supported (limit 8 blocks)
       *    Deterministic read ZEROs after TRIM
Security: 
    Master password revision code = 65534
        supported
    not enabled
    not locked
    not frozen
    not expired: security count
        supported: enhanced erase
    2min for SECURITY ERASE UNIT. 2min for ENHANCED SECURITY ERASE UNIT. 
Logical Unit WWN Device Identifier: 500a075113837e16
    NAA     : 5
    IEEE OUI    : 00a075
    Unique ID   : 113837e16
Device Sleep:
    DEVSLP Exit Timeout (DETO): 50 ms (drive)
    Minimum DEVSLP Assertion Time (MDAT): 10 ms (drive)
Checksum: correct

Output of fdisk -l /dev/sdc

Disk /dev/sdc: 366.8 GiB, 393834534912 bytes, 769208076 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes

Output of cat /sys/block/sdc/queue/scheduler

noop [deadline] cfq

Output of dmesg | grep "ahci\|ncq"

[    5.490677] ahci 0000:00:1f.2: version 3.0
[    5.490901] ahci 0000:00:1f.2: AHCI 0001.0300 32 slots 6 ports 1.5 Gbps 0x2 impl SATA mode
[    5.498675] ahci 0000:00:1f.2: flags: 64bit ncq sntf led clo pio slum part ems apst 
[    5.507315] scsi host1: ahci
[    5.507435] scsi host2: ahci
[    5.507529] scsi host3: ahci
[    5.507620] scsi host4: ahci
[    5.507708] scsi host5: ahci
[    5.507792] scsi host6: ahci
[   14.382326] Modules linked in: ioatdma(+) ipmi_si(+) ipmi_msghandler mac_hid shpchp lpc_ich ib_iser rdma_cm iw_cm ib_cm ib_core configfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi coretemp ip_tables x_tables autofs4 btrfs raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid0 multipath linear raid10 raid1 ses enclosure scsi_transport_sas crct10dif_pclmul crc32_pclmul ghash_clmulni_intel igb aesni_intel hid_generic dca aes_x86_64 lrw ptp glue_helper ablk_helper ahci usbhid cryptd pps_core wmi hid libahci megaraid_sas i2c_algo_bit fjes

Digging deeper into the dmesg output, the following odd messages stood out as quite suspicious:

...
[    0.081418] CPU: Physical Processor ID: 0
[    0.081421] CPU: Processor Core ID: 0
[    0.081427] ENERGY_PERF_BIAS: Set to 'normal', was 'performance'
[    0.081430] ENERGY_PERF_BIAS: View and update with x86_energy_perf_policy(8)
[    0.081434] mce: CPU supports 20 MCE banks
[    0.081462] CPU0: Thermal monitoring enabled (TM1)
...
[    0.341838] cpuidle: using governor menu
[    0.341841] PCCT header not found.
[    0.341868] ACPI FADT declares the system doesn't support PCIe ASPM, so disable it
[    0.341873] ACPI: bus type PCI registered
...
[    1.313494] NET: Registered protocol family 1
[    1.313857] pci 0000:16:00.0: [Firmware Bug]: VPD access disabled
[    1.314223] pci 0000:04:00.0: Video device with shadowed ROM at [mem 0x000c0000-0x000dffff]
...
[    1.591739] PCI: Probing PCI hardware (bus 7f)
[    1.591761] ACPI: \: failed to evaluate _DSM (0x1001)
[    1.591764] PCI host bridge to bus 0000:7f
...
[    1.595018] PCI: root bus ff: using default resources
[    1.595019] PCI: Probing PCI hardware (bus ff)
[    1.595039] ACPI: \: failed to evaluate _DSM (0x1001)
...
[    1.854466] ACPI: Power Button [PWRF]
[    1.855209] ERST: Can not request [mem 0x7e908000-0x7e909bff] for ERST.
[    1.855492] GHES: APEI firmware first mode is enabled by APEI bit and WHEA _OSC.
...

Update 2

My question is not a duplicate of this question, because my IOPS are consistently half of the expected IOPS of a single SSD, not of the whole RAID, even at low IO depths where the total IOPS are very small (<10k).

Look at the graph above: at IO depth 1 the single SSDs average 5794 IOPS, when each of them should reach at least 8000, which is anyway well below my ~40k ceiling. I did not write down the RAID results because they match the expected behavior, but here they are: at IO depths 16 and 32 the RAID10 reaches about 120k IOPS (roughly 40k IOPS per mirrored pair of 2 disks out of the 6, so 3 pairs x 40k, because of the RAID10 mirroring penalty).

I also think that the embedded RAID card could be the bottleneck, but I cannot find a definitive answer. For example, I observed that running the fio test on every SSD in parallel (6 tests at the same time, each one on a different SSD) halves the single-SSD IOPS at IO depths 16 and 32, bringing them down from 40k to 20k.
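
A minimal sketch of such a parallel run, assuming the SSDs are /dev/sdc through /dev/sdh (device names and job parameters are assumptions):

# One fio job per SSD, all started in the background, then wait for all of them.
for dev in /dev/sd{c..h}; do
    fio --name="randread-$(basename "$dev")" --filename="$dev" --direct=1 \
        --rw=randread --bs=4k --ioengine=libaio --iodepth=32 \
        --runtime=60 --time_based --group_reporting &
done
wait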

Answer 1

Let's try the following steps, analyzing the single device sda (a consolidated command sketch follows the list):

  • Check whether the SSD's private DRAM cache is enabled by issuing hdparm -I /dev/sda (post the output here)
  • Make sure your partitions (if any) are correctly aligned (show the output of fdisk -l /dev/sda)
  • Set the scheduler to deadline
  • Make sure NCQ is enabled with dmesg | grep -i ncq (again, post the output here)
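
A quick way to run the checks above in one place, with sda as the example device from the answer (adjust the device name as needed):

hdparm -I /dev/sda | grep -i 'write cache'      # DRAM/write cache state
fdisk -l /dev/sda                               # partition table and alignment
echo deadline > /sys/block/sda/queue/scheduler  # switch to the deadline scheduler
cat /sys/block/sda/queue/scheduler              # confirm [deadline] is selected
dmesg | grep -i ncq                             # NCQ enabled on the SATA link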
