Diagnosing **very** slow write speeds on RAID / Fedora 14

We have a Dell PowerEdge T110 running Fedora 14 that serves as our embedded Linux build server and Subversion server.

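It has recently become very slow, to the point that the nightly backup no longer finishes before the start of the next working day.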

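[Edit] Thanks to User9517 - I checked the logs and found multiple messages from MRMON (Mega Raid Monitor). Any guidance on interpreting these messages, on next steps, and on how to determine which drive needs to be replaced would be helpful.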

Dec 20 09:02:32 localhost MR_MONITOR[2153]: <MRMON096> Controller ID:  0   PD Predictive failure:  #012    -:-:2
Dec 20 09:06:44 localhost MR_MONITOR[2153]: <MRMON113> Controller ID:  0   Unexpected sense:   PD  #012    =   -:-:2No defect spare location available,   CDB   =    0x28 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00    ,   Sense   =    0x70 0x00 0x04 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x32 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
Dec 20 09:09:44 localhost MR_MONITOR[2153]: <MRMON096> Controller ID:  0   PD Predictive failure:  #012    -:-:2

[/Edit]

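I am looking for help tracking down the fault. I am definitely not an expert in this area, and the person who originally set up the system is no longer here.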

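The nightly backup is a tgz file of about 6 GB and starts at 8 PM. It normally finishes around 4 AM (including the copy to an external drive). The weekly backup is about 45 GB; it normally starts at 8 PM on Friday and finishes at 11 AM on Saturday.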

Apart from the backups, the machine is also noticeably slow to respond even when no backup process is running.

Here is the information I have gathered so far:

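There is one RAID controller, a DELL PERC H200L, with four Seagate 1TB hard drives (ST31000424SS) attached. I think it is set up as RAID 10, but I don't know how to access the configuration of this controller. I believe it is RAID 10 because there are 4 drives, and vgdisplay shows a volume group of 1.81 TiB, roughly half of the 4 TB raw capacity of the 4 drives.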

[root@fedorabox backup]# vgdisplay
  --- Volume group ---
  VG Name               vg_fedorabox2
  System ID
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  8
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                5
  Open LV               5
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               1.81 TiB
  PE Size               32.00 MiB
  Total PE              59263
  Alloc PE / Size       28480 / 890.00 GiB
  Free  PE / Size       30783 / 961.97 GiB

I cannot see any other physical drives in the machine, so I am guessing that the boot partition (/dev/sdb1) is somehow carved out of the 4 drives.

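(/dev/sda is the external hard drive used for backups, but it is not the problem. When we arrive in the morning, the backup is still being generated on the /backup partition; the copy to the USB-attached drive has not started yet.)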

[root@fedorabox backup]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/vg_fedorabox2-LogVol00
                      9.9G  5.2G  4.3G  55% /
tmpfs                 2.0G  932K  2.0G   1% /dev/shm
/dev/sdb1             504M   56M  423M  12% /boot
/dev/mapper/vg_fedorabox2-LogVol03
                      394G  221G  153G  60% /home
/dev/mapper/vg_fedorabox2-LogVol02
                       99G   29G   65G  32% /shared
/dev/mapper/vg_fedorabox2-LogVol01
                       30G   11G   18G  37% /usr
/dev/sda2             5.5T  2.6T  3.0T  47% /mnt/root/usbbackup2
/dev/mapper/vg_fedorabox2-LogVol04
                      345G  363M  327G   1% /backup

As I said in the question, write speed is very slow:

[root@fedorabox backup]# dd if=/dev/zero of=/backup/tmp/test.out bs=512 count=32 oflag=dsync
32+0 records in
32+0 records out
16384 bytes (16 kB) copied, 40.382 s, 0.4 kB/s
[root@fedorabox backup]# dd of=/dev/null if=/backup/tmp/test.out bs=512 count=32 oflag=dsync
32+0 records in
32+0 records out
16384 bytes (16 kB) copied, 3.5087e-05 s, 467 MB/s

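I can access the four drives with smartctl as /dev/sg2 through /dev/sg5. The output is shown below. I don't know what normal readings for corrected errors look like, but I noticed that the second and fourth drives (/dev/sg3, sg5) list uncorrected errors for read and verify.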

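Any suggestions for next steps? Are the uncorrected errors normal or worrying? Are they the cause of the slowness, or should I be looking at something else?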

Any advice on how to replace a drive and how to access the RAID configuration?

[root@fedorabox /]# smartctl -a /dev/sg2
smartctl 5.40 2010-10-16 r3189 [x86_64-redhat-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

Device: SEAGATE  ST1000NM0001     Version: PS06
Serial number: Z1N2LEDW
Device type: disk
Transport protocol: SAS
Local Time is: Mon Dec 19 12:10:20 2022 EST
Device supports SMART and is Enabled
Temperature Warning Disabled or Not Supported
Log Sense failed, IE page [scsi response fails sanity test]

Current Drive Temperature:     37 C
Drive Trip Temperature:        68 C
Manufactured in week 33 of year 2012
Specified cycle count over device lifetime:  10000
Accumulated start-stop cycles:  71
Elements in grown defect list: 36
Vendor (Seagate) cache information
  Blocks sent to initiator = 2805494200
  Blocks received from initiator = 1072424796
  Blocks read from cache and sent to initiator = 19110177
  Number of read and write commands whose size <= segment size = 826634038
  Number of read and write commands whose size > segment size = 5264167
Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 11183.37
  number of minutes until next internal SMART test = 43

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   3996823525        0         0  3996823525          0        130.509           0
write:         0        0         0         0          0      62619.327           0
verify: 1594450892        0         0  1594450892          0      51866.259           0

Non-medium error count:        9

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                  32   11182                 - [-   -    -]

Long (extended) Self Test duration: 11100 seconds [185.0 minutes]
[root@fedorabox /]# smartctl -a /dev/sg3
smartctl 5.40 2010-10-16 r3189 [x86_64-redhat-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

Device: SEAGATE  ST31000424SS     Version: KS68
Serial number: 9WK3JSJV
Device type: disk
Transport protocol: SAS
Local Time is: Mon Dec 19 12:10:44 2022 EST
Device supports SMART and is Enabled
Temperature Warning Disabled or Not Supported
Log Sense failed, IE page [scsi response fails sanity test]

Current Drive Temperature:     37 C
Drive Trip Temperature:        68 C
Manufactured in week 06 of year 2011
Specified cycle count over device lifetime:  10000
Accumulated start-stop cycles:  81
Specified load-unload count over device lifetime:  300000
Accumulated load-unload cycles:  81
Elements in grown defect list: 21
Vendor (Seagate) cache information
  Blocks sent to initiator = 1872227385
  Blocks received from initiator = 3603107317
  Blocks read from cache and sent to initiator = 53905772
  Number of read and write commands whose size <= segment size = 1041622488
  Number of read and write commands whose size > segment size = 5288254
Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 77337.02
  number of minutes until next internal SMART test = 16

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   1454822558        3         0  1454822561   1454822585       2465.838          21
write:         0        0         0         0          0      64012.923           0
verify: 2113323340      143         0  2113323483   2113323510      49057.393          17

Non-medium error count:        4

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                  16     643                 - [-   -    -]
# 2  Background short  Completed                  16       5                 - [-   -    -]
# 3  Background long   Completed                  16       5                 - [-   -    -]

Long (extended) Self Test duration: 11100 seconds [185.0 minutes]
[root@fedorabox /]# smartctl -a /dev/sg4
smartctl 5.40 2010-10-16 r3189 [x86_64-redhat-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

Device: SEAGATE  ST31000424SS     Version: KS68
Serial number: 9WK3H8DW
Device type: disk
Transport protocol: SAS
Local Time is: Mon Dec 19 12:11:02 2022 EST
Device supports SMART and is Enabled
Temperature Warning Disabled or Not Supported
Log Sense failed, IE page [scsi response fails sanity test]

Current Drive Temperature:     38 C
Drive Trip Temperature:        68 C
Manufactured in week 06 of year 2011
Specified cycle count over device lifetime:  10000
Accumulated start-stop cycles:  76
Specified load-unload count over device lifetime:  300000
Accumulated load-unload cycles:  76
Elements in grown defect list: 1
Vendor (Seagate) cache information
  Blocks sent to initiator = 1437832391
  Blocks received from initiator = 3080050213
  Blocks read from cache and sent to initiator = 2689371046
  Number of read and write commands whose size <= segment size = 3306395247
  Number of read and write commands whose size > segment size = 5018225
Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 77337.17
  number of minutes until next internal SMART test = 58

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   1514637706     1007         0  1514638713   1514638713    1576907.538           0
write:         0        0         0         0          0      61240.330           0
verify: 1697580124       32         0  1697580156   1697580157      48889.638           0

Non-medium error count:       27

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                  16      18                 - [-   -    -]
# 2  Background short  Completed                  16       5                 - [-   -    -]
# 3  Background long   Completed                  16       5                 - [-   -    -]

Long (extended) Self Test duration: 11100 seconds [185.0 minutes]
[root@fedorabox /]# smartctl -a /dev/sg5
smartctl 5.40 2010-10-16 r3189 [x86_64-redhat-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

Device: SEAGATE  ST31000424SS     Version: KS68
Serial number: 9WK3FCZ6
Device type: disk
Transport protocol: SAS
Local Time is: Mon Dec 19 12:11:41 2022 EST
Device supports SMART and is Enabled
Temperature Warning Disabled or Not Supported
Log Sense failed, IE page [scsi response fails sanity test]

Current Drive Temperature:     38 C
Drive Trip Temperature:        68 C
Manufactured in week 06 of year 2011
Specified cycle count over device lifetime:  10000
Accumulated start-stop cycles:  81
Specified load-unload count over device lifetime:  300000
Accumulated load-unload cycles:  81
Elements in grown defect list: 4096
Vendor (Seagate) cache information
  Blocks sent to initiator = 923606853
  Blocks received from initiator = 3074269061
  Blocks read from cache and sent to initiator = 3237322768
  Number of read and write commands whose size <= segment size = 3044372010
  Number of read and write commands whose size > segment size = 5024782
Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 77336.67
  number of minutes until next internal SMART test = 53

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   2058067359   277563         0  2058344922   2058345511    1420772.201         555
write:         0        0         0         0          0      62186.800           0
verify: 2750944424     2205         0  2750946629   2750946631      50834.359           1

Non-medium error count:      167

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                  16     643                 - [-   -    -]
# 2  Background short  Completed                  16       5                 - [-   -    -]
# 3  Background long   Completed                  16       5                 - [-   -    -]

Long (extended) Self Test duration: 11100 seconds [185.0 minutes]

Answer 1

The Dell PERC is essentially a rebranded LSI MegaRAID SAS.

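Run lspci -k on your machine to see which driver it uses. Most likely it is megaraid_sas. The fact that you are successfully using MegaRAID Monitor suggests that this is the case. If so, the megacli package lets you control your RAID controller from Linux.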
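The driver check can be narrowed down to the storage controller with a small filter. This is only a sketch: raid_driver is a hypothetical helper name, and the match on "raid", "sas" or "scsi" in the device description is an assumption about how the controller is labelled on your machine.

```shell
# Sketch: print the kernel driver in use for RAID/SAS/SCSI controllers.
# Reads `lspci -k` output on stdin. The keyword match is an assumption;
# adjust it if your controller is described differently.
raid_driver() {
    awk '
        # Device lines start in column 0; detail lines are indented.
        /^[^ \t]/ { dev = $0; want = (tolower($0) ~ /raid|sas|scsi/) }
        want && /Kernel driver in use:/ {
            sub(/^.*Kernel driver in use:[ \t]*/, "")
            print dev " -> " $0
        }'
}
```

It would be used as lspci -k | raid_driver; whatever driver name it prints is the one to look up when choosing a management tool.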

Where to find it for such an old Fedora release is still a question, though. Try looking on https://hwraid.le-vert.net for megaraid sas or, possibly, megacli for the software.

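The software has a brief built-in help (run megacli -h), and it is also described in the MegaRAID SAS Software User Guide, which you can download from Broadcom (the current owner of the former LSI product line, by way of Avago). There are also some cheat sheets on the Internet.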

For example, you can start by getting diagnostic information:

megacli -AdpAllInfo -aALL
megacli -AdpPR -info -aALL
megacli -LdPdInfo -aALL
megacli -AdpBbuCmd -GetBbuStatus -aALL
megacli -AdpEventLog -GetEventLogInfo -aALL

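These commands, respectively:

  • get the controller status and general alerts (including the number of failed devices)
  • get the status of the patrol read operation (a periodic read of all devices to detect failing ones early)
  • get the logical disks, the physical disks they consist of, and their status; if there are failed disks, you will see which ones and which slots they are in
  • get the cache battery status
  • get the adapter event log; this can help you determine exactly when and under what circumstances the problem was detected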
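To keep a record of these outputs for later comparison, the five commands listed above could be wrapped in a small script. This is a minimal sketch under assumptions: collect_raid_diag and the output file names are made up for illustration, and the binary is assumed to be on PATH as megacli (it is often installed as MegaCli or MegaCli64 instead, in which case set MEGACLI accordingly).

```shell
# Sketch: run each diagnostic command and save its output to its own file.
# MEGACLI and the file names are assumptions, not part of the megacli package.
MEGACLI="${MEGACLI:-megacli}"

collect_raid_diag() {
    outdir="$1"
    mkdir -p "$outdir" || return 1
    # name:arguments pairs; the arguments are exactly those listed above
    for entry in \
        "adapter:-AdpAllInfo -aALL" \
        "patrol-read:-AdpPR -info -aALL" \
        "ld-pd:-LdPdInfo -aALL" \
        "bbu:-AdpBbuCmd -GetBbuStatus -aALL" \
        "eventlog:-AdpEventLog -GetEventLogInfo -aALL"
    do
        name="${entry%%:*}"
        args="${entry#*:}"
        # $args is left unquoted on purpose so it splits into separate options
        $MEGACLI $args > "$outdir/$name.txt" 2>&1 \
            || echo "warning: '$MEGACLI $args' exited non-zero" >&2
    done
}
```

Run as root, for example collect_raid_diag /root/raid-diag, it leaves one text file per command in that directory.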

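Having RAID does not mean you are free from monitoring the health of the disks and the array. RAID helps avoid downtime only when it is properly monitored and maintained. smartmontools can even monitor disks behind some hardware RAID controllers; use it!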
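As a rough illustration of such monitoring, the fields of smartctl -a output that differ most between the four drives in the question (grown defect list, uncorrected read and verify errors, non-medium errors) can be reduced to one line per disk. This is a sketch only: smart_summary is a hypothetical helper, and it assumes the SAS-style report layout shown in the question.

```shell
# Sketch: condense `smartctl -a` output (SAS layout, read on stdin) into one line.
# Field positions are assumed from the reports in the question.
smart_summary() {
    awk '
        /^Serial number:/                 { serial = $3 }
        /^Elements in grown defect list:/ { defects = $NF }
        /^(read|verify):/                 { unc[$1] = $NF }   # last column = total uncorrected
        /^Non-medium error count:/        { nonmed = $NF }
        END {
            printf "serial=%s grown_defects=%s read_uncorrected=%s verify_uncorrected=%s non_medium=%s\n",
                   serial, defects, unc["read:"], unc["verify:"], nonmed
        }'
}
```

For the setup in the question it could be called in a loop, for example for d in /dev/sg2 /dev/sg3 /dev/sg4 /dev/sg5; do smartctl -a "$d" | smart_summary; done, and the results compared from one day to the next. On controllers that do not expose the disks as separate sg devices, smartctl's -d megaraid,N device type may be needed instead.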

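It is time to forget the mantras "if it works, don't touch it" and "if it ain't broke, don't fix it". They do not apply in a fast-moving world. Consider this: an old version of an operating system is already broken, simply because it is too old. A competent administrator will fix apparently "unbroken" systems in order to keep them unbroken.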

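Even worse, an ancient (10 years old) non-LTS system like this Fedora is already badly broken. The idea of hosting anything of business importance on such a distribution is wrong by design. If it were CentOS (which was the LTS choice 10 years ago; today you would use Oracle Linux, AlmaLinux or Rocky Linux), it would not be so bad, but Fedora has never been suitable as a production server. So even if it were only two years old, you would have to replace it.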

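It is best to always have the hardware management tools installed (megacli, ipmiutil, and so on). You never know when you will need them, and by then they may no longer be available to you, so prepare in advance.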
