Motherboard / controller / hard drive gone bad?

2024-5-27

centos hard-drive controller

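On a rented server I've run into timing problems with some applications that need precise timekeeping. The server is a dual Xeon E5410 on a Supermicro X7DVL-3 motherboard, running CentOS 5.5 x64.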

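The application I run is sensitive to timers and constantly detects drift, both at idle and under load, but especially under load. I did some digging with atop and dd and found some alarming numbers. Mind you, I'm no Linux expert, but something definitely seems off.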

I ran:

dd bs=4096 if=/dev/zero of=/bigtestfile

to generate disk activity. It makes no difference whether I write to sda or sdb: the DSK busy value in atop goes above 100%, peaking at 1700%.

DSK |         sdb | busy    675% | read       0 | write    110 | avio   78 ms |
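
A note on the test itself: the dd command above has no count=, so it keeps writing until the filesystem is full, and the early writes may land in the page cache rather than on disk. A bounded variant can be sketched as follows. The function name bounded_write_test, the 64 MiB default and the target path are illustrative assumptions, not something from the original post, and conv=fdatasync requires GNU dd.

```shell
#!/bin/sh
# bounded_write_test FILE [MIB]
# Hypothetical helper for illustration (not from the original post).
# Writes MIB mebibytes of zeros to FILE and flushes the data to disk before
# dd exits, so the rate dd reports includes the real disk write.
# Assumes GNU dd (bs=1M and conv=fdatasync).
bounded_write_test() {
    target=$1
    mib=${2:-64}    # default size is an arbitrary choice for this sketch
    dd if=/dev/zero of="$target" bs=1M count="$mib" conv=fdatasync
}
```

For example, `bounded_write_test /bigtestfile 1024` writes 1 GiB and stops; remove the file afterwards.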

Here is the smartctl output:

# smartctl -A /dev/sda
smartctl version 5.38 [x86_64-redhat-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0007   165   165   021    Pre-fail  Always       -       2750
  4 Start_Stop_Count        0x0032   100   100   040    Old_age   Always       -       21
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000a   200   200   051    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   065   065   000    Old_age   Always       -       25831
 10 Spin_Retry_Count        0x0012   100   253   051    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0012   100   253   051    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       21
194 Temperature_Celsius     0x0022   116   093   000    Old_age   Always       -       27
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0012   200   200   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x000a   200   253   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   051    Old_age   Offline      -       0


# smartctl -A /dev/sdb
smartctl version 5.38 [x86_64-redhat-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0003   180   180   021    Pre-fail  Always       -       3958
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       22
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   200   200   051    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   068   068   000    Old_age   Always       -       24087
 10 Spin_Retry_Count        0x0013   100   253   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0013   100   253   051    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       21
194 Temperature_Celsius     0x0022   122   096   000    Old_age   Always       -       25
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0009   200   200   051    Pre-fail  Offline      -       0
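
To scan listings like these quickly, a small filter can print only the attributes whose raw value is non-zero. This is a minimal sketch: the function name smart_suspects and the choice of attribute IDs (5, 196, 197, 198, 199) are assumptions made for illustration, not a vendor rule, and it relies on the ten-column layout shown above.

```shell
#!/bin/sh
# smart_suspects: reads `smartctl -A` output on stdin and prints
# NAME=RAW_VALUE for selected attributes whose raw value is non-zero.
# The attribute selection below is an assumption made for this sketch.
# Column 1 is the attribute ID, column 2 the name, column 10 the raw value.
smart_suspects() {
    awk '$1 ~ /^(5|196|197|198|199)$/ && $10 ~ /^[0-9]+$/ && $10 + 0 > 0 {
        print $2 "=" $10
    }'
}
```

Usage would be `smartctl -A /dev/sda | smart_suspects`. For both listings above it prints nothing, since those raw values are all 0.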

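Any idea what's wrong here? Is the motherboard bad? Both drives failing at once seems unlikely (smartctl says they PASS), so to me the motherboard looks like the culprit.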

Answer 1

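Some drift is unavoidable. Clock disciplining, as provided by NTP and the like, helps remove it. Linux has several timers to choose from, and some of them are prone to load-related drift. On a two-disk system it isn't surprising for disk I/O to cause drift, since the storage controller and the timekeeping hardware may sit on the same southbridge chip.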

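The HPET timer is more precise, but it still needs correction to stay aligned with UTC. A more precise timer needs either software that keeps the time from drifting (ntp, for example) or special hardware.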
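
To see which timer the kernel is actually using, the clocksource files in sysfs can be read where the kernel exposes them. A minimal sketch follows; the function name show_clocksource is hypothetical, and whether the CentOS 5.5 kernel provides these files is an assumption to check on the machine itself.

```shell
#!/bin/sh
# show_clocksource [DIR]
# Hypothetical helper: prints the clocksource in use and the ones available.
# DIR defaults to the usual sysfs location; older kernels may not provide
# these files, in which case a note is printed instead.
show_clocksource() {
    dir=${1:-/sys/devices/system/clocksource/clocksource0}
    if [ -r "$dir/current_clocksource" ] && [ -r "$dir/available_clocksource" ]; then
        echo "current: $(cat "$dir/current_clocksource")"
        echo "available: $(cat "$dir/available_clocksource")"
    else
        echo "clocksource sysfs files not found under $dir"
    fi
}
```

Running `show_clocksource` with no argument reads the default location. Alongside it, `ntpq -p` shows the offset that NTP is currently correcting for.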

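As for the excessive DSK time, I've seen IOWAIT climb to crazy levels before. It happens when the disk subsystem can't keep up with demand, and your dd command is designed to throw a lot of data at the disk in a short time. On a two-disk system, this seems... unusual. I suspect a bad data path somewhere in the motherboard firmware; a hardware fault should leave screaming traces in dmesg.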
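
Checking dmesg for those traces can be sketched as a simple filter. The function name disk_errors and the pattern list are illustrative assumptions; the patterns are not exhaustive and may need adjusting for the storage driver in use.

```shell
#!/bin/sh
# disk_errors: reads kernel log text on stdin and prints lines that look
# like storage trouble. Hypothetical helper; the pattern list is an
# assumption for this sketch and is not exhaustive.
# Exit status follows grep: non-zero when nothing matches.
disk_errors() {
    pattern='I/O error|medium error|ata[0-9]+(\.[0-9]+)?: .*(exception|failed|reset|error)'
    grep -E -i "$pattern"
}
```

Usage would be `dmesg | disk_errors`. No output suggests the kernel has not logged an obvious disk or link error, which does not by itself rule out a hardware fault.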

Answer 2

That's odd. As a long shot, I'd first try reseating the cables, and replace them if reseating doesn't help.

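I've seen hard drives start to develop bad sectors and their performance drop off badly. But as you say, two drives failing at the same time is unlikely, which points to the controller or the motherboard.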

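If possible, I'd try removing one drive at a time and rerunning the test, to see whether the performance problem persists or only shows up when both drives are connected.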

Good luck.
