I have a system with two IRST RAID1 arrays: sda + sdb (2 TB) and sdc + sdd (1 TB), as seen from Linux.
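Each pair of disks was bought in a single order, so the two drives in a pair are the same model and the same age.
The 2 TB RAID holds the operating systems (Windows and Linux) plus various data partitions, while the 1 TB RAID holds some non-essential software.
The 1 TB RAID is used only by Windows, whereas the partitions on the 2 TB RAID are used by both operating systems.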
I have now noticed (via smartd on Linux) that the error count of sdc keeps increasing:
smartd[2008]: Device: /dev/sdc [SAT], ATA error count increased from 628 to 651
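That is the only error count that is increasing. In particular, the disk (HGST HTS541010A9E680) has no read errors, no pending sectors, and no reallocated sectors. It also passed the long self-test.
Looking at the errors more closely, they look like this: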
Device Error Count: 651 (device log contains only the most recent 4 errors)
...
Error 651 [2] occurred at disk power-on lifetime: 4947 hours (206 days + 3 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
84 -- 51 00 11 00 00 19 0e 07 8f 09 00 Error: ICRC, ABRT at LBA = 0x190e078f = 420349839
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
60 00 20 00 28 00 00 19 0e 0e 40 40 00 00:00:57.526 READ FPDMA QUEUED
60 00 20 00 20 00 00 19 0e 0e 80 40 00 00:00:57.526 READ FPDMA QUEUED
60 00 20 00 18 00 00 19 0e 0c c0 40 00 00:00:57.526 READ FPDMA QUEUED
60 00 20 00 10 00 00 19 0e 0d 00 40 00 00:00:57.526 READ FPDMA QUEUED
60 00 20 00 08 00 00 19 0e 0d 40 40 00 00:00:57.526 READ FPDMA QUEUED
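One other logged error also occurred at LBA 420349839 (the remaining two logged errors have different LBAs). In addition, the command that caused the error was always READ FPDMA QUEUED.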
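For readers who want to check how the "registers were" row turns into the summary "ICRC, ABRT at LBA = ...", here is a minimal Python sketch. It assumes the column layout shown in the log header (ER, a placeholder, ST, two COUNT bytes, six LBA bytes, DV, DC) and only names the commonly documented Error and Status register bits. The helper names `decode_bits` and `decode_error_row` are made up for this illustration.

```python
# Commonly documented bits of the ATA Error and Status registers.
# Bits not listed here are simply ignored by this sketch.
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x10: "IDNF", 0x04: "ABRT"}
STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x08: "DRQ", 0x01: "ERR"}


def decode_bits(value, names):
    """Return the names of all known bits set in value, highest bit first."""
    return [name for mask, name in sorted(names.items(), reverse=True)
            if value & mask]


def decode_error_row(row):
    """Decode one 'registers were' row of the extended error log.

    Assumed layout: ER, '--', ST, COUNT (2 bytes), LBA (6 bytes), DV, DC.
    """
    fields = row.split()
    error = int(fields[0], 16)
    status = int(fields[2], 16)
    # The six LBA bytes are printed most significant byte first.
    lba = int("".join(fields[5:11]), 16)
    return {
        "error": decode_bits(error, ERROR_BITS),
        "status": decode_bits(status, STATUS_BITS),
        "lba": lba,
    }


if __name__ == "__main__":
    row = "84 -- 51 00 11 00 00 19 0e 07 8f 09 00"
    print(decode_error_row(row))
```

For the row above this yields the error flags ICRC and ABRT and LBA 420349839, which matches the summary printed by smartctl.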
Under Linux the transfer statistics look fine as well (in udma6 mode):
SATA Phy Event Counters (GP Log 0x11)
ID Size Value Description
0x0001 2 0 Command failed due to ICRC error
0x0002 2 0 R_ERR response for data FIS
0x0003 2 0 R_ERR response for device-to-host data FIS
0x0004 2 0 R_ERR response for host-to-device data FIS
0x0005 2 0 R_ERR response for non-data FIS
0x0006 2 0 R_ERR response for device-to-host non-data FIS
0x0007 2 0 R_ERR response for host-to-device non-data FIS
0x0009 2 4 Transition from drive PhyRdy to drive PhyNRdy
0x000a 2 4 Device-to-host register FISes sent due to a COMRESET
0x000b 2 0 CRC errors within host-to-device FIS
0x000d 2 0 Non-CRC errors within host-to-device FIS
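Even after reading blocks at maximum speed, these counters did not increase. Initially I had suspected a bad or loose cable, or radio interference.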
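The check described above (do the CRC and R_ERR counters stay at zero?) can be repeated mechanically. The following Python sketch parses a counter table in the format shown and reports any link-error counter that is non-zero. The set of IDs treated as link errors is my own selection from the table, and the function names are hypothetical.

```python
import re

# Counter IDs from the table that indicate CRC or R_ERR problems on the link.
# (Selection made for this sketch; 0x0009 and 0x000a are left out because
# they count PhyRdy transitions and COMRESETs rather than transfer errors.)
LINK_ERROR_IDS = {0x0001, 0x0002, 0x0003, 0x0004, 0x0005,
                  0x0006, 0x0007, 0x000B, 0x000D}

# One table row: ID, size, value, description.
LINE_RE = re.compile(r"^(0x[0-9a-fA-F]{4})\s+(\d+)\s+(\d+)\s+(.*)$")


def parse_phy_counters(text):
    """Map counter ID -> (value, description) for every table row found."""
    counters = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            counters[int(match.group(1), 16)] = (int(match.group(3)),
                                                 match.group(4))
    return counters


def link_errors(counters):
    """Return only the link-error counters whose value is non-zero."""
    return {cid: entry for cid, entry in counters.items()
            if cid in LINK_ERROR_IDS and entry[0] > 0}


SAMPLE = """\
ID Size Value Description
0x0001 2 0 Command failed due to ICRC error
0x0009 2 4 Transition from drive PhyRdy to drive PhyNRdy
0x000b 2 0 CRC errors within host-to-device FIS
"""

if __name__ == "__main__":
    print(link_errors(parse_phy_counters(SAMPLE)))
```

Running it before and after a heavy read and comparing the two results would show whether any of these counters moved.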
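So I am wondering (since many files are read from the 1 TB RAID by Windows): could this error be caused by the disk being part of a RAID1, by the Intel chipset (8086:2822 (rev 05)), or by running Windows 10? Also, is there a way to map the LBA from the error message to a file on an NTFS partition on the RAID?
The other disk in the RAID has exactly one such error: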
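To make the LBA-to-file question concrete, the first step would presumably be arithmetic like the Python sketch below: find the partition that contains the LBA, then convert the offset into an NTFS cluster number. Everything specific here is an assumption, not taken from my system. The partition boundaries are invented, the sector size is taken as 512 bytes and the cluster size as 4096 bytes, and the sketch assumes that LBAs on a RAID1 member disk coincide with LBAs on the RAID volume. The resulting cluster number could then be given to a tool such as ntfscluster (from ntfsprogs) to look up the owning file.

```python
SECTOR_SIZE = 512     # bytes per logical sector (assumed)
CLUSTER_SIZE = 4096   # bytes per NTFS cluster (assumed, a common default)

# Hypothetical partition layout: (name, first sector, last sector).
# Real values would come from the partition table of the RAID volume.
PARTITIONS = [
    ("part1", 2048, 1026047),
    ("part2", 1026048, 1953523711),
]


def locate(lba, partitions=PARTITIONS):
    """Return (partition, sector within partition, cluster) or None."""
    for name, first, last in partitions:
        if first <= lba <= last:
            sector_in_part = lba - first
            cluster = sector_in_part * SECTOR_SIZE // CLUSTER_SIZE
            return name, sector_in_part, cluster
    return None  # LBA lies outside every listed partition


if __name__ == "__main__":
    print(locate(420349839))
```

With the invented layout above, LBA 420349839 falls into "part2" at sector 419323791, which is cluster 52415473. With the real partition table the numbers would of course differ.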
SMART Extended Comprehensive Error Log Version: 1 (1 sectors)
Device Error Count: 1
...
Error 1 [0] occurred at disk power-on lifetime: 3163 hours (131 days + 19 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
84 -- 51 00 11 00 00 00 03 72 a7 00 00 Error: ICRC, ABRT at LBA = 0x000372a7 = 225959
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
60 00 40 00 00 00 00 00 03 72 78 40 00 00:00:59.573 READ FPDMA QUEUED
60 00 20 00 08 00 00 00 03 41 60 40 00 00:00:59.564 READ FPDMA QUEUED
60 00 80 00 00 00 00 00 03 40 a8 40 00 00:00:59.563 READ FPDMA QUEUED
60 00 70 00 00 00 00 00 03 1c d0 40 00 00:00:59.562 READ FPDMA QUEUED
60 00 30 00 00 00 00 00 03 1c 88 40 00 00:00:59.562 READ FPDMA QUEUED