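My current storage setup is a Linux machine with two traditional HDDs and two SSDs, each pair in its own RAID 1 array and encrypted with LUKS. I have more of a story than a specific question.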
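For over a year I have been getting random "hard resetting link" errors in the kernel log for some of my drives. I would RMA the offending drive, and the new drive would make the problem stop. A few months later I would end up seeing the same errors again at seemingly random times. The drive would be marked as failed in the RAID and would no longer show up in fdisk -l. I would reboot the machine, the drive would show up again, I could re-add it to the array, and it would rebuild. Sooner or later the problem would happen again, usually within a few hours.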
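About six months ago I replaced two of the traditional HDDs with SSDs, hoping they would not fail as often as traditional drives. However, over the past few days I have started having the problem with one of the new SSDs and one of the traditional drives.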
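I'm starting to see a pattern: I buy a new drive, and a few months later the problems begin. I had always put this down to HDDs having a high failure rate, but now an SSD is showing the same problem, so I don't think the drives are at fault. What else could it be? I have installed multiple operating systems since the problem first appeared, so I'd like to rule out software. That leaves the SATA cables or the motherboard. Could disk encryption be putting too much stress on the drives? Is there anything I can do to find out more? Thanks as always.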
Below is the dmesg output from a question I asked a few months ago when I was having the same problem.
[43161.734107] ata3: ATA_REG 0x41 ERR_REG 0x84
[43161.734110] ata3: tag : dhfis dmafis sdbfis sactive
[43161.734113] ata3: tag 0x0: 1 1 0 1
[43161.734123] ata3.00: exception Emask 0x1 SAct 0x1 SErr 0x180000 action 0x6 frozen
[43161.734127] ata3.00: Ata error. fis:0x21
[43161.734130] ata3: SError: { 10B8B Dispar }
[43161.734134] ata3.00: failed command: READ FPDMA QUEUED
[43161.734142] ata3.00: cmd 60/08:00:a8:03:00/00:00:00:00:00/40 tag 0 ncq 4096 in
[43161.734144] res 41/84:04:a8:03:00/84:00:00:00:00/40 Emask 0x10 (ATA bus error)
[43161.734148] ata3.00: status: { DRDY ERR }
[43161.734150] ata3.00: error: { ICRC ABRT }
[43161.734155] ata3: hard resetting link
[43161.734158] ata3: nv: skipping hardreset on occupied port
[43162.220095] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[43162.260202] ata3.00: model number mismatch 'WDC WD2002FAEX-007BA0' != 'C WD2002FAEX-007BA0 �'
[43162.260206] ata3.00: revalidation failed (errno=-19)
[43162.260211] ata3.00: limiting speed to UDMA/133:PIO2
[43167.220123] ata3: hard resetting link
[43167.220127] ata3: nv: skipping hardreset on occupied port
[43167.710060] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[43167.750228] ata3.00: model number mismatch 'WDC WD2002FAEX-007BA0' != 'C WD2002FAEX-007BA0 �'
[43167.750232] ata3.00: revalidation failed (errno=-19)
[43167.750236] ata3.00: disabled
[43172.710100] ata3: hard resetting link
[43173.620110] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[43173.640455] ata3.00: failed to IDENTIFY (INIT_DEV_PARAMS failed, err_mask=0x80)
[43178.620116] ata3: hard resetting link
[43179.530113] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[43179.550748] ata3.00: ATA-8: WDC WD2002FAEX-007BA0, 05.01D05, max UDMA/133
[43179.550753] ata3.00: 3907029168 sectors, multi 16: LBA48 NCQ (depth 31/32)
[43179.570208] ata3.00: model number mismatch 'WDC WD2002FAEX-007BA0' != 'C WD2002FAEX-007BA0 �'
[43179.570213] ata3.00: revalidation failed (errno=-19)
[43179.570220] ata3: limiting SATA link speed to 1.5 Gbps
[43179.570224] ata3.00: limiting speed to UDMA/133:PIO3
[43184.530066] ata3: hard resetting link
[43184.530070] ata3: nv: skipping hardreset on occupied port
[43185.020091] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[43185.060949] ata3.00: configured for UDMA/133
[43185.060969] sd 2:0:0:0: [sdd] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[43185.060974] sd 2:0:0:0: [sdd] Sense Key : Aborted Command [current] [descriptor]
[43185.060980] Descriptor sense data with sense descriptors (in hex):
[43185.060983] 72 0b 47 00 00 00 00 0c 00 0a 80 00 00 00 00 00
[43185.060995] 00 00 03 a8
[43185.061000] sd 2:0:0:0: [sdd] Add. Sense: Scsi parity error
[43185.061006] sd 2:0:0:0: [sdd] CDB: Read(10): 28 00 00 00 03 a8 00 00 08 00
[43185.061017] end_request: I/O error, dev sdd, sector 936
[43185.061023] Buffer I/O error on device sdd, logical block 117
[43185.061044] sd 2:0:0:0: rejecting I/O to offline device
[43185.061048] sd 2:0:0:0: killing request
[43185.061062] ata3: EH complete
[43185.061075] sd 2:0:0:0: rejecting I/O to offline device
[43185.061123] sd 2:0:0:0: rejecting I/O to offline device
[43185.061134] sd 2:0:0:0: rejecting I/O to offline device
[43185.061140] sd 2:0:0:0: rejecting I/O to offline device
[43185.061145] sd 2:0:0:0: [sdd] READ CAPACITY(16) failed
[43185.061147] sd 2:0:0:0: [sdd] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[43185.061152] sd 2:0:0:0: [sdd] Sense not available.
[43185.061155] sd 2:0:0:0: rejecting I/O to offline device
[43185.061166] sd 2:0:0:0: rejecting I/O to offline device
[43185.061175] sd 2:0:0:0: rejecting I/O to offline device
[43185.061185] sd 2:0:0:0: rejecting I/O to offline device
[43185.061193] sd 2:0:0:0: rejecting I/O to offline device
[43185.061198] sd 2:0:0:0: [sdd] READ CAPACITY failed
[43185.061202] sd 2:0:0:0: rejecting I/O to offline device
[43185.061209] sd 2:0:0:0: [sdd] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[43185.061215] sd 2:0:0:0: [sdd] Sense not available.
[43185.061226] sd 2:0:0:0: rejecting I/O to offline device
[43185.061235] sd 2:0:0:0: rejecting I/O to offline device
[43185.061245] sd 2:0:0:0: rejecting I/O to offline device
[43185.061254] sd 2:0:0:0: rejecting I/O to offline device
[43185.061263] sd 2:0:0:0: rejecting I/O to offline device
[43185.061274] sd 2:0:0:0: rejecting I/O to offline device
[43185.061280] sd 2:0:0:0: [sdd] Asking for cache data failed
[43185.061283] sd 2:0:0:0: [sdd] Assuming drive cache: write through
[43185.061289] sdd: detected capacity change from 2000398934016 to 0
[43185.061610] ata3.00: detaching (SCSI 2:0:0:0)
[43185.062444] sd 2:0:0:0: [sdd] Stopping disk
[43249.120042] ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[43249.120046] ata4.00: failed command: FLUSH CACHE EXT
[43249.120051] ata4.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
[43249.120052] res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[43249.120054] ata4.00: status: { DRDY }
[43249.120059] ata4: hard resetting link
[43249.120060] ata4: nv: skipping hardreset on occupied port
[43249.610042] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[43249.650323] ata4.00: configured for UDMA/133
[43249.650326] ata4.00: retrying FLUSH 0xea Emask 0x4
[43249.650452] ata4.00: device reported invalid CHS sector 0
[43249.650458] ata4: EH complete
Answer 1
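You do have a question. I think (if I understand correctly) it is: what is the process for determining what is causing this failure?
I'm a cybersecurity engineer, so understand that I cringe as I type this: rule out encryption as the cause. Decrypt the drives and see whether the problem persists. The downside is that you would need to run them unencrypted for a few months.
The cables are an easy test (and you should start there). Replace them, though I have a hard time believing they are the problem unless you have neon lights in your case.
That leaves the motherboard, if it's not the other two...
If anyone disagrees with my troubleshooting approach, I'm sure they'll chime in. Replacing cables is cheap, while temporarily disabling encryption is a security risk that only you can decide whether you're willing to accept.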
Answer 2
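It looks like you are getting a lot of errors on your SATA link. As a result, the host cannot reliably get commands across the link, and even when it does, the data that comes back is sometimes corrupted.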
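You can see this in the messages where the speed gets limited or the expected drive identifier isn't received. You'll also see confusing messages from different layers of the driver that don't necessarily reflect what is happening at the SATA hardware level. For example, "limiting speed to UDMA/133:PIO3" strictly applies only to parallel ATA drives (it just means the driver is trying a slower interface speed to see whether the errors clear), yet the error messages make it clear that the lowest level, the one actually handling the hardware, knows it is talking to a SATA drive.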
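To see at a glance which ports are affected and which error flags keep coming up, you could tally the messages per ATA port. Below is a rough Python sketch of that idea, not a definitive tool. The sample lines are copied from the log in the question; the function name `summarize` and the event labels are my own, and the regular expressions only assume the message layout shown in that log.

```python
import re
from collections import Counter, defaultdict

# Matches lines such as "[43161.734155] ata3: hard resetting link"
# and "[43161.734150] ata3.00: error: { ICRC ABRT }".
LINE_RE = re.compile(r"\]\s*(ata\d+)(?:\.\d+)?:\s*(.*)$")
# Matches the flag lists the kernel prints, e.g. "SError: { 10B8B Dispar }".
FLAGS_RE = re.compile(r"^(SError|error):\s*\{\s*(.*?)\s*\}")


def summarize(dmesg_text):
    """Return {port: Counter} of link resets and error flags per ATA port."""
    summary = defaultdict(Counter)
    for line in dmesg_text.splitlines():
        m = LINE_RE.search(line)
        if not m:
            continue
        port, msg = m.groups()
        if msg.startswith("hard resetting link"):
            summary[port]["hard reset"] += 1
        elif msg.startswith("revalidation failed"):
            summary[port]["revalidation failed"] += 1
        else:
            flags = FLAGS_RE.match(msg)
            if flags:
                # Count each flag (ICRC, ABRT, 10B8B, ...) separately.
                for flag in flags.group(2).split():
                    summary[port][flag] += 1
    return summary


# A few lines taken from the dmesg output in the question.
SAMPLE = """\
[43161.734130] ata3: SError: { 10B8B Dispar }
[43161.734150] ata3.00: error: { ICRC ABRT }
[43161.734155] ata3: hard resetting link
[43162.260206] ata3.00: revalidation failed (errno=-19)
[43167.220123] ata3: hard resetting link
[43249.120059] ata4: hard resetting link
"""

if __name__ == "__main__":
    for port, counts in sorted(summarize(SAMPLE).items()):
        print(port, dict(counts))
```

Replace SAMPLE with your full dmesg text to get the real picture. Flags like ICRC, 10B8B and Dispar are reported for the link itself, which is why I'm looking at the cabling rather than the drives.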
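You're right to suspect the SATA cables. Try replacing them, and make sure they are rated for SATA 3.0 Gb/s (also called "SATA 2" or "SATA II"). I don't think your drives are the problem. Why would the errors take months to show up after you replace a drive? Maybe the cables work loose somehow, and replacing the drive reseats them. Or maybe it's just coincidence.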
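One way to check the cable theory before and after swapping is to watch the interface CRC error counter that many drives expose as SMART attribute 199 (usually named UDMA_CRC_Error_Count). If it keeps climbing, that generally points at the link rather than the disk itself. The Python sketch below parses `smartctl -A` output for that attribute. It assumes smartmontools is installed, that your drives report attribute 199 at all (not every model does), and that the usual ten-column attribute table is printed. The function names are mine, and the ILLUSTRATIVE row is made-up data, not taken from your drives.

```python
import subprocess


def crc_error_count(smartctl_output):
    """Return the raw value of SMART attribute 199, or None if absent."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows have 10 columns: the ID comes first, the raw value last.
        if len(fields) >= 10 and fields[0] == "199":
            return int(fields[9])
    return None


def read_crc_errors(device):
    """Run `smartctl -A` on a device such as /dev/sdd (needs root)."""
    result = subprocess.run(
        ["smartctl", "-A", device], capture_output=True, text=True
    )
    return crc_error_count(result.stdout)


# Made-up example row, for illustration only (hypothetical raw value of 12).
ILLUSTRATIVE = (
    "199 UDMA_CRC_Error_Count    0x0032   200   200   000"
    "    Old_age   Always       -       12"
)

if __name__ == "__main__":
    print(crc_error_count(ILLUSTRATIVE))
```

Note the count for each drive, swap the cables, and compare again after a few weeks. The counter normally never resets, so what matters is whether it is still increasing.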