SATA disk failing, but with periodic errors?

I have a Seagate ST2000DM001 2TB Barracuda SATA3 disk that is producing errors like this:

[Tue Jun 14 10:02:06 2022] ata2.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen
[Tue Jun 14 10:02:06 2022] ata2.00: failed command: WRITE FPDMA QUEUED
[Tue Jun 14 10:02:06 2022] ata2.00: cmd 61/00:00:00:48:9f/02:00:b2:00:00/40 tag 0 ncq 262144 out
[Tue Jun 14 10:02:06 2022]          res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[Tue Jun 14 10:02:06 2022] ata2.00: status: { DRDY }
[Tue Jun 14 10:02:06 2022] ata2: hard resetting link
[Tue Jun 14 10:02:16 2022] ata2: softreset failed (1st FIS failed)
[Tue Jun 14 10:02:16 2022] ata2: hard resetting link
[Tue Jun 14 10:02:26 2022] ata2: softreset failed (1st FIS failed)
[Tue Jun 14 10:02:26 2022] ata2: hard resetting link
[Tue Jun 14 10:02:42 2022] ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[Tue Jun 14 10:02:42 2022] ata2.00: configured for UDMA/133
[Tue Jun 14 10:02:42 2022] ata2.00: device reported invalid CHS sector 0
[Tue Jun 14 10:02:42 2022] ata2: EH complete

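I tested the disk with a different cable and in a different machine, and the errors persist. This looks like a clear-cut case of a failing disk, but there is a twist. While running a very long mkfs.ext4 -c -c, grepping for the errors reveals a periodic pattern: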

[Mon Jun 13 10:47:02 2022] ata2.00: failed command: WRITE FPDMA QUEUED
[Mon Jun 13 11:51:08 2022] ata2.00: failed command: WRITE FPDMA QUEUED
[Mon Jun 13 12:55:14 2022] ata2.00: failed command: WRITE FPDMA QUEUED
[Mon Jun 13 14:01:21 2022] ata2.00: failed command: READ FPDMA QUEUED
[Mon Jun 13 15:08:27 2022] ata2.00: failed command: READ FPDMA QUEUED
[Mon Jun 13 16:15:33 2022] ata2.00: failed command: READ FPDMA QUEUED
[Mon Jun 13 17:22:39 2022] ata2.00: failed command: WRITE FPDMA QUEUED
[Mon Jun 13 18:29:43 2022] ata2.00: failed command: WRITE FPDMA QUEUED
[Mon Jun 13 19:36:49 2022] ata2.00: failed command: WRITE FPDMA QUEUED
[Mon Jun 13 20:43:55 2022] ata2.00: failed command: WRITE FPDMA QUEUED
[Mon Jun 13 21:50:02 2022] ata2.00: failed command: READ FPDMA QUEUED
[Mon Jun 13 22:57:08 2022] ata2.00: failed command: READ FPDMA QUEUED
[Tue Jun 14 00:04:14 2022] ata2.00: failed command: READ FPDMA QUEUED
[Tue Jun 14 01:11:17 2022] ata2.00: failed command: WRITE FPDMA QUEUED
[Tue Jun 14 02:15:24 2022] ata2.00: failed command: WRITE FPDMA QUEUED
[Tue Jun 14 03:19:30 2022] ata2.00: failed command: WRITE FPDMA QUEUED
[Tue Jun 14 04:26:36 2022] ata2.00: failed command: READ FPDMA QUEUED
[Tue Jun 14 05:33:42 2022] ata2.00: failed command: READ FPDMA QUEUED
[Tue Jun 14 06:40:48 2022] ata2.00: failed command: READ FPDMA QUEUED
[Tue Jun 14 07:47:54 2022] ata2.00: failed command: WRITE FPDMA QUEUED
[Tue Jun 14 08:55:00 2022] ata2.00: failed command: WRITE FPDMA QUEUED
[Tue Jun 14 10:02:06 2022] ata2.00: failed command: WRITE FPDMA QUEUED

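It happens almost exactly every 1 hour 7 minutes. I thought it might be related to smartd, but smartd is not running. So I am puzzled: what kind of hardware failure would cause errors with a period of 1 hour 7 minutes? Any ideas would be greatly appreciated.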

Regards,

Nicolas

Answer 1

That is almost exactly 4000 seconds, within the accuracy of a cheap oscillator.
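As a quick check of that figure, the intervals can be computed directly from the timestamps in the question. The following is only a minimal Python sketch: the helper names (parse_failures, intervals_seconds) are made up for illustration, and it assumes the bracketed human-readable timestamp format shown in the question and an English/C locale for the day and month names.

```python
import re
from datetime import datetime

# The 22 "failed command" lines quoted in the question.
LOG = """\
[Mon Jun 13 10:47:02 2022] ata2.00: failed command: WRITE FPDMA QUEUED
[Mon Jun 13 11:51:08 2022] ata2.00: failed command: WRITE FPDMA QUEUED
[Mon Jun 13 12:55:14 2022] ata2.00: failed command: WRITE FPDMA QUEUED
[Mon Jun 13 14:01:21 2022] ata2.00: failed command: READ FPDMA QUEUED
[Mon Jun 13 15:08:27 2022] ata2.00: failed command: READ FPDMA QUEUED
[Mon Jun 13 16:15:33 2022] ata2.00: failed command: READ FPDMA QUEUED
[Mon Jun 13 17:22:39 2022] ata2.00: failed command: WRITE FPDMA QUEUED
[Mon Jun 13 18:29:43 2022] ata2.00: failed command: WRITE FPDMA QUEUED
[Mon Jun 13 19:36:49 2022] ata2.00: failed command: WRITE FPDMA QUEUED
[Mon Jun 13 20:43:55 2022] ata2.00: failed command: WRITE FPDMA QUEUED
[Mon Jun 13 21:50:02 2022] ata2.00: failed command: READ FPDMA QUEUED
[Mon Jun 13 22:57:08 2022] ata2.00: failed command: READ FPDMA QUEUED
[Tue Jun 14 00:04:14 2022] ata2.00: failed command: READ FPDMA QUEUED
[Tue Jun 14 01:11:17 2022] ata2.00: failed command: WRITE FPDMA QUEUED
[Tue Jun 14 02:15:24 2022] ata2.00: failed command: WRITE FPDMA QUEUED
[Tue Jun 14 03:19:30 2022] ata2.00: failed command: WRITE FPDMA QUEUED
[Tue Jun 14 04:26:36 2022] ata2.00: failed command: READ FPDMA QUEUED
[Tue Jun 14 05:33:42 2022] ata2.00: failed command: READ FPDMA QUEUED
[Tue Jun 14 06:40:48 2022] ata2.00: failed command: READ FPDMA QUEUED
[Tue Jun 14 07:47:54 2022] ata2.00: failed command: WRITE FPDMA QUEUED
[Tue Jun 14 08:55:00 2022] ata2.00: failed command: WRITE FPDMA QUEUED
[Tue Jun 14 10:02:06 2022] ata2.00: failed command: WRITE FPDMA QUEUED
"""

# Captures the bracketed timestamp and the command direction (READ/WRITE).
FAILED = re.compile(r"^\[(.+?)\] .*failed command: (\w+) FPDMA QUEUED")


def parse_failures(text):
    """Return a list of (datetime, direction) for each 'failed command' line."""
    events = []
    for line in text.splitlines():
        m = FAILED.match(line)
        if m:
            # %a/%b are locale dependent; this assumes English names.
            when = datetime.strptime(m.group(1), "%a %b %d %H:%M:%S %Y")
            events.append((when, m.group(2)))
    return events


def intervals_seconds(events):
    """Return the gap in whole seconds between consecutive events."""
    times = [when for when, _ in events]
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]


events = parse_failures(LOG)
gaps = intervals_seconds(events)
print(len(events), "failures,", len(gaps), "intervals")
print("min/max gap (s):", min(gaps), max(gaps))
print("mean gap (s):", round(sum(gaps) / len(gaps), 1))
```

On the lines quoted in the question this gives 21 intervals between 3846 s and 4026 s, with a mean of about 3986 s, so the period sits close to 4000 seconds on average while individual gaps vary by a few minutes.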

This suggests that something in the firmware of the SATA drive or the SATA controller is probably doing this automatically.

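Basically, the cause could be anything. For example, the drive firmware might reset every 4000 seconds because some component-check subroutine fails. Or the SATA controller firmware might reset every 4000 seconds because it tries to renegotiate the link and fails. Or it could be something else entirely (neither of these two examples is more likely than any other explanation).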

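The only thing the timing tells you is that software decided to do this, whether that software runs as the operating system, the controller firmware, or the drive firmware. It could be a software bug, or it could be a genuine detection of a hardware fault.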

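So this is hard to diagnose. If both the controller and the drive are already on the latest firmware version (fwupdmgr get-updates is your friend for both), then, well, there is little else to try.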
