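I have a SATA hard drive connected to a SAS card through the backplane of an Intel server. The drive seems to be accessible from Linux without trouble, but I noticed some odd errors in the logs. I want to find out whether these errors relate to a startup/initialization problem or to something else, so I want to run a SMART test.
The device reports "SMART overall-health self-assessment test result: PASSED", but I would like to run some SMART tests myself. I'm not sure why this fails, and my Google-fu has let me down.
Can anyone explain what the following means and whether I can work around it, preferably without taking the drive offline: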
=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in captive mode".
Command "Execute SMART Short self-test routine immediately in captive mode" failed: Connection timed out
(This is the response to the command "smartctl -t short -C /dev/sdd".)
Answer 1
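"Captive" mode does not seem to be supported (at least on Linux?), and unfortunately nothing I looked at mentions this.
I ran into the same problem. I had assumed that a "captive" foreground test would get full priority and all available bandwidth and would therefore finish faster, but that does not seem to be the case. So the smartctl man page is misleading.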
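As part of a captive self-test, the smartctl process waits for the drive to finish and return. However, the SATA subsystem treats this outstanding command as a hung drive and aborts it after the number of seconds set in /sys/block/<blockdev>/device/timeout.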
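To see which timeout applies to a given drive, the sysfs file named above can be read directly. The following is a minimal Python sketch and not part of the original answer; the function name read_scsi_timeout and the sysfs_root parameter are made up for illustration (the latter exists only so the function can be pointed at a test directory).

```python
from pathlib import Path


def read_scsi_timeout(blockdev, sysfs_root="/sys"):
    """Return the command timeout in seconds for a block device name such as 'sdd'.

    Reads <sysfs_root>/block/<blockdev>/device/timeout, the file the kernel
    consults before it treats an outstanding command as hung.
    """
    path = Path(sysfs_root) / "block" / blockdev / "device" / "timeout"
    return int(path.read_text().strip())
```

Calling read_scsi_timeout("sdd") on the affected machine would show how long a captive test is allowed to run before the command is aborted.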
dmesg then records the drive reset (in my case the drive hangs off an Adaptec controller):
[May 7 17:28] aacraid: Host adapter abort request.
aacraid: Outstanding commands on (0,1,3,0):
[ +28.668009] aacraid: Host adapter abort request.
aacraid: Outstanding commands on (0,1,3,0):
[ +0.024081] aacraid: Host bus reset request. SCSI hang ?
[ +0.000006] aacraid 0000:06:00.0: outstanding cmd: midlevel-0
[ +0.000002] aacraid 0000:06:00.0: outstanding cmd: lowlevel-0
[ +0.000001] aacraid 0000:06:00.0: outstanding cmd: error handler-1
[ +0.000001] aacraid 0000:06:00.0: outstanding cmd: firmware-0
[ +0.000001] aacraid 0000:06:00.0: outstanding cmd: kernel-0
[ +0.019997] aacraid 0000:06:00.0: Controller reset type is 3
[ +0.000004] aacraid 0000:06:00.0: Issuing IOP reset
[May 7 17:29] aacraid 0000:06:00.0: IOP reset succeeded
[ +0.033805] aacraid: Comm Interface type2 enabled
[ +2.217498] udevd[558]: worker [9103] /devices/pci0000:00/0000:00:0c.0/0000:06:00.0/host0/target0:1:3/0:1:3:0/block/sdd is taking a long time
[ +6.814903] aacraid 0000:06:00.0: Scheduling bus rescan
[ +10.192816] sd 0:1:3:0: [sdd] tag#543 timing out command, waited 60s
[ +0.000007] sd 0:1:3:0: [sdd] tag#543 FAILED Result: hostbyte=DID_RESET driverbyte=DRIVER_OK cmd_age=109s
[ +0.000003] sd 0:1:3:0: [sdd] tag#543 CDB: ATA command pass through(16) 85 06 0c 00 d4 00 00 00 81 00 4f 00 c2 00 b0 00
[ +0.001052] sd 0:1:3:0: [sdd] 11721045168 512-byte logical blocks: (6.00 TB/5.46 TiB)
[ +0.000005] sd 0:1:3:0: [sdd] 4096-byte physical blocks
[ +0.003122] sdd: sdd1 sdd2
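The CDB in the "timing out command" line above can be decoded to confirm that the aborted command really is the captive self-test. The following Python sketch is an illustration only: the function name decode_ata_passthrough16 and the small lookup table are made up here, and the byte offsets assume the standard ATA PASS-THROUGH (16) layout (features in byte 4, LBA low/mid/high in bytes 8/10/12, ATA command in byte 14).

```python
# Subcommand values for SMART EXECUTE OFF-LINE IMMEDIATE (features = 0xD4).
# Only the self-test entries relevant here are listed.
SMART_SUBCOMMANDS = {
    0x01: "short self-test, off-line (background) mode",
    0x02: "extended self-test, off-line (background) mode",
    0x81: "short self-test, captive mode",
    0x82: "extended self-test, captive mode",
}


def decode_ata_passthrough16(cdb_hex):
    """Decode the fields of an ATA PASS-THROUGH (16) CDB given as hex bytes."""
    cdb = bytes.fromhex(cdb_hex)  # whitespace between bytes is skipped
    if len(cdb) != 16 or cdb[0] != 0x85:
        raise ValueError("not an ATA PASS-THROUGH (16) CDB")
    fields = {
        "features": cdb[4],
        "lba_low": cdb[8],
        "lba_mid": cdb[10],
        "lba_high": cdb[12],
        "command": cdb[14],
    }
    # SMART commands use ATA command 0xB0 with the signature 0x4F/0xC2
    # in LBA mid/high.
    is_smart = (
        fields["command"] == 0xB0
        and fields["lba_mid"] == 0x4F
        and fields["lba_high"] == 0xC2
    )
    if is_smart and fields["features"] == 0xD4:
        sub = SMART_SUBCOMMANDS.get(fields["lba_low"], "unknown subcommand")
        fields["meaning"] = "SMART EXECUTE OFF-LINE IMMEDIATE: " + sub
    else:
        fields["meaning"] = "not decoded by this sketch"
    return fields


print(decode_ata_passthrough16(
    "85 06 0c 00 d4 00 00 00 81 00 4f 00 c2 00 b0 00")["meaning"])
```

For the CDB from the log, the LBA low byte is 0x81, which is the captive short self-test, matching the smartctl -t short -C invocation.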
The drive then logs the failed self-test:
SMART Extended Self-test Log Version: 1 (1 sectors)
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short captive Interrupted (host reset) 50% 196 -
The smartmontools ticket describing this problem has been marked "wontfix": https://www.smartmontools.org/ticket/1153
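I don't think raising the block device timeout to cover the length of a self-test is a real solution, so I guess we cannot run captive tests. (It might be different for native SCSI drives?)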
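One alternative is to drop -C so that smartctl starts the test in its default background (off-line) mode and returns immediately, then read the result later with smartctl -l selftest. The following Python sketch only wraps those two smartctl invocations; the helper names are made up for illustration, and whether the background test completes behind this particular controller is an assumption that the answer above does not confirm.

```python
import subprocess


def selftest_command(device, test="short", captive=False):
    """Build the smartctl argument list for starting a self-test."""
    cmd = ["smartctl", "-t", test]
    if captive:
        cmd.append("-C")  # captive (foreground) mode, the case that times out above
    cmd.append(device)
    return cmd


def start_background_selftest(device, test="short"):
    """Start a self-test without -C; smartctl returns while the drive keeps testing."""
    return subprocess.run(selftest_command(device, test),
                          capture_output=True, text=True, check=False)


def read_selftest_log(device):
    """Return the output of 'smartctl -l selftest <device>' as text."""
    result = subprocess.run(["smartctl", "-l", "selftest", device],
                            capture_output=True, text=True, check=False)
    return result.stdout
```

For example, start_background_selftest("/dev/sdd") followed a few minutes later by print(read_selftest_log("/dev/sdd")) would show whether the newest log entry completed or was interrupted. Both calls normally need root privileges.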