mdadm marks drives as faulty even though they are healthy?

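I have configured my custom-built NAS to spin down its drives after 20 minutes of idle time.

I just checked /proc/mdstat and found one drive marked as failed, yet SMART reports that drive as healthy. So I suspect md-raid decided the spin-up was taking too long and marked the drive as faulty.

Re-adding the drive and rebuilding the array did not seem to cause any problems either.

dmesg shows the following interesting lines, for which I could not find much on Google.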

[97144.228682] sd 0:0:2:0: attempting task abort! scmd(ffff97f7b14ce948)
[97144.228688] sd 0:0:2:0: [sdc] tag#0 CDB: opcode=0x12 12 00 00 00 24 00
[97144.228692] scsi target0:0:2: handle(0x000c), sas_address(0x5001438020b9ee12), phy(18)
[97144.228694] scsi target0:0:2: enclosure_logical_id(0x5001438020b9ee25), slot(49)
[97148.184253] sd 0:0:2:0: task abort: SUCCESS scmd(ffff97f7b14ce948)
[97148.235864] mpt2sas_cm0: log_info(0x31110101): originator(PL), code(0x11), sub_code(0x0101)
--- last message repeated a couple dozen times ---
[97148.490304] sd 0:0:2:0: [sdc] tag#16 UNKNOWN(0x2003) Result: hostbyte=0x0b driverbyte=0x00
[97148.490308] mpt2sas_cm0: log_info(0x31110101): originator(PL), code(0x11), sub_code(0x0101)
[97148.490310] sd 0:0:2:0: [sdc] tag#13 UNKNOWN(0x2003) Result: hostbyte=0x0b driverbyte=0x00
[97148.490315] sd 0:0:2:0: [sdc] tag#13 CDB: opcode=0x88 88 00 00 00 00 00 0d 6e af f0 00 00 00 10 00 00
[97148.490317] mpt2sas_cm0: log_info(0x31110101): originator(PL), code(0x11), sub_code(0x0101)
[97148.490321] print_req_error: I/O error, dev sdc, sector 225357808
[97148.490326] mpt2sas_cm0: log_info(0x31110101): originator(PL), code(0x11), sub_code(0x0101)
[97148.490331] sd 0:0:2:0: [sdc] tag#16 CDB: opcode=0x88 88 00 00 00 00 00 0d 6e b0 18 00 00 00 20 00 00
[97148.490334] mpt2sas_cm0: log_info(0x31110101): originator(PL), code(0x11), sub_code(0x0101)
[97148.490337] print_req_error: I/O error, dev sdc, sector 225357848
[97148.490341] mpt2sas_cm0: log_info(0x31110101): originator(PL), code(0x11), sub_code(0x0101)
[97148.490354] mpt2sas_cm0: log_info(0x31110101): originator(PL), code(0x11), sub_code(0x0101)
[97148.490358] mpt2sas_cm0: log_info(0x31110101): originator(PL), code(0x11), sub_code(0x0101)
[97148.490366] mpt2sas_cm0: log_info(0x31110101): originator(PL), code(0x11), sub_code(0x0101)
[97148.490370] sd 0:0:2:0: [sdc] tag#15 UNKNOWN(0x2003) Result: hostbyte=0x0b driverbyte=0x00
[97148.490374] mpt2sas_cm0: log_info(0x31110101): originator(PL), code(0x11), sub_code(0x0101)
[97148.490378] sd 0:0:2:0: [sdc] tag#15 CDB: opcode=0x88 88 00 00 00 00 00 0d 6e ae 68 00 00 00 08 00 00
[97148.490380] print_req_error: I/O error, dev sdc, sector 225357416
[97148.490383] mpt2sas_cm0: log_info(0x31110101): originator(PL), code(0x11), sub_code(0x0101)
[97148.490392] mpt2sas_cm0: log_info(0x31110101): originator(PL), code(0x11), sub_code(0x0101)
[97148.490399] mpt2sas_cm0: log_info(0x31110101): originator(PL), code(0x11), sub_code(0x0101)
[97148.490403] sd 0:0:2:0: [sdc] tag#14 UNKNOWN(0x2003) Result: hostbyte=0x0b driverbyte=0x00
[97148.490407] sd 0:0:2:0: [sdc] tag#14 CDB: opcode=0x88 88 00 00 00 00 00 0d 6e ad 90 00 00 00 30 00 00
[97148.490409] print_req_error: I/O error, dev sdc, sector 225357200
[97148.490435] sd 0:0:2:0: [sdc] tag#11 UNKNOWN(0x2003) Result: hostbyte=0x0b driverbyte=0x00
[97148.490439] sd 0:0:2:0: [sdc] tag#11 CDB: opcode=0x88 88 00 00 00 00 00 0d 6e ad c8 00 00 00 58 00 00
[97148.490441] print_req_error: I/O error, dev sdc, sector 225357256
[97148.490450] sd 0:0:2:0: [sdc] tag#10 UNKNOWN(0x2003) Result: hostbyte=0x0b driverbyte=0x00
[97148.490454] sd 0:0:2:0: [sdc] tag#10 CDB: opcode=0x88 88 00 00 00 00 00 0d 6e ad 00 00 00 00 50 00 00
[97148.490456] print_req_error: I/O error, dev sdc, sector 225357056
[97148.490464] sd 0:0:2:0: [sdc] tag#9 UNKNOWN(0x2003) Result: hostbyte=0x0b driverbyte=0x00
[97148.490468] sd 0:0:2:0: [sdc] tag#9 CDB: opcode=0x35 35 00 00 00 00 00 00 00 00 00
[97148.490472] print_req_error: I/O error, dev sdc, sector 16
[97148.490474] md: super_written gets error=10
[97148.490477] md/raid:md0: Disk failure on sdc, disabling device.
               md/raid:md0: Operation continuing on 3 devices.
[97148.490496] sd 0:0:2:0: [sdc] tag#8 UNKNOWN(0x2003) Result: hostbyte=0x0b driverbyte=0x00
[97148.490500] sd 0:0:2:0: [sdc] tag#8 CDB: opcode=0x88 88 00 00 00 00 00 0d 6e b0 40 00 00 00 20 00 00
[97148.490502] print_req_error: I/O error, dev sdc, sector 225357888
[97148.490510] sd 0:0:2:0: [sdc] tag#7 UNKNOWN(0x2003) Result: hostbyte=0x0b driverbyte=0x00
[97148.490514] sd 0:0:2:0: [sdc] tag#7 CDB: opcode=0x88 88 00 00 00 00 00 0d 6e af b8 00 00 00 30 00 00
[97148.490516] print_req_error: I/O error, dev sdc, sector 225357752
[97148.490524] sd 0:0:2:0: [sdc] tag#6 UNKNOWN(0x2003) Result: hostbyte=0x0b driverbyte=0x00
[97148.490528] sd 0:0:2:0: [sdc] tag#6 CDB: opcode=0x88 88 00 00 00 00 00 0d 6e b0 00 00 00 00 08 00 00
[97148.490530] print_req_error: I/O error, dev sdc, sector 225357824

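Is it possible to increase a timeout value so that md-raid waits a few minutes for a drive to come online?
Is there any other way to prevent this in the future (other than keeping my drives spinning 24/7, since I also want to hibernate the machine from time to time)?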


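Update 2017-10-07

Updating the controller firmware (it is a Perc H310 cross-flashed to 9211-8i IT mode), updating the SAS expander firmware, and increasing the timeouts seem to have greatly reduced the frequency of the errors above, but they still occur, and in some cases md-raid still fails a drive.

I have decoded the SAS error code: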

Value           31110101h
Type:           30000000h       SAS
Origin:         01000000h       PL
Code:           00110000h       PL_LOGINFO_CODE_RESET See Sub-Codes below (PL_LOGINFO_SUB_CODE)
Sub Code:       00000100h       PL_LOGINFO_SUB_CODE_OPEN_FAILURE
SubSub Code:    00000001h       PL_LOGINFO_SUB_CODE_OPEN_FAILURE_NO_DEST_TIMEOUT

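I could not find any explanation of it online, apart from one short description (in an LSI PDF from 2009):

Failed to open connection with error Open Reject (No Destination). Retried for 50 milliseconds.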
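The field breakdown in the table above can be reproduced with a few bit masks. This is a minimal sketch that assumes the layout implied by that table (top nibble is the bus type, next nibble the originator, next byte the code, low 16 bits the sub-code); the names `LogInfo` and `decode_log_info` are made up for illustration, and only the `PL` originator label is confirmed by the log.

```python
from typing import NamedTuple


class LogInfo(NamedTuple):
    bus_type: int    # 0x3 in the table above (SAS)
    originator: int  # 0x1 is printed as "PL" by the driver
    code: int        # 0x11 in the log
    sub_code: int    # 0x0101 in the log


# Only the label seen in the log is listed; other values fall back to hex.
ORIGINATOR_NAMES = {1: "PL"}


def decode_log_info(value: int) -> LogInfo:
    """Split a 32-bit log_info value into the fields shown in the table."""
    return LogInfo(
        bus_type=(value >> 28) & 0xF,
        originator=(value >> 24) & 0xF,
        code=(value >> 16) & 0xFF,
        sub_code=value & 0xFFFF,
    )


def format_log_info(value: int) -> str:
    """Render the fields in the same style as the mpt2sas log line."""
    info = decode_log_info(value)
    origin = ORIGINATOR_NAMES.get(info.originator, hex(info.originator))
    return f"originator({origin}), code(0x{info.code:02x}), sub_code(0x{info.sub_code:04x})"


if __name__ == "__main__":
    print(format_log_info(0x31110101))
```

Splitting the sub-code further (0x0100 for the open failure, 0x0001 for the no-destination timeout) follows the same pattern with narrower masks.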

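After some further testing (spinning a drive down with hdparm -y ... and then issuing a simple command such as hddtemp ... to spin it up and provoke the problem), I found that the timeout is slightly over 11 seconds. That is odd, because the only remaining timeout setting with a value of 10 is the generic I/O timeout for "sequential", "removable" and "unknown" devices.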


Update 2017-10-08

Here is the topology of my setup:

Dell Perc H310 (LSISAS2008: FWVersion(20.00.07.00), ChipRevision(0x03), BiosVersion(07.39.02.00)) (flashed to 9211-8i IT-mode)
    `- HP SAS Expander card (FW 2.10)
        |- Hitachi HDS72404 } md0
        |- Hitachi HDS72404 } md0
        |- HGST HDN724040AL } md0
        |- HGST HDN724040AL } md0
        |- ST8000AS0002-1NA (btrfs)
        |- ST8000AS0002-1NA (btrfs)
        `- ST8000AS0002-1NA (xfs)

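The four Hitachi/HGST drives make up the md-raid array. The Seagate drives have nothing to do with md-raid, but they are affected by the root problem as well (although btrfs does not seem to mind much).

After several hours of research and experimentation, this is what I have done so far, none of which has helped much:

Running the following at boot to increase some mpt2sas timeouts: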

for f in /sys/block/sd?/device/timeout; do
        echo 90 > "$f"
done

for f in /sys/block/sd?/device/eh_timeout; do
        echo 90 > "$f"
done

for f in /sys/class/scsi_disk/*/manage_start_stop; do
        echo 1 > "$f"
done

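I have updated the firmware of my HBA and of the expander.

I have set every timeout in the HBA BIOS configuration utility to 90 seconds.

However, the timeout still occurs quite predictably 11 to 12 seconds after a drive starts waking from standby (spinning up). (I suspect the underlying timeout is 10 seconds, since that is the default for many timeouts, plus some additional delay.)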
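To rule out the possibility that the values written at boot were reset later, the sysfs attributes can be read back. This is a minimal sketch, assuming the same `/sys/block/sd?/device/timeout` and `eh_timeout` paths used in the boot script; the function names and the `target` parameter are hypothetical.

```python
from pathlib import Path


def read_scsi_timeouts(sys_block="/sys/block"):
    """Return {device: {"timeout": int|None, "eh_timeout": int|None}} for sd? devices."""
    result = {}
    for dev_dir in sorted(Path(sys_block).glob("sd?")):
        values = {}
        for name in ("timeout", "eh_timeout"):
            try:
                values[name] = int((dev_dir / "device" / name).read_text().strip())
            except (OSError, ValueError):
                values[name] = None  # attribute missing or unreadable
        result[dev_dir.name] = values
    return result


def below_target(timeouts, target=90):
    """List devices where any attribute is missing or lower than the target."""
    return sorted(
        dev
        for dev, values in timeouts.items()
        if any(v is None or v < target for v in values.values())
    )


if __name__ == "__main__":
    current = read_scsi_timeouts()
    for dev, values in current.items():
        print(dev, values)
    print("below target:", below_target(current))
```

If every device reports 90 for both attributes and the drive is still dropped after 11 to 12 seconds, the deciding timeout is not one of these two.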


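Update 2017-10-10

I have now written a script that continuously scans dmesg for dropped md devices and automatically issues the recovery command mdadm --manage /dev/md0 --re-add /dev/sdx. With a write-intent bitmap, recovery now takes only a few seconds instead of a day. But this cannot be the right way to solve the problem.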
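The script itself is not shown; the core of such a watcher could look roughly like the sketch below. It assumes the failure message has the form seen in the dmesg excerpt ("md/raid:md0: Disk failure on sdc, disabling device."), and the function names are invented for illustration. It only builds the mdadm command and does not run it.

```python
import re

# Matches e.g. "md/raid:md0: Disk failure on sdc, disabling device."
# The optional digits after "raid" are an assumption for other RAID levels.
FAILURE_RE = re.compile(r"md/raid\d*:(md\d+): Disk failure on (\w+), disabling device")


def find_failed_members(log_lines):
    """Return unique (array, member) pairs in order of first appearance."""
    seen = []
    for line in log_lines:
        match = FAILURE_RE.search(line)
        if match:
            pair = (match.group(1), match.group(2))
            if pair not in seen:
                seen.append(pair)
    return seen


def re_add_command(array, member):
    """Build the argument list for the re-add command quoted above."""
    return ["mdadm", "--manage", f"/dev/{array}", "--re-add", f"/dev/{member}"]


if __name__ == "__main__":
    sample = ["[97148.490477] md/raid:md0: Disk failure on sdc, disabling device."]
    for array, member in find_failed_members(sample):
        print(" ".join(re_add_command(array, member)))
```

The returned list can be passed to `subprocess.run` as root. Note that device names can change across re-enumeration, so a real watcher should confirm the member before re-adding it.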

I have also just written to Broadcom; perhaps they will be able to help.


Update 2017-10-11

I am now debugging the kernel to find out where the problem might be:

--drive put to standby with hdparm -y--
18:16:35 sd 0:0:1:0: [sdb] sd_open
18:16:35 sd 0:0:1:0: scsi_block_when_processing_errors: rtn: 1
18:16:35 sd 0:0:1:0: scsi_block_when_processing_errors: rtn: 1
18:16:35 sd 0:0:1:0: [sdb] tag#0 Send: scmd 0xffff989bc94ea548
18:16:35 sd 0:0:1:0: [sdb] tag#0 CDB: ATA command pass through(16) 85 06 20 00 00 00 00 00 00 00 00 00 00 40 e0 00
18:16:35 SCSI DEBUG: scsi_check_sense() scsi_check_sense 442 
18:16:35 SCSI DEBUG: scsi_check_sense() continuing default behaviour past line 484 
18:16:35 sd 0:0:1:0: [sdb] tag#0 Done: SUCCESS Result: hostbyte=DID_OK driverbyte=DRIVER_OK
18:16:35 sd 0:0:1:0: [sdb] tag#0 CDB: ATA command pass through(16) 85 06 20 00 00 00 00 00 00 00 00 00 00 40 e0 00
18:16:35 sd 0:0:1:0: [sdb] tag#0 Sense Key : Recovered Error [current] [descriptor] 
18:16:35 sd 0:0:1:0: [sdb] tag#0 Add. Sense: ATA pass through information available
18:16:35 sd 0:0:1:0: [sdb] tag#0 scsi host busy 1 failed 0
18:16:35 sd 0:0:1:0: Notifying upper driver of completion (result 8000002)
18:16:35 sd 0:0:1:0: [sdb] sd_release
18:16:35 sd 0:0:1:0: [sdb] sd_check_events
18:16:35 sd 0:0:1:0: scsi_block_when_processing_errors: rtn: 1
18:16:35 sd 0:0:1:0: tag#0 Send: scmd 0xffff989bc866e148
18:16:35 sd 0:0:1:0: tag#0 CDB: Test Unit Ready 00 00 00 00 00 00
18:16:35 SCSI DEBUG: scsi_check_sense() scsi_check_sense 442 
18:16:35 SCSI DEBUG: scsi_check_sense()=>SUCCESS [nasty midlayer TURs] 
18:16:35 sd 0:0:1:0: tag#0 Done: SUCCESS Result: hostbyte=DID_OK driverbyte=DRIVER_OK
18:16:35 sd 0:0:1:0: tag#0 CDB: Test Unit Ready 00 00 00 00 00 00
18:16:35 sd 0:0:1:0: tag#0 Sense Key : Unit Attention [current] 
18:16:35 sd 0:0:1:0: tag#0 Add. Sense: Power on, reset, or bus device reset occurred
18:16:35 sd 0:0:1:0: tag#0 scsi host busy 1 failed 0
18:16:35 sd 0:0:1:0: Notifying upper driver of completion (result 8000002)
18:16:35 sd 0:0:1:0: tag#0 Send: scmd 0xffff989bc866e148
18:16:35 sd 0:0:1:0: tag#0 CDB: Test Unit Ready 00 00 00 00 00 00
18:16:35 SCSI DEBUG: scsi_check_sense() scsi_check_sense 442 
18:16:35 SCSI DEBUG: scsi_check_sense()=>SUCCESS [nasty midlayer TURs] 
18:16:35 sd 0:0:1:0: tag#0 Done: SUCCESS Result: hostbyte=DID_OK driverbyte=DRIVER_OK
18:16:35 sd 0:0:1:0: tag#0 CDB: Test Unit Ready 00 00 00 00 00 00
18:16:35 sd 0:0:1:0: tag#0 Sense Key : Not Ready [current] 
18:16:35 sd 0:0:1:0: tag#0 Add. Sense: Logical unit not ready, initializing command required
18:16:35 sd 0:0:1:0: tag#0 scsi host busy 1 failed 0
18:16:35 sd 0:0:1:0: Notifying upper driver of completion (result 8000002)
--command executed on drive with hddtemp--
18:16:45 sd 0:0:1:0: [sdb] sd_open
18:16:45 sd 0:0:1:0: scsi_block_when_processing_errors: rtn: 1
18:16:45 sd 0:0:1:0: scsi_block_when_processing_errors: rtn: 1
18:16:45 sd 0:0:1:0: scsi_block_when_processing_errors: rtn: 1
18:16:45 sd 0:0:1:0: [sdb] tag#0 Send: scmd 0xffff989bc8669548
18:16:45 sd 0:0:1:0: [sdb] tag#0 CDB: Inquiry 12 00 00 00 24 00
18:16:45 sd 0:0:1:0: [sdb] tag#0 Done: SUCCESS Result: hostbyte=DID_OK driverbyte=DRIVER_OK
18:16:45 sd 0:0:1:0: [sdb] tag#0 CDB: Inquiry 12 00 00 00 24 00
18:16:45 sd 0:0:1:0: [sdb] tag#0 scsi host busy 1 failed 0
18:16:45 sd 0:0:1:0: Notifying upper driver of completion (result 0)
18:16:45 sd 0:0:1:0: scsi_block_when_processing_errors: rtn: 1
18:16:45 sd 0:0:1:0: [sdb] tag#0 Send: scmd 0xffff989bc8669548
18:16:45 sd 0:0:1:0: [sdb] tag#0 CDB: ATA command pass through(16) 85 08 2e 00 00 00 00 00 00 00 00 00 00 00 ec 00
18:16:45 SCSI DEBUG: scsi_check_sense() scsi_check_sense 442 
18:16:45 SCSI DEBUG: scsi_check_sense() continuing default behaviour past line 484 
18:16:45 sd 0:0:1:0: [sdb] tag#0 Done: SUCCESS Result: hostbyte=DID_OK driverbyte=DRIVER_OK
18:16:45 sd 0:0:1:0: [sdb] tag#0 CDB: ATA command pass through(16) 85 08 2e 00 00 00 00 00 00 00 00 00 00 00 ec 00
18:16:45 sd 0:0:1:0: [sdb] tag#0 Sense Key : Recovered Error [current] [descriptor] 
18:16:45 sd 0:0:1:0: [sdb] tag#0 Add. Sense: ATA pass through information available
18:16:45 sd 0:0:1:0: [sdb] tag#0 scsi host busy 1 failed 0
18:16:45 sd 0:0:1:0: Notifying upper driver of completion (result 8000002)
18:16:45 sd 0:0:1:0: scsi_block_when_processing_errors: rtn: 1
18:16:45 sd 0:0:1:0: [sdb] tag#0 Send: scmd 0xffff989bc8669548
18:16:45 sd 0:0:1:0: [sdb] tag#0 CDB: ATA command pass through(16) 85 08 2e 00 00 00 00 00 00 00 00 00 00 00 ec 00
18:16:45 SCSI DEBUG: scsi_check_sense() scsi_check_sense 442 
18:16:45 SCSI DEBUG: scsi_check_sense() continuing default behaviour past line 484 
18:16:45 sd 0:0:1:0: [sdb] tag#0 Done: SUCCESS Result: hostbyte=DID_OK driverbyte=DRIVER_OK
18:16:45 sd 0:0:1:0: [sdb] tag#0 CDB: ATA command pass through(16) 85 08 2e 00 00 00 00 00 00 00 00 00 00 00 ec 00
18:16:45 sd 0:0:1:0: [sdb] tag#0 Sense Key : Recovered Error [current] [descriptor] 
18:16:45 sd 0:0:1:0: [sdb] tag#0 Add. Sense: ATA pass through information available
18:16:45 sd 0:0:1:0: [sdb] tag#0 scsi host busy 1 failed 0
18:16:45 sd 0:0:1:0: Notifying upper driver of completion (result 8000002)
18:16:45 sd 0:0:1:0: scsi_block_when_processing_errors: rtn: 1
18:16:45 sd 0:0:1:0: scsi_block_when_processing_errors: rtn: 1
18:16:45 sd 0:0:1:0: scsi_block_when_processing_errors: rtn: 1
18:16:45 sd 0:0:1:0: [sdb] tag#0 Send: scmd 0xffff989bc8669548
18:16:45 sd 0:0:1:0: [sdb] tag#0 CDB: ATA command pass through(16) 85 06 20 00 d8 00 00 00 00 00 4f 00 c2 00 b0 00
18:16:53 sd 0:0:1:0: [sdb] tag#0 Done: TIMEOUT_ERROR Result: hostbyte=DID_OK driverbyte=DRIVER_OK
18:16:53 sd 0:0:1:0: [sdb] tag#0 CDB: ATA command pass through(16) 85 06 20 00 d8 00 00 00 00 00 4f 00 c2 00 b0 00
18:16:53 sd 0:0:1:0: [sdb] tag#0 scsi host busy 1 failed 0
18:16:53 sd 0:0:1:0: [sdb] tag#0 abort scheduled
18:16:53 sd 0:0:1:0: [sdb] tag#0 aborting command
18:16:53 sd 0:0:1:0: attempting task abort! scmd(ffff989bc8669548)
18:16:53 sd 0:0:1:0: [sdb] tag#0 CDB: ATA command pass through(16) 85 06 20 00 d8 00 00 00 00 00 4f 00 c2 00 b0 00
18:16:53 scsi target0:0:1: handle(0x000a), sas_address(0x5001438020b9ee10), phy(16)
18:16:53 scsi target0:0:1: enclosure_logical_id(0x5001438020b9ee25), slot(51)
18:16:57 sd 0:0:1:0: task abort: SUCCESS scmd(ffff989bc8669548)
18:16:57 sd 0:0:1:0: [sdb] tag#0 finish aborted command
18:16:57 sd 0:0:1:0: Notifying upper driver of completion (result 30000)
18:16:57 sd 0:0:1:0: [sdb] sd_release
18:16:57 sd 0:0:1:0: [sdb] sd_check_events
18:16:57 sd 0:0:1:0: scsi_block_when_processing_errors: rtn: 1
18:16:57 sd 0:0:1:0: tag#0 Send: scmd 0xffff989bd1de9148
18:16:57 sd 0:0:1:0: tag#0 CDB: Test Unit Ready 00 00 00 00 00 00
18:16:57 mpt2sas_cm0: log_info(0x31110101): originator(PL), code(0x11), sub_code(0x0101)
18:16:57 sd 0:0:1:0: tag#0 Done: NEEDS_RETRY Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
18:16:57 sd 0:0:1:0: tag#0 CDB: Test Unit Ready 00 00 00 00 00 00
18:16:57 sd 0:0:1:0: tag#0 scsi host busy 1 failed 0
18:16:57 sd 0:0:1:0: tag#0 Inserting command ffff989bd1de9148 into mlqueue
18:16:57 sd 0:0:1:0: unblocking device at zero depth
18:16:57 sd 0:0:1:0: tag#0 Send: scmd 0xffff989bd1de9148
18:16:58 mpt2sas_cm0: log_info(0x31110101): originator(PL), code(0x11), sub_code(0x0101)
18:16:57 sd 0:0:1:0: tag#0 CDB: Test Unit Ready 00 00 00 00 00 00
18:16:58 sd 0:0:1:0: tag#0 Done: NEEDS_RETRY Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
18:16:58 sd 0:0:1:0: tag#0 CDB: Test Unit Ready 00 00 00 00 00 00
18:16:58 sd 0:0:1:0: tag#0 scsi host busy 1 failed 0
18:16:58 sd 0:0:1:0: tag#0 Inserting command ffff989bd1de9148 into mlqueue
18:16:58 sd 0:0:1:0: unblocking device at zero depth
18:16:58 sd 0:0:1:0: tag#0 Send: scmd 0xffff989bd1de9148
18:16:58 sd 0:0:1:0: tag#0 CDB: Test Unit Ready 00 00 00 00 00 00
18:16:58 mpt2sas_cm0: log_info(0x31110101): originator(PL), code(0x11), sub_code(0x0101)
18:16:58 sd 0:0:1:0: tag#0 Done: NEEDS_RETRY Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
18:16:58 sd 0:0:1:0: tag#0 CDB: Test Unit Ready 00 00 00 00 00 00
18:16:58 sd 0:0:1:0: tag#0 scsi host busy 1 failed 0
18:16:58 sd 0:0:1:0: tag#0 Inserting command ffff989bd1de9148 into mlqueue
18:16:58 sd 0:0:1:0: unblocking device at zero depth
18:16:58 sd 0:0:1:0: tag#0 Send: scmd 0xffff989bd1de9148
18:16:58 sd 0:0:1:0: tag#0 CDB: Test Unit Ready 00 00 00 00 00 00
18:16:58 mpt2sas_cm0: log_info(0x31110101): originator(PL), code(0x11), sub_code(0x0101)
18:16:58 sd 0:0:1:0: tag#0 Done: NEEDS_RETRY Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
18:16:58 sd 0:0:1:0: tag#0 CDB: Test Unit Ready 00 00 00 00 00 00
18:16:58 sd 0:0:1:0: tag#0 scsi host busy 1 failed 0
18:16:58 sd 0:0:1:0: tag#0 Inserting command ffff989bd1de9148 into mlqueue
18:16:58 sd 0:0:1:0: unblocking device at zero depth
18:16:58 sd 0:0:1:0: tag#0 Send: scmd 0xffff989bd1de9148
18:16:58 sd 0:0:1:0: tag#0 CDB: Test Unit Ready 00 00 00 00 00 00
18:16:58 mpt2sas_cm0: log_info(0x31110101): originator(PL), code(0x11), sub_code(0x0101)
18:16:58 sd 0:0:1:0: tag#0 Done: NEEDS_RETRY Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
18:16:58 sd 0:0:1:0: tag#0 CDB: Test Unit Ready 00 00 00 00 00 00
18:16:58 sd 0:0:1:0: tag#0 scsi host busy 1 failed 0
18:16:58 sd 0:0:1:0: tag#0 Inserting command ffff989bd1de9148 into mlqueue
18:16:58 sd 0:0:1:0: unblocking device at zero depth
18:16:58 sd 0:0:1:0: tag#0 Send: scmd 0xffff989bd1de9148
18:16:58 sd 0:0:1:0: tag#0 CDB: Test Unit Ready 00 00 00 00 00 00
18:16:58 mpt2sas_cm0: log_info(0x31110101): originator(PL), code(0x11), sub_code(0x0101)
18:16:58 sd 0:0:1:0: tag#0 Done: SUCCESS Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
18:16:58 sd 0:0:1:0: tag#0 CDB: Test Unit Ready 00 00 00 00 00 00
18:16:58 sd 0:0:1:0: tag#0 scsi host busy 1 failed 0
18:16:58 sd 0:0:1:0: Notifying upper driver of completion (result b0000)
18:16:58 sd 0:0:1:0: device_block, handle(0x000a)
18:16:59 sd 0:0:1:0: device_unblock and setting to running, handle(0x000a)

What I find particularly worrying is

18:16:53 sd 0:0:1:0: [sdb] tag#0 Done: TIMEOUT_ERROR Result: hostbyte=DID_OK driverbyte=DRIVER_OK

which leads directly to

18:16:53 sd 0:0:1:0: [sdb] tag#0 abort scheduled
18:16:53 sd 0:0:1:0: [sdb] tag#0 aborting command

I would like to know where this timeout is defined and how to change it.
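One way to narrow down which timeout fired is to measure how long each command was outstanding before it was reported as timed out. This is a rough sketch that assumes the debug-log format shown above (an `HH:MM:SS` stamp, then `sd <address>:`, an optional `[sdX]`, a `tag#N`, and either `Send:` or `Done: TIMEOUT_ERROR`); the function name is hypothetical, and the result is only accurate to about a second because of the timestamp resolution.

```python
import re
from datetime import datetime

LINE_RE = re.compile(
    r"^(\d{2}:\d{2}:\d{2}) sd (\S+): (?:\[\w+\] )?(tag#\d+) (Send:|Done: TIMEOUT_ERROR)"
)


def timeout_durations(lines):
    """Return (address, tag, seconds) for each command that ended in TIMEOUT_ERROR."""
    last_send = {}
    durations = []
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue
        stamp = datetime.strptime(match.group(1), "%H:%M:%S")
        key = (match.group(2), match.group(3))
        if match.group(4) == "Send:":
            last_send[key] = stamp  # remember the most recent send for this tag
        elif key in last_send:
            elapsed = int((stamp - last_send[key]).total_seconds())
            durations.append((key[0], key[1], elapsed))
    return durations


if __name__ == "__main__":
    excerpt = [
        "18:16:45 sd 0:0:1:0: [sdb] tag#0 Send: scmd 0xffff989bc8669548",
        "18:16:53 sd 0:0:1:0: [sdb] tag#0 Done: TIMEOUT_ERROR Result: hostbyte=DID_OK driverbyte=DRIVER_OK",
    ]
    print(timeout_durations(excerpt))
```

The sketch does not handle logs that cross midnight, and it pairs each timeout with the most recent send on the same tag, which is a simplification.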


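Update 2017-10-13

While debugging I have run into the following timeouts in practice:

  • 7 s
  • 15 s
  • 20 s
  • 90 s (as set in /sys/block/sd?/device/timeout)
  • 180 s (apparently twice the value set above)

Additional timeouts are defined in the kernel source: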

./include/linux/blkdev.h

#define BLK_DEFAULT_SG_TIMEOUT  (60 * HZ)
#define BLK_MIN_SG_TIMEOUT  (7 * HZ)

./include/scsi/scsi.h

#define FORMAT_UNIT_TIMEOUT     (2 * 60 * 60 * HZ)
#define START_STOP_TIMEOUT      (60 * HZ)
#define MOVE_MEDIUM_TIMEOUT     (5 * 60 * HZ)
#define READ_ELEMENT_STATUS_TIMEOUT (5 * 60 * HZ)
#define READ_DEFECT_DATA_TIMEOUT    (60 * HZ )

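These are applied in ./block/scsi_ioctl.c, in the functions sg_scsi_ioctl(...) and blk_fill_sghdr_rq(...).

This explains where the short 7-second timeout comes from (BLK_MIN_SG_TIMEOUT).

The 15-second and 20-second timeouts appear to come from sg_io_hdr*->timeout in blk_fill_sghdr_rq(...); I could not find where that value is set beforehand.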
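The way a requested SG_IO timeout ends up as 7, 15, 20 or 60 seconds can be modelled as below. This is only a sketch of the selection logic as I read it in blk_fill_sghdr_rq(...) in kernels of that era; the fallback order (header value, then the queue's sg_timeout, then the default, then the minimum clamp) is an assumption to check against your own source tree, and the Python names are invented.

```python
# Values from include/linux/blkdev.h quoted above, expressed in seconds.
BLK_DEFAULT_SG_TIMEOUT_S = 60
BLK_MIN_SG_TIMEOUT_S = 7


def effective_sg_timeout(hdr_timeout_ms, queue_sg_timeout_s=0):
    """Model which timeout (in seconds) a pass-through request would get.

    hdr_timeout_ms     -- sg_io_hdr.timeout as supplied by the caller (milliseconds)
    queue_sg_timeout_s -- per-queue SG timeout, 0 if unset (assumed fallback)
    """
    timeout = hdr_timeout_ms / 1000.0
    if timeout == 0:
        timeout = queue_sg_timeout_s        # caller gave none: use the queue value
    if timeout == 0:
        timeout = BLK_DEFAULT_SG_TIMEOUT_S  # still none: use the 60 s default
    return max(timeout, BLK_MIN_SG_TIMEOUT_S)  # never shorter than 7 s


if __name__ == "__main__":
    for requested_ms in (0, 5000, 15000, 20000):
        print(requested_ms, "ms ->", effective_sg_timeout(requested_ms), "s")
```

Under this model, a userspace tool that asks for less than 7000 ms is raised to 7 s, while requests of 15000 ms or 20000 ms pass through unchanged. That would match the observed 15 s and 20 s values if the calling tools set them, which is not confirmed here.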

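Answer 1

The drive almost certainly does have a fault.

You are looking for a complicated answer in timeouts and spin-up, when the reality is

[97148.490321] print_req_error: I/O error, dev sdc, sector 225357808

The controller could not read or write a specific sector on the drive. During spin-up, writes are normally accepted by the cache.

Usually this only shows up on drives that are genuinely faulty, regardless of what smartctl says.

Does replacing the drive make any difference?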
