I am getting a large number of messages like these in my syslog:
Mar 1 11:20:07 franklin kernel: [72947.878407] Waking error handler thread
Mar 1 11:20:07 franklin kernel: [72947.878415] Error handler scsi_eh_1 waking up
Mar 1 11:20:07 franklin kernel: [72947.878834] scsi_eh_1: flush finish cmd: ffff8806d5568980
Mar 1 11:20:07 franklin kernel: [72947.878871] scsi_restart_operations: waking up host to restart
Mar 1 11:20:07 franklin kernel: [72947.878888] Error handler scsi_eh_1 sleeping
Mar 1 11:20:07 franklin kernel: [72947.878922] scsi_block_when_processing_errors: rtn: 1
Mar 1 11:20:07 franklin kernel: [72947.883450] Waking error handler thread
Mar 1 11:20:07 franklin kernel: [72947.883462] Error handler scsi_eh_1 waking up
Mar 1 11:20:07 franklin kernel: [72947.883887] scsi_eh_1: flush finish cmd: ffff8806d57c0280
Mar 1 11:20:07 franklin kernel: [72947.883927] scsi_restart_operations: waking up host to restart
Mar 1 11:20:07 franklin kernel: [72947.883965] scsi_block_when_processing_errors: rtn: 1
Mar 1 11:20:07 franklin kernel: [72947.883979] Error handler scsi_eh_1 sleeping
Mar 1 11:20:07 franklin kernel: [72947.889556] Waking error handler thread
Mar 1 11:20:07 franklin kernel: [72947.889569] Error handler scsi_eh_1 waking up
Mar 1 11:20:07 franklin kernel: [72947.890015] scsi_eh_1: flush finish cmd: ffff8806d57c0280
Mar 1 11:20:07 franklin kernel: [72947.890052] scsi_restart_operations: waking up host to restart
Mar 1 11:20:07 franklin kernel: [72947.890070] Error handler scsi_eh_1 sleeping
Mar 1 11:20:07 franklin kernel: [72948.569299] mpt2sas1: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)
Mar 1 11:20:07 franklin kernel: [72948.569312] mpt2sas1: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)
Mar 1 11:20:07 franklin kernel: [72948.569323] mpt2sas1: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)
Mar 1 11:20:07 franklin kernel: [72948.569332] mpt2sas1: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)
Mar 1 11:20:07 franklin kernel: [72948.569342] mpt2sas1: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)
Mar 1 11:20:07 franklin kernel: [72948.569351] mpt2sas1: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)
Mar 1 11:20:07 franklin kernel: [72948.569360] mpt2sas1: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)
Mar 1 11:20:07 franklin kernel: [72948.569370] mpt2sas1: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)
Mar 1 11:20:07 franklin kernel: [72948.569379] mpt2sas1: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)
I have enabled extra logging with:
scsiloglev -w -e 7 -t 7 -s 7 -ml 0 -mc 0 -ll 7 -lc 7 -hl 0 -hc 0 -i 0
I have raised the SCSI timeout:
parallel echo 300 '>' {} ::: /sys/block/sd*[a-z]/device/timeout
and set TLER to 7 seconds:
parallel smartctl -l scterc,70,70 {} ::: /dev/sd*[a-z]
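I have swapped the controller for an identical one (SAS2008), reseated all cables, swapped the external SAS cables, and reseated all disks. Reading the disks individually with "dd" works without problems, but when they are used in a RAID6 the disks frequently drop offline.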
# uname -a
Linux franklin 3.2.0-0.bpo.4-amd64 #1 SMP Debian 3.2.35-2~bpo60+1 x86_64 GNU/Linux
Is there anything else I should try before posting to LKML?
Answer 1
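mpt2sas log messages with this log_info usually indicate a problem somewhere on the SAS path, i.e. a bad cable or connector. If you have different cables, a different enclosure, or even spare disks to test with, trying them would be a good idea. I have occasionally seen these errors come from a bad disk. You can try to pin down the cause by looking at the invalid_dword files in the /sys/class/sas_phy/ hierarchy and mapping the affected phys to components. Note that the errors show up on the receiving end, so the faulty part will be the other side or the cable between them.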
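As a rough sketch of that check, the function below prints every phy whose invalid dword counter is non-zero. It assumes the counters are plain decimal files matching /sys/class/sas_phy/<phy>/invalid_dword* (on the kernels I have seen the file is called invalid_dword_count). The function name list_bad_phys and its optional directory argument are made up for this example.

```shell
#!/bin/sh
# Print "<phy> <count>" for every SAS phy with a non-zero invalid dword counter.
# Assumption: counters are readable as /sys/class/sas_phy/<phy>/invalid_dword*
# list_bad_phys is a hypothetical helper name, not part of any package.
list_bad_phys() {
    base="${1:-/sys/class/sas_phy}"
    for f in "$base"/*/invalid_dword*; do
        [ -r "$f" ] || continue          # unmatched glob or unreadable file
        count=$(cat "$f")
        case "$count" in
            ''|*[!0-9]*) continue ;;     # skip anything that is not a plain number
        esac
        if [ "$count" -gt 0 ]; then
            phy=$(basename "$(dirname "$f")")
            printf '%s %s\n' "$phy" "$count"
        fi
    done
}
```

Calling list_bad_phys with no argument reads the live sysfs tree. Running it twice a few minutes apart under RAID load shows which counters are still climbing, which is usually more telling than the absolute values.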