系统描述:
CentOS 7。SuperMicro
X9DRW-iF 主板。
最新 BIOS(和 SCU 选项 ROM),使用 SCU 选项 ROM。
四个 300GB WD HDD 连接到主板的 SCU。
六个 HDD 连接到六个单独的 SATA 端口。
我们业务的软件正常运行。
我设置了这个系统,两周内有两次,我发现连接到 SCU 的所有四个驱动器都“消失了”,这意味着 ls -l 它们的挂载点显示
ls: cannot access disk1: Input/output error
ls: cannot access disk2: Input/output error
ls: cannot access disk3: Input/output error
ls: cannot access disk4: Input/output error
我从 dmesg 看到 isci 正在放弃命令,尝试并无法重置设备等(dmesg 输出在底部。)
从RedHat 的控制 SCSI 命令计时器和设备状态页面,我认为当 SCSI 命令超时且设备无法重置时,这是“预期的”错误处理事件序列。但是,我不确定是总线还是主机被重置了。但它会影响 SCU 上的所有四个驱动器。
我读这个 ServerFault 问题那我认为将此问题标识为“循环重置”,但我无法找到有关循环重置的更多信息。
我已验证这四个驱动器的 SCSI 命令超时为 30 秒,这是 RHEL/CentOS 7 的默认超时时间。(cat /sys/block/sda/device/timeout)将超时时间降低到 1 秒或 0 秒可能会很快导致一驱动器被禁用,但我很少甚至从未遇到过由于一个驱动器故障而导致 SCU 上的所有四个驱动器被禁用的情况。
以下是我的一些问题。最终目标是配置驱动器,以便不会发生这种循环重置或“删除所有 SCU 驱动器”。
- 是什么原因导致 SCSI 命令耗时超过 30 秒?这似乎太长了。
- 什么会导致 SCU 上的所有四个驱动器都受到影响?根据 RedHat 文章,这意味着设备重置失败,并且错误处理程序继续重置总线/主机。什么会导致驱动器重置失败?
- 什么样的事情可能会使这种情况很少发生,大约每周一次,就像我观察到的那样?
- SCSI 命令超时是否可以接受,还是可以通过增加超时时间来避免?对我来说最重要的是保持驱动器在线;等待驱动器响应一段时间并不是什么大问题。
- 我是否正确地将其诊断为对 SCSI 命令超时的“预期”响应?(即,这是 SCSI 命令超时吗?)
- 什么会导致 SCU 上的驱动器比连接到各个 SATA 端口的驱动器响应更快/更慢?我没有遇到过问题。
- 命令排队与此有什么关系吗?所有四个驱动器的队列深度均为 31。
- SCU 上驱动器的 SCSI 命令超时建议值是多少?
- 最终,我该如何重新配置以避免此问题?
感谢大家的帮助。第一次故障的 Dmesg 输出(为简洁起见进行了编辑)复制如下。请注意,驱动器是 sda、b、c、d 和 ata7、8、9、10。
17:40:41: isci 0000:08:00.0: isci_task_abort_task: dev = ffff881fce964490 (STP/SATA), task = ffff881fcfc95900, old_request == ffff881fcea13000
17:40:51: isci 0000:08:00.0: isci_remote_device_terminate_requests host0 timeout single
17:41:01: isci 0000:08:00.0: isci_remote_device_wait_for_resume_from_abort: #### Timeout waiting for resume: ffff881fce964490
17:41:01: isci 0000:08:00.0: isci_task_abort_task: SATA/STP request or complete_in_target (0), or IDEV_GONE (0), thus no TMF
17:41:01: isci 0000:08:00.0: isci_task_abort_task: Done; dev = ffff881fce964490, task = ffff881fcfc95900 , old_request == ffff881fcea13000
17:41:01: isci 0000:08:00.0: isci_task_abort_task: dev = ffff881fce964310 (STP/SATA), task = ffff881fd0ce97c0, old_request == ffff881fcf125000
17:41:11: isci 0000:08:00.0: isci_remote_device_terminate_requests host0 timeout single
17:41:21: isci 0000:08:00.0: isci_remote_device_wait_for_resume_from_abort: #### Timeout waiting for resume: ffff881fce964310
17:41:21: isci 0000:08:00.0: isci_task_abort_task: SATA/STP request or complete_in_target (0), or IDEV_GONE (0), thus no TMF
17:41:21: isci 0000:08:00.0: isci_task_abort_task: Done; dev = ffff881fce964310, task = ffff881fd0ce97c0 , old_request == ffff881fcf125000
17:43:50: INFO: task app:20227 blocked for more than 120 seconds.
17:43:50: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
17:43:50: app D ffff881fffcd3680 0 20227 13315 0x00000080
17:43:50: ffff883fcb15f8e0 0000000000000082 ffff883fcb15ffd8 0000000000013680
17:43:50: ffff883fcb15ffd8 0000000000013680 ffff881fd1434fa0 ffff883fc03d13b0
17:43:50: 7fffffffffffffff ffff881fd1434fa0 0000000000000002 0000000000000000
17:43:50: Call Trace:
17:43:50: [<ffffffff81609e39>] schedule+0x29/0x70
17:43:50: [<ffffffff81607d79>] schedule_timeout+0x209/0x2d0
17:43:50: [<ffffffff81609312>] __down_common+0xd2/0x14a
17:43:50: [<ffffffffa02292dd>] ? _xfs_buf_find+0x16d/0x2c0 [xfs]
17:43:50: [<ffffffff816093a7>] __down+0x1d/0x1f
17:43:50: [<ffffffff8109d311>] down+0x41/0x50
17:43:50: [<ffffffffa02290dc>] xfs_buf_lock+0x3c/0xd0 [xfs]
17:43:50: [<ffffffffa02292dd>] _xfs_buf_find+0x16d/0x2c0 [xfs]
17:43:50: [<ffffffffa022945a>] xfs_buf_get_map+0x2a/0x180 [xfs]
17:43:50: [<ffffffffa0229eec>] xfs_buf_read_map+0x2c/0x140 [xfs]
17:43:50: [<ffffffffa028d809>] xfs_trans_read_buf_map+0x2d9/0x4a0 [xfs]
17:43:50: [<ffffffffa02723cd>] xfs_read_agi+0x9d/0x110 [xfs]
17:43:50: [<ffffffffa0272474>] xfs_ialloc_read_agi+0x34/0xd0 [xfs]
17:43:50: [<ffffffffa0273138>] xfs_dialloc+0xe8/0x270 [xfs]
17:43:50: [<ffffffffa02757c1>] xfs_ialloc+0x71/0x6a0 [xfs]
17:43:50: [<ffffffffa0242b77>] ? kmem_zone_alloc+0x77/0x100 [xfs]
17:43:50: [<ffffffffa0275e6a>] xfs_dir_ialloc+0x7a/0x280 [xfs]
17:43:50: [<ffffffff81609222>] ? down_write+0x12/0x30
17:43:50: [<ffffffffa027656a>] xfs_create+0x48a/0x680 [xfs]
17:43:50: [<ffffffffa0237abb>] xfs_vn_mknod+0xbb/0x1e0 [xfs]
17:43:50: [<ffffffffa0237bf6>] xfs_vn_mkdir+0x16/0x20 [xfs]
17:43:50: [<ffffffff811d26c7>] vfs_mkdir+0xb7/0x160
17:43:50: [<ffffffff811d853f>] SyS_mkdirat+0x6f/0xe0
17:43:50: [<ffffffff811d85c9>] SyS_mkdir+0x19/0x20
17:43:50: [<ffffffff81614a29>] system_call_fastpath+0x16/0x1b
17:43:50: INFO: task app:20228 blocked for more than 120 seconds.
17:43:50: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
17:43:50: app D ffff881fffc93680 0 20228 13315 0x00000080
17:43:50: ffff883fcef7b9e8 0000000000000082 ffff883fcef7bfd8 0000000000013680
17:43:50: ffff883fcef7bfd8 0000000000013680 ffff883fcc1d0b60 ffff881fffc93f48
17:43:50: ffff881fd07ea800 ffff883fcc1d0b60 0000000000000000 0000000000000000
17:43:50: Call Trace:
17:43:50: [<ffffffff8160a13d>] io_schedule+0x9d/0x140
17:43:50: [<ffffffff81204593>] do_blockdev_direct_IO+0xc03/0x2620
17:43:50: [<ffffffffa0222e60>] ? xfs_get_blocks+0x20/0x20 [xfs]
17:43:50: [<ffffffff81206005>] __blockdev_direct_IO+0x55/0x60
17:43:50: [<ffffffffa0222e60>] ? xfs_get_blocks+0x20/0x20 [xfs]
17:43:50: [<ffffffffa0222780>] ? xfs_finish_ioend_sync+0x30/0x30 [xfs]
17:43:50: [<ffffffffa0221b2a>] xfs_vm_direct_IO+0xda/0x180 [xfs]
17:43:50: [<ffffffffa0222e60>] ? xfs_get_blocks+0x20/0x20 [xfs]
17:43:50: [<ffffffffa0222780>] ? xfs_finish_ioend_sync+0x30/0x30 [xfs]
17:43:50: [<ffffffff8115872d>] generic_file_direct_write+0xcd/0x190
17:43:50: [<ffffffffa0299636>] xfs_file_dio_aio_write+0x215/0x254 [xfs]
17:43:50: [<ffffffffa022ee0d>] xfs_file_aio_write+0x13d/0x150 [xfs]
17:43:50: [<ffffffff811c64cd>] do_sync_write+0x8d/0xd0
17:43:50: [<ffffffff811c6c6d>] vfs_write+0xbd/0x1e0
17:43:50: [<ffffffff811c76b8>] SyS_write+0x58/0xb0
17:43:50: [<ffffffff81614a29>] system_call_fastpath+0x16/0x1b
... <Many more sequences like the above.>...
17:54:21: ata7.00: exception Emask 0x0 SAct 0x7e0000ff SErr 0x0 action 0x6 frozen
17:54:21: ata8.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
17:54:21: ata8.00: failed command: FLUSH CACHE EXT
17:54:21: ata8.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 30
res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
17:54:21: ata8.00: status: { DRDY }
17:54:21: ata8: hard resetting link
17:54:21: ata9.00: exception Emask 0x0 SAct 0x1fff SErr 0x0 action 0x6 frozen
17:54:21: ata10.00: exception Emask 0x0 SAct 0x1fff SErr 0x0 action 0x6 frozen
17:54:21: ata9.00: failed command: WRITE FPDMA QUEUED
17:54:21: ata10.00: failed command: WRITE FPDMA QUEUED
17:54:21: ata9.00: cmd 61/50:00:40:08:00/00:00:00:00:00/40 tag 0 ncq 40960 out
res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
17:54:21: ata10.00: cmd 61/50:00:40:08:00/00:00:00:00:00/40 tag 0 ncq 40960 out
res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
17:54:21: ata9.00: status: { DRDY }
17:54:21: ata10.00: status: { DRDY }
17:54:21: ata9.00: failed command: WRITE FPDMA QUEUED
17:54:21: ata10.00: failed command: WRITE FPDMA QUEUED
17:54:21: ata9.00: cmd 61/18:00:08:08:00/00:00:00:00:00/40 tag 1 ncq 12288 out
res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
17:54:21: ata10.00: cmd 61/18:00:08:08:00/00:00:00:00:00/40 tag 1 ncq 12288 out
res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
17:54:21: ata9.00: status: { DRDY }
17:54:21: ata10.00: status: { DRDY }
17:54:21: ata9.00: failed command: WRITE FPDMA QUEUED
17:54:21: ata10.00: failed command: WRITE FPDMA QUEUED
17:54:21: ata9.00: cmd 61/02:00:01:08:00/00:00:00:00:00/40 tag 2 ncq 1024 out
res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
17:54:21: ata10.00: cmd 61/02:00:01:08:00/00:00:00:00:00/40 tag 2 ncq 1024 out
res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
17:54:21: ata9.00: status: { DRDY }
17:54:21: ata10.00: status: { DRDY }
17:54:21: ata9.00: failed command: WRITE FPDMA QUEUED
17:54:21: ata10.00: failed command: WRITE FPDMA QUEUED
17:54:22: ata7.00: status: { DRDY }
17:54:22: ata7: hard resetting link
17:54:24: perf interrupt took too long (2510 > 2500), lowering kernel.perf_event_max_sample_rate to 50000
17:54:31: isci 0000:08:00.0: isci_remote_device_terminate_requests host0 timeout all
17:54:31: isci 0000:08:00.0: isci_remote_device_terminate_requests host0 timeout all
17:54:31: isci 0000:08:00.0: isci_remote_device_terminate_requests host0 timeout all
17:54:32: isci 0000:08:00.0: isci_remote_device_terminate_requests host0 timeout all
17:54:56: isci 0000:08:00.0: isci_port_perform_hard_reset: iport = ffff881fce962ab8; hard reset failed (0x21)
17:54:56: isci 0000:08:00.0: isci_port_perform_hard_reset: iport = ffff881fce962bb0; hard reset failed (0x21)
17:54:56: isci 0000:08:00.0: isci_port_perform_hard_reset: iport = ffff881fce9629c0; hard reset failed (0x21)
17:54:57: isci 0000:08:00.0: isci_port_perform_hard_reset: iport = ffff881fce9628c8; hard reset failed (0x21)
17:55:06: isci 0000:08:00.0: isci_remote_device_wait_for_resume_from_abort: #### Timeout waiting for resume: ffff881fce964310
17:55:06: isci 0000:08:00.0: isci_remote_device_wait_for_resume_from_abort: #### Timeout waiting for resume: ffff881fce964490
17:55:06: isci 0000:08:00.0: isci_remote_device_wait_for_resume_from_abort: #### Timeout waiting for resume: ffff881fce9643d0
17:55:06: sas: ata8: end_device-0:1: reset failed (errno=-16)
17:55:06: sas: ata10: end_device-0:3: reset failed (errno=-16)
17:55:06: ata10: hard resetting link
17:55:06: ata8: hard resetting link
17:55:06: sas: ata9: end_device-0:2: reset failed (errno=-16)
17:55:06: ata9: hard resetting link
17:55:07: isci 0000:08:00.0: isci_remote_device_wait_for_resume_from_abort: #### Timeout waiting for resume: ffff881fce964250
17:55:07: sas: ata7: end_device-0:0: reset failed (errno=-16)
17:55:07: ata7: hard resetting link
...<More of the above.>...
17:56:22: sas: ata10: end_device-0:3: reset failed (errno=-16)
17:56:22: ata10: reset failed, giving up
17:56:22: sas: ata8: end_device-0:1: reset failed (errno=-16)
17:56:22: ata8: reset failed, giving up
17:56:22: ata8.00: disabled
17:56:22: ata8.00: device reported invalid CHS sector 0
17:56:22: ata8: EH complete
17:56:22: ata10.00: disabled
17:56:22: ata10.00: device reported invalid CHS sector 0
17:56:22: ata10.00: device reported invalid CHS sector 0
17:56:22: ata10.00: device reported invalid CHS sector 0
17:56:22: ata10.00: device reported invalid CHS sector 0
17:56:22: ata10.00: device reported invalid CHS sector 0
17:56:22: ata10.00: device reported invalid CHS sector 0
17:56:22: ata10.00: device reported invalid CHS sector 0
17:56:22: ata10.00: device reported invalid CHS sector 0
17:56:22: ata10.00: device reported invalid CHS sector 0
17:56:22: ata10.00: device reported invalid CHS sector 0
17:56:22: ata10.00: device reported invalid CHS sector 0
17:56:22: ata10.00: device reported invalid CHS sector 0
17:56:22: ata10.00: device reported invalid CHS sector 0
17:56:22: ata10: EH complete
17:56:22: sas: ata9: end_device-0:2: reset failed (errno=-16)
17:56:22: ata9: reset failed, giving up
17:56:22: ata9.00: disabled
17:56:22: ata9.00: device reported invalid CHS sector 0
17:56:22: ata9.00: device reported invalid CHS sector 0
17:56:22: ata9.00: device reported invalid CHS sector 0
17:56:22: ata9.00: device reported invalid CHS sector 0
17:56:22: ata9.00: device reported invalid CHS sector 0
17:56:22: ata9.00: device reported invalid CHS sector 0
17:56:22: ata9.00: device reported invalid CHS sector 0
17:56:22: ata9.00: device reported invalid CHS sector 0
17:56:22: ata9.00: device reported invalid CHS sector 0
17:56:22: ata9.00: device reported invalid CHS sector 0
17:56:22: ata9.00: device reported invalid CHS sector 0
17:56:22: ata9.00: device reported invalid CHS sector 0
17:56:22: ata9.00: device reported invalid CHS sector 0
17:56:22: ata9: EH complete
17:56:22: isci 0000:08:00.0: isci_remote_device_wait_for_resume_from_abort: #### Timeout waiting for resume: ffff881fce964250
17:56:22: sas: ata7: end_device-0:0: reset failed (errno=-16)
17:56:22: ata7: reset failed, giving up
17:56:22: ata7.00: disabled
17:56:22: ata7.00: device reported invalid CHS sector 0
17:56:22: ata7.00: device reported invalid CHS sector 0
17:56:22: ata7.00: device reported invalid CHS sector 0
17:56:22: ata7.00: device reported invalid CHS sector 0
17:56:22: ata7.00: device reported invalid CHS sector 0
17:56:22: ata7.00: device reported invalid CHS sector 0
17:56:22: ata7.00: device reported invalid CHS sector 0
17:56:22: ata7.00: device reported invalid CHS sector 0
17:56:22: ata7.00: device reported invalid CHS sector 0
17:56:22: ata7.00: device reported invalid CHS sector 0
17:56:22: ata7.00: device reported invalid CHS sector 0
17:56:22: ata7.00: device reported invalid CHS sector 0
17:56:22: ata7.00: device reported invalid CHS sector 0
17:56:22: ata7.00: device reported invalid CHS sector 0
17:56:22: ata7: EH complete
17:56:22: sd 0:0:0:0: [sda]
17:56:22: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
17:56:22: sd 0:0:0:0: [sda] CDB:
17:56:22: Write(10): 2a 00 08 70 08 01 00 00 02 00
17:56:22: end_request: I/O error, dev sda, sector 141559809
17:56:22: sd 0:0:0:0: [sda]
17:56:22: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
17:56:22: XFS (sda1): metadata I/O error: block 0x8700001 ("xfs_buf_iodone_callbacks") error 5 numblks 1
17:56:22: XFS (sdc1): metadata I/O error: block 0x19500040 ("xfs_buf_iodone_callbacks") error 5 numblks 16
17:56:22: XFS (sdd1): metadata I/O error: block 0x19500040 ("xfs_buf_iodone_callbacks") error 5 numblks 16
17:56:22: XFS (sda1): metadata I/O error: block 0x8700002 ("xfs_buf_iodone_callbacks") error 5 numblks 1
17:56:22: XFS (sdc1): metadata I/O error: block 0x19500050 ("xfs_buf_iodone_callbacks") error 5 numblks 16
17:56:22: XFS (sdd1): metadata I/O error: block 0x19500050 ("xfs_buf_iodone_callbacks") error 5 numblks 16
17:56:22: Buffer I/O error on device sdb1, logical block 35424025
17:56:22: lost page write due to I/O error on sdb1