我正在使用一台新安装的远程服务器 (Dell Poweredge)。它有四个驱动器 (2TB) 和 2 个 SSD (250 GB)。一个 SSD 包含操作系统 (RHEL7),四个机械磁盘最终将包含一个 Oracle 数据库。
尝试创建软件 RAID 阵列导致磁盘不断被标记为故障。检查 dmesg 输出一系列以下错误,
[127491.711407] blk_update_request: I/O error, dev sde, sector 3907026080
[127491.719699] sd 0:0:4:0: [sde] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[127491.719717] sd 0:0:4:0: [sde] Sense Key : Aborted Command [current]
[127491.719726] sd 0:0:4:0: [sde] Add. Sense: Logical block guard check failed
[127491.719734] sd 0:0:4:0: [sde] CDB: Read(32)
[127491.719742] sd 0:0:4:0: [sde] CDB[00]: 7f 00 00 00 00 00 00 18 00 09 20 00 00 00 00 00
[127491.719750] sd 0:0:4:0: [sde] CDB[10]: e8 e0 7c a0 e8 e0 7c a0 00 00 00 00 00 00 00 08
[127491.719757] blk_update_request: I/O error, dev sde, sector 3907026080
[127491.719764] Buffer I/O error on dev sde, logical block 488378260, async page read
[127497.440222] sd 0:0:5:0: [sdf] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[127497.440240] sd 0:0:5:0: [sdf] Sense Key : Aborted Command [current]
[127497.440249] sd 0:0:5:0: [sdf] Add. Sense: Logical block guard check failed
[127497.440258] sd 0:0:5:0: [sdf] CDB: Read(32)
[127497.440266] sd 0:0:5:0: [sdf] CDB[00]: 7f 00 00 00 00 00 00 18 00 09 20 00 00 00 00 00
[127497.440273] sd 0:0:5:0: [sdf] CDB[10]: 00 01 a0 00 00 01 a0 00 00 00 00 00 00 00 00 08
[127497.440280] blk_update_request: I/O error, dev sdf, sector 106496
[127497.901432] sd 0:0:5:0: [sdf] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[127497.901449] sd 0:0:5:0: [sdf] Sense Key : Aborted Command [current]
[127497.901458] sd 0:0:5:0: [sdf] Add. Sense: Logical block guard check failed
[127497.901467] sd 0:0:5:0: [sdf] CDB: Read(32)
[127497.901475] sd 0:0:5:0: [sdf] CDB[00]: 7f 00 00 00 00 00 00 18 00 09 20 00 00 00 00 00
[127497.901482] sd 0:0:5:0: [sdf] CDB[10]: e8 e0 7c a0 e8 e0 7c a0 00 00 00 00 00 00 00 08
[127497.901489] blk_update_request: I/O error, dev sdf, sector 3907026080
[127497.911003] sd 0:0:5:0: [sdf] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[127497.911019] sd 0:0:5:0: [sdf] Sense Key : Aborted Command [current]
[127497.911029] sd 0:0:5:0: [sdf] Add. Sense: Logical block guard check failed
[127497.911037] sd 0:0:5:0: [sdf] CDB: Read(32)
[127497.911045] sd 0:0:5:0: [sdf] CDB[00]: 7f 00 00 00 00 00 00 18 00 09 20 00 00 00 00 00
[127497.911052] sd 0:0:5:0: [sdf] CDB[10]: e8 e0 7c a0 e8 e0 7c a0 00 00 00 00 00 00 00 08
[127497.911059] blk_update_request: I/O error, dev sdf, sector 3907026080
[127497.911067] Buffer I/O error on dev sdf, logical block 488378260, async page read
所有四个机械磁盘 (sdc/sdd/sde/sdf) 都出现了这些错误。SMARTctl 通过了所有四个磁盘的长短测试。我目前正在运行坏块 (写入模式测试已进行约 35 小时,可能还需要 35 小时)。
以下是我研究后怀疑/考虑的错误
硬盘故障 — 4 个“翻新”磁盘似乎不太可能发生故障,不是吗?
存储控制器问题(电缆不好?)——似乎也会影响 SSD?
- 内核问题,对原有内核的唯一更改是添加了 kmod-oracleasm。我真的不明白它如何会导致这些错误,ASM 根本没有设置。
另一个值得注意的事件是,当尝试将磁盘归零(早期故障排除的一部分)时,使用命令 $ dd if=/dev/zero of=/dev/sdX 产生了这些错误,
dd: writing to ‘/dev/sdc’: Input/output error
106497+0 records in
106496+0 records out
54525952 bytes (55 MB) copied, 1.70583 s, 32.0 MB/s
dd: writing to ‘/dev/sdd’: Input/output error
106497+0 records in
106496+0 records out
54525952 bytes (55 MB) copied, 1.70417 s, 32.0 MB/s
dd: writing to ‘/dev/sde’: Input/output error
106497+0 records in
106496+0 records out
54525952 bytes (55 MB) copied, 1.71813 s, 31.7 MB/s
dd: writing to ‘/dev/sdf’: Input/output error
106497+0 records in
106496+0 records out
54525952 bytes (55 MB) copied, 1.71157 s, 31.9 MB/s
如果这里有人能分享一些关于导致这种情况的原因的见解,我将不胜感激。我倾向于遵循奥卡姆剃刀原理,直接找到硬盘,唯一的疑问来自于开箱时四块硬盘发生故障的可能性不大。
我明天会开车去现场进行实物检查,并向上级汇报我对这台机器的评估。如果我需要进行实物检查(除了电缆/连接/电源),请告诉我。
谢谢。
答案1
您的dd
测试显示四个磁盘全部发生故障相同的 LBA地址。由于四个磁盘同时在同一个位置发生故障的可能性极小,因此我强烈怀疑这是由于控制器或电缆问题造成的。