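I have a bit of a problem here. I have an Ubuntu Linux server with 2 SAS drives set up in a software RAID 1 (created with mdadm). The RAID runs fine for about a day: I can run cat /proc/mdstat and it shows both disks active and everything healthy. Then, unexpectedly, the second disk fails and the array goes into degraded mode.
I then remove the disk from the RAID group, reboot the server, and re-add the disk to the group. The RAID rebuilds itself without any issue and I have a healthy RAID 1 running again on the same disks. Then, within 12-24 hours or so, the second drive fails again.
The hard drives are brand new, so I don't think there is a hardware problem. Below is the output I was able to capture from kern.log and syslog at the time the disk failed.
Can somebody interpret this, or does anyone know what may be going on?
Thanks!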
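For catching the moment the array degrades rather than noticing it hours later, the /proc/mdstat check can be scripted. The following is a minimal sketch, not a monitoring tool: the sample text is an illustrative assumption (it is not output captured from this server), and the function name `degraded_arrays` is hypothetical.

```python
import re

# Illustrative /proc/mdstat text for a degraded two-disk mirror.
# ASSUMPTION: this sample is made up for the example, not taken from the server.
SAMPLE_MDSTAT = """\
Personalities : [raid1]
md0 : active raid1 sda1[0] sdb1[1](F)
      35545088 blocks [2/1] [U_]

unused devices: <none>
"""


def degraded_arrays(mdstat_text):
    """Return {array: (expected, active, faulty_members)} for degraded arrays."""
    result = {}
    current = None
    faulty = []
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+)\s*:\s*(.*)$", line)
        if header:
            current = header.group(1)
            # Members that md has marked faulty carry an (F) suffix.
            faulty = re.findall(r"(\w+)\[\d+\]\(F\)", header.group(2))
            continue
        # The status line looks like "... [2/1] [U_]": expected/active members.
        status = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
        if status and current:
            expected, active = int(status.group(1)), int(status.group(2))
            if active < expected:
                result[current] = (expected, active, faulty)
            current = None
    return result


if __name__ == "__main__":
    # On a real system, read the live file instead of the sample:
    #   with open("/proc/mdstat") as f: text = f.read()
    print(degraded_arrays(SAMPLE_MDSTAT))
```

Run from cron or a loop, a check like this would at least timestamp the failure closely enough to line it up with the kernel messages.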
kern.log:
Feb 28 20:34:55 CSTEP-APPS20 kernel: [ 9.180815] sd 2:0:0:0: Attached scsi generic sg1 type 0
Feb 28 20:34:55 CSTEP-APPS20 kernel: [ 9.181086] sd 2:0:1:0: Attached scsi generic sg2 type 0
Feb 28 20:34:55 CSTEP-APPS20 kernel: [ 9.181376] sd 2:0:1:0: [sdb] 71096640 512-byte logical blocks: (36.4 GB/33.9 GiB)
Feb 28 20:34:55 CSTEP-APPS20 kernel: [ 9.182584] sd 2:0:1:0: [sdb] Write Protect is off
Feb 28 20:34:55 CSTEP-APPS20 kernel: [ 9.182591] sd 2:0:1:0: [sdb] Mode Sense: cb 00 10 08
Feb 28 20:34:55 CSTEP-APPS20 kernel: [ 9.182835] sd 2:0:0:0: [sda] 71096640 512-byte logical blocks: (36.4 GB/33.9 GiB)
Feb 28 20:34:55 CSTEP-APPS20 kernel: [ 9.183802] sd 2:0:1:0: [sdb] Write cache: disabled, read cache: enabled, supports DPO and FUA
Feb 28 20:34:55 CSTEP-APPS20 kernel: [ 9.185146] sd 2:0:0:0: [sda] Write Protect is off
Feb 28 20:34:55 CSTEP-APPS20 kernel: [ 9.185151] sd 2:0:0:0: [sda] Mode Sense: cb 00 10 08
Feb 28 20:34:55 CSTEP-APPS20 kernel: [ 9.188191] sd 2:0:0:0: [sda] Write cache: disabled, read cache: enabled, supports DPO and FUA
Feb 28 20:34:55 CSTEP-APPS20 kernel: [ 9.191403] sd 2:0:1:0: [sdb] Attached SCSI disk
Feb 28 20:34:55 CSTEP-APPS20 kernel: [ 9.299351] sd 2:0:0:0: [sda] Attached SCSI disk
Mar 1 09:01:22 CSTEP-APPS20 kernel: [44807.010040] sd 2:0:1:0: [sdb] CDB: Read(10): 28 00 03 4b c7 88 00 00 10 00
Mar 1 09:01:32 CSTEP-APPS20 kernel: [44817.560056] sd 2:0:1:0: [sdb] CDB: Test Unit Ready: 00 00 00 00 00 00
Mar 1 09:02:03 CSTEP-APPS20 kernel: [44848.470035] sd 2:0:1:0: [sdb] CDB: Read(10): 28 00 03 4b c7 c0 00 00 80 00
Mar 1 09:02:03 CSTEP-APPS20 kernel: [44848.720124] sd 2:0:1:0: [sdb] CDB: Read(10): 28 00 03 4b c7 88 00 00 10 00
Mar 1 09:02:04 CSTEP-APPS20 kernel: [44849.512078] sd 2:0:1:0: [sdb] CDB: Read(10): 28 00 03 4b c7 88 00 00 10 00
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.380147] sd 2:0:1:0: Device offlined - not ready after error recovery
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.380153] sd 2:0:1:0: Device offlined - not ready after error recovery
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.380167] sd 2:0:1:0: rejecting I/O to offline device
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.380285] sd 2:0:1:0: rejecting I/O to offline device
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.380403] sd 2:0:1:0: [sdb] Unhandled error code
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.380407] sd 2:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.380416] sd 2:0:1:0: [sdb] CDB: Read(10): 28 00 03 4b c7 88 00 00 10 00
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.380677] sd 2:0:1:0: [sdb] Unhandled error code
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.380680] sd 2:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.380684] sd 2:0:1:0: [sdb] CDB: Read(10): 28 00 03 4b c7 c0 00 00 80 00
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.380915] sd 2:0:1:0: rejecting I/O to offline device
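The CDB bytes in these lines can be decoded by hand. In a Read(10) command, bytes 2-5 hold the big-endian logical block address and bytes 7-8 the transfer length in blocks. A small sketch (the helper name `decode_read10` is hypothetical) shows that the two stuck reads map to the same sectors that the `end_request: I/O error` lines in the syslog excerpt report:

```python
def decode_read10(cdb_hex):
    """Decode a SCSI Read(10) CDB given as space-separated hex bytes."""
    cdb = bytes.fromhex(cdb_hex.replace(" ", ""))
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a Read(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2-5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length
    return lba, blocks


if __name__ == "__main__":
    for cdb in ("28 00 03 4b c7 88 00 00 10 00",
                "28 00 03 4b c7 c0 00 00 80 00"):
        lba, blocks = decode_read10(cdb)
        print(f"{cdb} -> LBA {lba}, {blocks} blocks")
```

The first CDB decodes to LBA 55297928 (16 blocks) and the second to LBA 55297984 (128 blocks). The raid1 "rescheduling sector" numbers are each 2048 lower, which would be consistent with sdb1 starting at sector 2048; that offset is an inference, not something the logs state.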
And syslog:
Mar 1 09:01:43 CSTEP-APPS20 kernel: [44827.860060] mptscsih: ioc0: WARNING - Issuing Reset from mptscsih_IssueTaskMgmt!!
Mar 1 09:01:43 CSTEP-APPS20 kernel: [44827.860070] mptbase: ioc0: Initiating recovery
Mar 1 09:02:03 CSTEP-APPS20 kernel: [44848.470023] mptscsih: ioc0: task abort: SUCCESS (sc=ffff88016197b400)
Mar 1 09:02:03 CSTEP-APPS20 kernel: [44848.470030] mptscsih: ioc0: attempting task abort! (sc=ffff880156fa4c00)
Mar 1 09:02:03 CSTEP-APPS20 kernel: [44848.470035] sd 2:0:1:0: [sdb] CDB: Read(10): 28 00 03 4b c7 c0 00 00 80 00
Mar 1 09:02:03 CSTEP-APPS20 kernel: [44848.470050] mptscsih: ioc0: task abort: SUCCESS (sc=ffff880156fa4c00)
Mar 1 09:02:03 CSTEP-APPS20 kernel: [44848.470073] scsi target2:0:0: Beginning Domain Validation
Mar 1 09:02:03 CSTEP-APPS20 kernel: [44848.720120] mptscsih: ioc0: attempting target reset! (sc=ffff88016197b400)
Mar 1 09:02:03 CSTEP-APPS20 kernel: [44848.720124] sd 2:0:1:0: [sdb] CDB: Read(10): 28 00 03 4b c7 88 00 00 10 00
Mar 1 09:02:04 CSTEP-APPS20 kernel: [44849.262008] mptscsih: ioc0: target reset: SUCCESS (sc=ffff88016197b400)
Mar 1 09:02:04 CSTEP-APPS20 kernel: [44849.512073] mptscsih: ioc0: attempting bus reset! (sc=ffff88016197b400)
Mar 1 09:02:04 CSTEP-APPS20 kernel: [44849.512078] sd 2:0:1:0: [sdb] CDB: Read(10): 28 00 03 4b c7 88 00 00 10 00
Mar 1 09:02:05 CSTEP-APPS20 kernel: [44850.046491] mptscsih: ioc0: bus reset: SUCCESS (sc=ffff88016197b400)
Mar 1 09:02:15 CSTEP-APPS20 kernel: [44860.553909] mptscsih: ioc0: attempting host reset! (sc=ffff88016197b400)
Mar 1 09:02:15 CSTEP-APPS20 kernel: [44860.553915] mptbase: ioc0: Initiating recovery
Mar 1 09:02:35 CSTEP-APPS20 kernel: [44879.870026] mptscsih: ioc0: host reset: SUCCESS (sc=ffff88016197b400)
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.380147] sd 2:0:1:0: Device offlined - not ready after error recovery
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.380153] sd 2:0:1:0: Device offlined - not ready after error recovery
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.380167] sd 2:0:1:0: rejecting I/O to offline device
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.380285] sd 2:0:1:0: rejecting I/O to offline device
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.380403] sd 2:0:1:0: [sdb] Unhandled error code
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.380407] sd 2:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.380416] sd 2:0:1:0: [sdb] CDB: Read(10): 28 00 03 4b c7 88 00 00 10 00
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.380429] end_request: I/O error, dev sdb, sector 55297928
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.380562] __ratelimit: 24 callbacks suppressed
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.380566] raid1: sdb1: rescheduling sector 55295880
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.380677] sd 2:0:1:0: [sdb] Unhandled error code
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.380680] sd 2:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.380684] sd 2:0:1:0: [sdb] CDB: Read(10): 28 00 03 4b c7 c0 00 00 80 00
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.380695] end_request: I/O error, dev sdb, sector 55297984
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.380817] raid1: sdb1: rescheduling sector 55295936
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.380915] sd 2:0:1:0: rejecting I/O to offline device
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.381019] end_request: I/O error, dev sdb, sector 63983488
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.381142] md: super_written gets error=-5, uptodate=0
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.381146] raid1: Disk failure on sdb1, disabling device.
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.381148] raid1: Operation continuing on 1 devices.
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.398144] scsi target2:0:0: Ending Domain Validation
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.398226] scsi target2:0:0: FAST-160 WIDE SCSI 320.0 MB/s DT IU RTI WRFLOW PCOMP (6.25 ns, offset 127)
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.398295] scsi target2:0:1: Beginning Domain Validation
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.648493] scsi target2:0:1: Domain Validation Initial Inquiry Failed
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.648623] scsi target2:0:1: Ending Domain Validation
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.648691] scsi target2:0:1: asynchronous
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.648760] scsi target2:0:8: Beginning Domain Validation
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.649386] scsi target2:0:8: Ending Domain Validation
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.649458] scsi target2:0:8: asynchronous
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.653384] RAID1 conf printout:
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.653390] --- wd:1 rd:2
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.653395] disk 0, wo:0, o:1, dev:sda1
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.653399] disk 1, wo:1, o:0, dev:sdb1
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.693763] RAID1 conf printout:
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.693767] --- wd:1 rd:2
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.693771] disk 0, wo:0, o:1, dev:sda1
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.714266] raid1: sda1: redirecting sector 55295880 to another mirror
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.719943] raid1: sda1: redirecting sector 55295936 to another mirror
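Read in order, the mptscsih messages show the SCSI error handler escalating step by step before the kernel gives up on the disk. The sketch below pulls that sequence out of a few lines copied from the syslog excerpt above (date and hostname prefix trimmed); the function name `escalation` is hypothetical and the patterns only cover the message shapes seen in this log.

```python
import re

# Lines copied from the syslog excerpt, with the "Mar 1 ... kernel:" prefix trimmed.
LOG = """\
[44848.470030] mptscsih: ioc0: attempting task abort! (sc=ffff880156fa4c00)
[44848.470050] mptscsih: ioc0: task abort: SUCCESS (sc=ffff880156fa4c00)
[44848.720120] mptscsih: ioc0: attempting target reset! (sc=ffff88016197b400)
[44849.262008] mptscsih: ioc0: target reset: SUCCESS (sc=ffff88016197b400)
[44849.512073] mptscsih: ioc0: attempting bus reset! (sc=ffff88016197b400)
[44850.046491] mptscsih: ioc0: bus reset: SUCCESS (sc=ffff88016197b400)
[44860.553909] mptscsih: ioc0: attempting host reset! (sc=ffff88016197b400)
[44879.870026] mptscsih: ioc0: host reset: SUCCESS (sc=ffff88016197b400)
[44890.380147] sd 2:0:1:0: Device offlined - not ready after error recovery
"""

STEP = re.compile(r"\[\s*(\d+\.\d+)\] mptscsih: ioc0: attempting (.+?)! ")
OFFLINE = re.compile(r"\[\s*(\d+\.\d+)\] sd (\S+): Device offlined")


def escalation(log_text):
    """Return [(uptime_seconds, step)] for each recovery attempt and the offlining."""
    steps = []
    for line in log_text.splitlines():
        m = STEP.search(line)
        if m:
            steps.append((float(m.group(1)), m.group(2)))
            continue
        m = OFFLINE.search(line)
        if m:
            steps.append((float(m.group(1)), "offlined " + m.group(2)))
    return steps


if __name__ == "__main__":
    steps = escalation(LOG)
    for uptime, step in steps:
        print(f"{uptime:>14.6f}  {step}")
    print(f"elapsed: {steps[-1][0] - steps[0][0]:.1f} s")
```

The order is task abort, target reset, bus reset, host reset, and only then is 2:0:1:0 taken offline, roughly 42 seconds after the first abort. Each reset reports SUCCESS, yet the drive still does not answer afterwards, which is why md ends up kicking sdb1 out of the mirror.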
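Answer 1
It looks like the device /dev/sdb is going offline. You may have a cabling problem, but a disk problem is just as likely. There could also be a conflict between the disk firmware and the controller.
I would run the manufacturer's diagnostics on the disks right away. Being brand new does not rule out a defect. (In fact, I am somewhat more suspicious of brand-new disks than of disks that have already been running for a few months.)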
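Answer 2
I don't see why you assume the drives are fine. Even new drives can be bad. Heck, in my professional experience, early-life failures are as common as old-age failures. That is why many shops put their equipment through a burn-in period.
Replace the drive with a known-good one and see what happens, or at least check the bad block count via SMART or a diagnostic tool.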
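As a rough sketch of the SMART check, assuming smartmontools is installed and the drive reports in the layout smartctl typically uses for SCSI/SAS disks: the two fields matched below ("SMART Health Status" and "Elements in grown defect list") are what smartctl usually prints for such drives, while ATA drives use a different attribute table. The sample text is illustrative only, not output from the drives in question, and both function names are hypothetical.

```python
import re
import subprocess

# ASSUMPTION: illustrative smartctl output for a SCSI/SAS disk, not real data.
SAMPLE_OUTPUT = """\
SMART Health Status: OK
Elements in grown defect list: 3
"""


def run_smartctl(device):
    """Run `smartctl -a` on a device (needs smartmontools, usually as root)."""
    # smartctl uses its exit status as a bitmask, so do not treat non-zero as fatal.
    proc = subprocess.run(["smartctl", "-a", device],
                          capture_output=True, text=True, check=False)
    return proc.stdout


def summarize_scsi_smart(text):
    """Extract the health verdict and grown defect count, if present."""
    health = re.search(r"SMART Health Status:\s*(.+)", text)
    defects = re.search(r"Elements in grown defect list:\s*(\d+)", text)
    return {
        "health": health.group(1).strip() if health else None,
        "grown_defects": int(defects.group(1)) if defects else None,
    }


if __name__ == "__main__":
    # On the real server: print(summarize_scsi_smart(run_smartctl("/dev/sdb")))
    print(summarize_scsi_smart(SAMPLE_OUTPUT))
```

A grown defect count that keeps rising between runs would point at the disk itself. A clean report does not clear the cable or the controller, so swapping in a known-good drive is still the more decisive test.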