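I have a pair of identically configured HP DL320e servers, each with two WD Red 6TB drives in a software RAID 1 array. The DL320e has an onboard RAID controller, which is disabled in favour of Linux software RAID.
Both machines appear to run fine and the RAID arrays look healthy, except that every time raid-check runs (Sunday at 1 am from the default weekly crontab, or whenever I run raid-check manually), one drive drops offline. After that, the device files for the "failed" drive are gone (e.g. /dev/sda2), but they reappear after a cold reboot, and the "failed" drive can then be re-added to the array and appears to work normally.
This has been happening since the (brand new) machines and disks were installed a few months ago. According to smartctl, none of the drives has reallocated any bad sectors, so, following several posts elsewhere, I tried using hdparm to overwrite the sectors identified in /var/log/messages in order to force the drive to detect and reallocate bad sectors. It had no effect.
I also tried writing zeros over the whole of /dev/sdb2 and /dev/sdb3 with dd. This completed without any errors and did not cause any sectors to be reallocated either, which seems to indicate that the entire drive surface can be written successfully.
I have run all the SMART diagnostics with smartctl and everything passes.
Since these machines are freshly installed, both systems show the failure, and at least 3 of the 4 drives have "failed" (on one machine both drives have failed at different times), I'm reluctant to believe this is caused by a hardware fault. The dd from /dev/zero over one of the failing drives ran to completion, which shows that the whole surface of that drive is writable.
The drives are configured with 3 partitions: biosboot, /boot, and root + /home.
The logs on the two servers are broadly identical, although the reported sector numbers differ between the servers and also change from week to week on the same server.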
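To compare the failing sector numbers from week to week, the `cmd` lines in the kernel log can be decoded into LBA, sector count and NCQ tag. The following is a minimal Python sketch, not a definitive tool. The function name `decode_ncq_cmd` is hypothetical, and the field layout is my assumption about how libata prints the taskfile for READ/WRITE FPDMA QUEUED (sector count in the feature register, tag in the upper bits of the count register).

```python
import re

# Taskfile dump printed by libata for a failed command, for example:
#   cmd 60/80:00:80:f6:dd/00:00:08:00:00/40 tag 0 ncq 65536 in
# Assumed field order:
#   command/feature:count:lbal:lbam:lbah/
#   hob_feature:hob_count:hob_lbal:hob_lbam:hob_lbah/device
_HEX = r"([0-9a-f]{2})"
CMD_RE = re.compile(
    "cmd " + _HEX + "/" + ":".join([_HEX] * 5)
    + "/" + ":".join([_HEX] * 5) + "/" + _HEX
)


def decode_ncq_cmd(line):
    """Return LBA, sector count and tag for an NCQ cmd line, else None."""
    m = CMD_RE.search(line)
    if m is None:
        return None
    (cmd, feat, count, lbal, lbam, lbah,
     hfeat, hcount, hlbal, hlbam, hlbah, dev) = (int(x, 16) for x in m.groups())
    if cmd not in (0x60, 0x61):  # READ / WRITE FPDMA QUEUED only
        return None
    # The 48-bit LBA is split across the low and high-order (hob) registers.
    lba = ((hlbah << 40) | (hlbam << 32) | (hlbal << 24)
           | (lbah << 16) | (lbam << 8) | lbal)
    # For NCQ commands the feature register holds the sector count and
    # bits 7:3 of the count register hold the tag.
    return {"lba": lba, "sectors": (hfeat << 8) | feat, "tag": count >> 3}


sample = "ata1.00: cmd 60/80:00:80:f6:dd/00:00:08:00:00/40 tag 0 ncq 65536 in"
print(decode_ncq_cmd(sample))
```

For the sample line this gives LBA 148764288 with 128 sectors (65536 bytes), which agrees with the `end_request: I/O error, dev sda, sector 148764288` line and the Read(16) CDB logged for the same request.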
/proc/mdstat reports:
sh-4.2# cat /proc/mdstat
Personalities : [raid1]
md126 : active raid1 sda3[0] sdb3[1]
5859876672 blocks super 1.2 [2/2] [UU]
bitmap: 2/44 pages [8KB], 65536KB chunk
md127 : active raid1 sda2[2] sdb2[3]
511936 blocks super 1.0 [2/2] [UU]
unused devices: <none>
sh-4.2#
Things carry on like this until Sunday at 1 am, then:
WARNING: Your hard drive is failing
Device: /dev/sda [SAT], unable to open device
sh-4.2# cat /proc/mdstat
Personalities : [raid1]
md126 : active raid1 sda3[0](F) sdb3[1]
5859876672 blocks super 1.2 [2/1] [_U]
bitmap: 5/44 pages [20KB], 65536KB chunk
md127 : active raid1 sda2[2](F) sdb2[3]
511936 blocks super 1.0 [2/1] [_U]
unused devices: <none>
sh-4.2#
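Because the drive only drops out during the weekly check, it can help to detect the degraded state automatically rather than by reading /proc/mdstat by hand. Below is a small Python sketch that parses mdstat text and lists arrays with missing members. The function name `degraded_arrays` is hypothetical, and the parsing assumes the two-line layout shown in the output above (an `mdNNN :` header line followed by a status line ending in something like `[2/1] [_U]`).

```python
import re

HEADER_RE = re.compile(r"^\s*(md\d+) : (.*)$")
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\] \[([U_]+)\]")
FAILED_RE = re.compile(r"(\w+)\[\d+\]\(F\)")


def degraded_arrays(mdstat_text):
    """Map each degraded array name to its failed members and status string."""
    result = {}
    current = None
    failed = []
    for line in mdstat_text.splitlines():
        head = HEADER_RE.match(line)
        if head:
            current = head.group(1)
            # Members flagged (F) on the header line have been failed by md.
            failed = FAILED_RE.findall(head.group(2))
            continue
        status = STATUS_RE.search(line)
        if current and status:
            # An underscore in the [UU] map marks a missing or failed slot.
            if "_" in status.group(3):
                result[current] = {"failed": failed, "status": status.group(3)}
            current = None
    return result


# Sample taken from the degraded output above.
DEGRADED = """Personalities : [raid1]
md126 : active raid1 sda3[0](F) sdb3[1]
5859876672 blocks super 1.2 [2/1] [_U]
bitmap: 5/44 pages [20KB], 65536KB chunk
md127 : active raid1 sda2[2](F) sdb2[3]
511936 blocks super 1.0 [2/1] [_U]
unused devices: <none>
"""

for name, info in degraded_arrays(DEGRADED).items():
    print(name, info["status"], info["failed"])
```

On a live system the same function could be fed the contents of /proc/mdstat, for example from a cron job that runs shortly after raid-check.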
/var/log/messages reports:
Jun 7 01:00:01 1000 kernel: md: data-check of RAID array md126
Jun 7 01:00:01 1000 kernel: md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
Jun 7 01:00:01 1000 kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
Jun 7 01:00:01 1000 kernel: md: using 128k window, over a total of 5859876672k.
Jun 7 01:00:07 1000 kernel: md: delaying data-check of md127 until md126 has finished (they share one or more physical units)
Jun 7 01:01:01 1000 systemd: Starting Session 1544 of user root.
Jun 7 01:01:01 1000 systemd: Started Session 1544 of user root.
Jun 7 01:03:43 1000 kernel: ata1.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x40000 action 0x6 frozen
Jun 7 01:03:43 1000 kernel: ata1: SError: { CommWake }
Jun 7 01:03:43 1000 kernel: ata1.00: failed command: READ FPDMA QUEUED
Jun 7 01:03:43 1000 kernel: ata1.00: cmd 60/80:00:80:1b:70/00:00:03:00:00/40 tag 0 ncq 65536 in
res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Jun 7 01:03:43 1000 kernel: ata1.00: status: { DRDY }
Jun 7 01:03:43 1000 kernel: ata1.00: failed command: READ FPDMA QUEUED
Jun 7 01:03:43 1000 kernel: ata1.00: cmd 60/80:08:00:1c:70/00:00:03:00:00/40 tag 1 ncq 65536 in
res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jun 7 01:03:43 1000 kernel: ata1.00: status: { DRDY }
Jun 7 01:03:43 1000 kernel: ata1.00: failed command: READ FPDMA QUEUED
Jun 7 01:03:43 1000 kernel: ata1.00: cmd 60/80:10:00:0d:70/00:00:03:00:00/40 tag 2 ncq 65536 in
res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
(repeated with increasing tag values up to 30, then)
Jun 7 01:07:10 1000 kernel: ata1.00: cmd 60/80:f0:00:cd:7f/00:00:06:00:00/40 tag 30 ncq 65536 in
res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jun 7 01:07:10 1000 kernel: ata1.00: status: { DRDY }
Jun 7 01:07:10 1000 kernel: ata1: hard resetting link
Jun 7 01:07:11 1000 kernel: ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Jun 7 01:07:11 1000 kernel: ata1.00: configured for UDMA/133
Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun 7 01:07:11 1000 kernel: ata1: EH complete
Jun 7 01:09:53 1000 kernel: ata1.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x40000 action 0x6 frozen
Jun 7 01:09:53 1000 kernel: ata1: SError: { CommWake }
Jun 7 01:09:53 1000 kernel: ata1.00: failed command: READ FPDMA QUEUED
Jun 7 01:09:53 1000 kernel: ata1.00: cmd 60/80:00:80:f6:dd/00:00:08:00:00/40 tag 0 ncq 65536 in
res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Jun 7 01:09:53 1000 kernel: ata1.00: status: { DRDY }
Jun 7 01:09:53 1000 kernel: ata1.00: failed command: READ FPDMA QUEUED
Jun 7 01:09:53 1000 kernel: ata1.00: cmd 60/80:08:00:f7:dd/00:00:08:00:00/40 tag 1 ncq 65536 in
res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jun 7 01:09:53 1000 kernel: ata1.00: status: { DRDY }
(repeated up to tag 30)
Jun 7 01:09:53 1000 kernel: ata1.00: failed command: READ FPDMA QUEUED
Jun 7 01:09:53 1000 kernel: ata1.00: cmd 60/80:f0:00:f6:dd/00:00:08:00:00/40 tag 30 ncq 65536 in
res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jun 7 01:09:53 1000 kernel: ata1.00: status: { DRDY }
Jun 7 01:09:53 1000 kernel: ata1: hard resetting link
Jun 7 01:09:59 1000 kernel: ata1: link is slow to respond, please be patient (ready=0)
Jun 7 01:10:01 1000 systemd: Starting Session 1545 of user root.
Jun 7 01:10:01 1000 systemd: Started Session 1545 of user root.
Jun 7 01:10:03 1000 kernel: ata1: COMRESET failed (errno=-16)
Jun 7 01:10:03 1000 kernel: ata1: hard resetting link
Jun 7 01:10:04 1000 kernel: ata1: SATA link down (SStatus 0 SControl 300)
Jun 7 01:10:09 1000 kernel: ata1: hard resetting link
Jun 7 01:10:09 1000 kernel: ata1: SATA link down (SStatus 0 SControl 300)
Jun 7 01:10:09 1000 kernel: ata1: limiting SATA link speed to 1.5 Gbps
Jun 7 01:10:14 1000 kernel: ata1: hard resetting link
Jun 7 01:10:14 1000 kernel: ata1: SATA link down (SStatus 0 SControl 310)
Jun 7 01:10:14 1000 kernel: ata1.00: disabled
Jun 7 01:10:14 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun 7 01:10:14 1000 kernel: ata1.00: device reported invalid CHS sector 0
(a further block of the same message)
Jun 7 01:10:14 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun 7 01:10:14 1000 kernel: sd 0:0:0:0: [sda]
Jun 7 01:10:14 1000 kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jun 7 01:10:14 1000 kernel: sd 0:0:0:0: [sda]
Jun 7 01:10:14 1000 kernel: Sense Key : Aborted Command [current] [descriptor]
Jun 7 01:10:14 1000 kernel: Descriptor sense data with sense descriptors (in hex):
Jun 7 01:10:14 1000 kernel: 72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00
Jun 7 01:10:14 1000 kernel: 00 00 00 00
Jun 7 01:10:14 1000 kernel: sd 0:0:0:0: [sda]
Jun 7 01:10:14 1000 kernel: Add. Sense: No additional sense information
Jun 7 01:10:14 1000 kernel: sd 0:0:0:0: [sda] CDB:
Jun 7 01:10:14 1000 kernel: Read(16): 88 00 00 00 00 00 08 dd f6 80 00 00 00 80 00 00
Jun 7 01:10:14 1000 kernel: end_request: I/O error, dev sda, sector 148764288
Jun 7 01:10:14 1000 kernel: sd 0:0:0:0: rejecting I/O to offline device
Jun 7 01:10:14 1000 kernel: sd 0:0:0:0: [sda] killing request
Jun 7 01:10:14 1000 kernel: sd 0:0:0:0: [sda]
(and finally)
Jun 7 01:10:14 1000 kernel: Read(16): 88 00 00 00 00 00 08 dd fb 00 00 00 00 80 00 00
Jun 7 01:10:14 1000 kernel: end_request: I/O error, dev sda, sector 148765440
Jun 7 01:10:14 1000 kernel: ata1: EH complete
Jun 7 01:10:14 1000 kernel: md: super_written gets error=-5, uptodate=0
Jun 7 01:10:14 1000 kernel: md/raid1:md126: Disk failure on sda3, disabling device.
md/raid1:md126: Operation continuing on 1 devices.
Jun 7 01:10:14 1000 kernel: ata1.00: detaching (SCSI 0:0:0:0)
Jun 7 01:10:14 1000 kernel: sd 0:0:0:0: [sda] Stopping disk
Jun 7 01:10:14 1000 kernel: sd 0:0:0:0: [sda] START_STOP FAILED
Jun 7 01:10:14 1000 kernel: sd 0:0:0:0: [sda]
Jun 7 01:10:14 1000 kernel: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jun 7 01:10:14 1000 udisksd[3364]: Unable to resolve /sys/devices/virtual/block/md126/md/dev-sda3/block symlink
Jun 7 01:10:14 1000 kernel: md: md126: data-check interrupted.
Jun 7 01:10:14 1000 kernel: md: super_written gets error=-19, uptodate=0
Jun 7 01:10:14 1000 kernel: md/raid1:md127: Disk failure on sda2, disabling device.
md/raid1:md127: Operation continuing on 1 devices.
Jun 7 01:10:15 1000 kernel: md: md127 still in use.
Jun 7 01:10:15 1000 kernel: md: md126 still in use.
Jun 7 01:10:15 1000 udisksd[3364]: Unable to resolve /sys/devices/virtual/block/md127/md/dev-sda2/block symlink
Jun 7 01:10:15 1000 udisksd[3364]: Unable to resolve /sys/devices/virtual/block/md126/md/dev-sda3/block symlink
Jun 7 01:10:15 1000 udisksd[3364]: Unable to resolve /sys/devices/virtual/block/md127/md/dev-sda2/block symlink
Jun 7 01:10:15 1000 udisksd[3364]: Unable to resolve /sys/devices/virtual/block/md127/md/dev-sda2/block symlink
Jun 7 01:10:15 1000 udisksd[3364]: Unable to resolve /sys/devices/virtual/block/md126/md/dev-sda3/block symlink
Jun 7 01:20:01 1000 systemd: Created slice user-0.slice.
Jun 7 01:20:01 1000 systemd: Starting Session 1546 of user root.
Jun 7 01:20:01 1000 systemd: Started Session 1546 of user root.
Jun 7 01:30:01 1000 systemd: Created slice user-0.slice.
Jun 7 01:30:01 1000 systemd: Starting Session 1547 of user root.
Jun 7 01:30:01 1000 systemd: Started Session 1547 of user root.
Jun 7 01:36:58 1000 smartd[977]: Device: /dev/sda [SAT], open() failed: No such device
Jun 7 01:36:58 1000 smartd[977]: Sending warning via /usr/libexec/smartmontools/smartdnotify to root ...
Jun 7 01:36:58 1000 smartd[977]: Warning via /usr/libexec/smartmontools/smartdnotify to root produced unexpected output (80 bytes) to STDOUT/STDERR:
Jun 7 01:36:58 1000 smartd[977]: /usr/libexec/smartmontools/smartdnotify: line 13: /dev/pts/0: Permission denied
Jun 7 01:36:58 1000 smartd[977]: Warning via /usr/libexec/smartmontools/smartdnotify to root: successful
I would be very grateful if anyone could point out what might be going wrong here.