I am using Linux software RAID (raid1) on my server. Last Saturday one of the disks failed; I can see the following errors in the logs:
Mar 16 08:38:40 storage-1 kernel: [694968.826388] ata2.01: status: { DRDY ERR }
Mar 16 08:38:40 storage-1 kernel: [694968.826412] ata2.01: error: { UNC }
Mar 16 08:38:40 storage-1 kernel: [694968.848390] ata2.00: configured for UDMA/133
Mar 16 08:38:40 storage-1 kernel: [694968.864359] ata2.01: configured for UDMA/133
Mar 16 08:38:40 storage-1 kernel: [694968.864366] sd 1:0:1:0: [sdc] Unhandled sense code
Mar 16 08:38:40 storage-1 kernel: [694968.864368] sd 1:0:1:0: [sdc] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Mar 16 08:38:40 storage-1 kernel: [694968.864371] sd 1:0:1:0: [sdc] Sense Key : Medium Error [current] [descriptor]
Mar 16 08:38:40 storage-1 kernel: [694968.864374] Descriptor sense data with sense descriptors (in hex):
Mar 16 08:38:40 storage-1 kernel: [694968.864376] 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
Mar 16 08:38:40 storage-1 kernel: [694968.864382] 05 10 b7 3f
Mar 16 08:38:40 storage-1 kernel: [694968.864384] sd 1:0:1:0: [sdc] Add. Sense: Unrecovered read error - auto reallocate failed
Mar 16 08:38:40 storage-1 kernel: [694968.864388] sd 1:0:1:0: [sdc] CDB: Read(10): 28 00 05 10 b7 3f 00 00 90 00
Mar 16 08:38:40 storage-1 kernel: [694968.864393] end_request: I/O error, dev sdc, sector 84981567
Mar 16 08:38:40 storage-1 kernel: [694968.864421] raid1: sdc1: rescheduling sector 84981504
Mar 16 08:38:40 storage-1 kernel: [694968.864451] ata2: EH complete
Mar 16 08:38:40 storage-1 kernel: [694973.825824] ata2.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Mar 16 08:38:40 storage-1 kernel: [694973.825854] ata2.01: failed command: READ DMA
Mar 16 08:38:40 storage-1 kernel: [694973.825880] ata2.01: cmd c8/00:20:3f:ba:10/00:00:00:00:00/f5 tag 0 dma 16384 in
Mar 16 08:38:40 storage-1 kernel: [694973.825882] res 51/40:20:3f:ba:10/00:00:00:00:00/f5 Emask 0x9 (media error)
But when I check with cat /proc/mdstat, mdadm has not detected the disk failure: the disk is still an active member of the md3 array, as shown below:
rivo@storage-1:~$ cat /proc/mdstat
Personalities : [raid1]
md3 : active raid1 sdc1[0] sdd1[1]
976759936 blocks [2/2] [UU]
This causes I/O problems that slow down access to the server.
Does anyone know why mdadm did not detect this disk failure and automatically remove the failed disk from the array?
Is there some way to configure mdadm better so that it detects this kind of failure in the future?
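In the meantime I know I can kick the member out by hand with mdadm (device names taken from my mdstat output above), but I would rather it happened automatically:

```shell
# Mark the failing member faulty, then remove it from the array:
mdadm --manage /dev/md3 --fail /dev/sdc1
mdadm --manage /dev/md3 --remove /dev/sdc1

# After replacing the disk and recreating the partition,
# add it back so the mirror resyncs:
mdadm --manage /dev/md3 --add /dev/sdc1
```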
Answer 1
mdadm does not monitor the drives for media problems; it only knows whether a member disk is present and can be kept in sync. (This is not the exact explanation; maybe someone else knows more and will write about it.) For better supervision of the drives, use smartmontools and its daemon, smartd. If you want to receive mail when an error is detected, the configuration file (/etc/smartd.conf) should contain something like:
/dev/sda -d ata -H -m [email protected]
/dev/sdb -d ata -H -m [email protected]
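After editing /etc/smartd.conf, smartd has to pick up the change. The exact service name is an assumption here and varies by distribution; a sketch:

```shell
# Restart the daemon so it rereads its config.
# On Debian/Ubuntu the init script is usually "smartmontools",
# on other distributions often just "smartd":
/etc/init.d/smartmontools restart

# Optionally kick off a short SMART self-test on the suspect drive:
smartctl -t short /dev/sdc
```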
To check drive information, use smartctl:
smartctl -a /dev/sda