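I have a CentOS 5.4 server in production with two drives in software RAID1.
Over the last few days, /var/log/messages has filled up with messages indicating that one of the drives is about to fail: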
Sep 23 00:48:38 milkyway kernel: SCSI device sda: 1465149168 512-byte hdwr sectors (750156 MB)
Sep 23 00:48:39 milkyway kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Sep 23 00:48:39 milkyway kernel: ata1.00: irq_stat 0x40000001
Sep 23 00:48:39 milkyway kernel: ata1.00: cmd 25/00:10:31:21:8c/00:00:28:00:00/e0 tag 0 dma 8192 in
Sep 23 00:48:40 milkyway kernel: res 51/40:00:35:21:8c/00:00:28:00:00/e0 Emask 0x9 (media error)
Sep 23 00:48:40 milkyway kernel: ata1.00: status: { DRDY ERR }
Sep 23 00:48:40 milkyway kernel: ata1.00: error: { UNC }
Sep 23 00:48:40 milkyway kernel: ata1.00: configured for UDMA/133
Sep 23 00:48:40 milkyway kernel: ata1: EH complete
Sep 23 00:48:41 milkyway kernel: sda: Write Protect is off
Sep 23 00:48:41 milkyway kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Sep 23 00:48:58 milkyway kernel: ata1.00: irq_stat 0x40000001
Sep 23 00:49:00 milkyway kernel: ata1.00: cmd 25/00:10:31:21:8c/00:00:28:00:00/e0 tag 0 dma 8192 in
Sep 23 00:49:03 milkyway kernel: res 51/40:00:35:21:8c/00:00:28:00:00/e0 Emask 0x9 (media error)
Sep 23 00:49:03 milkyway kernel: ata1.00: status: { DRDY ERR }
Sep 23 00:49:04 milkyway kernel: ata1.00: error: { UNC }
Sep 23 00:49:04 milkyway kernel: ata1.00: configured for UDMA/133
Sep 23 00:49:04 milkyway kernel: ata1: EH complete
Sep 23 00:49:04 milkyway kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Sep 23 00:49:04 milkyway kernel: ata1.00: irq_stat 0x40000001
Sep 23 00:49:04 milkyway kernel: ata1.00: cmd 25/00:10:31:21:8c/00:00:28:00:00/e0 tag 0 dma 8192 in
Sep 23 00:49:04 milkyway kernel: res 51/40:00:35:21:8c/00:00:28:00:00/e0 Emask 0x9 (media error)
Sep 23 00:49:04 milkyway kernel: ata1.00: status: { DRDY ERR }
Sep 23 00:49:04 milkyway kernel: ata1.00: error: { UNC }
Sep 23 00:49:04 milkyway kernel: ata1.00: configured for UDMA/133
Sep 23 00:49:05 milkyway kernel: ata1: EH complete
Sep 23 00:49:05 milkyway kernel: SCSI device sda: drive cache: write back
Sep 23 00:49:06 milkyway kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Sep 23 00:49:06 milkyway kernel: ata1.00: irq_stat 0x40000001
Sep 23 00:49:06 milkyway kernel: ata1.00: cmd 25/00:10:31:21:8c/00:00:28:00:00/e0 tag 0 dma 8192 in
Sep 23 00:49:06 milkyway kernel: res 51/40:00:35:21:8c/00:00:28:00:00/e0 Emask 0x9 (media error)
Sep 23 00:49:06 milkyway kernel: ata1.00: status: { DRDY ERR }
Sep 23 00:49:06 milkyway kernel: ata1.00: error: { UNC }
Sep 23 00:49:06 milkyway kernel: ata1.00: configured for UDMA/133
Sep 23 00:49:08 milkyway kernel: sd 0:0:0:0: SCSI error: return code = 0x08000002
However, /proc/mdstat does not show any of the arrays as degraded:
Personalities : [raid1] [raid10] [raid0] [raid6] [raid5] [raid4]
md0 : active raid1 sdb1[1] sda1[0]
4200896 blocks [2/2] [UU]
md1 : active raid1 sdb2[1] sda2[0]
2104448 blocks [2/2] [UU]
md2 : active raid1 sdb3[1] sda3[0]
726266432 blocks [2/2] [UU]
unused devices: <none>
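I have started migrating all the data to a new server. However, because of the failing drive, the transfer is currently so slow that copying everything is nearly impossible. On top of that, the disk bottleneck makes the load spike, which leaves the server unusable.
Is it possible to remove the failing drive without losing data and without any downtime? I don't mind if the RAID1 temporarily runs on a single drive, as long as the transfer can finish as quickly as possible without these delays.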
Answer 1
You can manually mark the drive as failed with mdadm, like this:
mdadm --manage /dev/md0 --fail /dev/sda1
Then you can remove the drive from the array:
mdadm --manage /dev/md0 --remove /dev/sda1
Repeat this for all arrays (md1 with /dev/sda2 and md2 with /dev/sda3, going by your /proc/mdstat).
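To avoid typing each pair by hand, the repetition can be sketched as a small dry-run helper. This is only an illustration: `plan_removal` is a hypothetical function name, it assumes the failing disk is `sda` (as the kernel log suggests), and it only prints the commands rather than running them, so they can be reviewed first.

```shell
#!/bin/sh
# Dry run: print the mdadm commands, do not execute them.
# plan_removal is a hypothetical helper, not part of mdadm.
# Usage: plan_removal DISK < mdstat-text
plan_removal() {
    awk -v disk="$1" '
        # Array lines look like: "md0 : active raid1 sdb1[1] sda1[0]"
        $1 ~ /^md[0-9]+$/ && $2 == ":" {
            for (i = 3; i <= NF; i++) {
                part = $i
                sub(/\[.*/, "", part)   # strip the "[0]" role suffix
                if (part ~ ("^" disk "[0-9]+$")) {
                    printf "mdadm --manage /dev/%s --fail /dev/%s\n", $1, part
                    printf "mdadm --manage /dev/%s --remove /dev/%s\n", $1, part
                }
            }
        }
    '
}

# Sample input copied from the /proc/mdstat output in the question.
plan_removal sda <<'EOF'
md0 : active raid1 sdb1[1] sda1[0]
md1 : active raid1 sdb2[1] sda2[0]
md2 : active raid1 sdb3[1] sda3[0]
EOF
```

On the real server the input would be `plan_removal sda < /proc/mdstat`. Printing instead of executing is a deliberate choice here, since failing the wrong member of a two-disk mirror is not something to discover after the fact.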
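This will leave each array running on only one drive, which should hopefully let you back up the data from the remaining good drive.
Or
Replace the failed/failing drive with a spare and rebuild the arrays: mirror the partition layout from the good drive onto the new one, then add the new partitions to the md devices so that the arrays resync.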
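As a rough sketch of that second option, the helper below prints one possible command sequence. Several things are assumptions rather than facts from this thread: `plan_rebuild` is a hypothetical function name, the replacement disk is assumed to appear as `sda` again with `sdb` as the surviving disk, the disks are assumed to use an MBR partition table (which is what `sfdisk -d` dumps), and the sample input is an invented picture of what the degraded arrays might look like. Nothing is executed, the commands are only printed.

```shell
#!/bin/sh
# Dry run: print the rebuild commands, do not execute them.
# plan_rebuild is a hypothetical helper, not part of mdadm.
# Usage: plan_rebuild GOOD_DISK NEW_DISK < mdstat-text
plan_rebuild() {
    good="$1"
    new="$2"
    # Copy the partition layout from the good disk to the replacement.
    echo "sfdisk -d /dev/$good | sfdisk /dev/$new"
    # Add the matching new partition to every array that uses the good disk.
    awk -v good="$good" -v new="$new" '
        $1 ~ /^md[0-9]+$/ && $2 == ":" {
            for (i = 3; i <= NF; i++) {
                part = $i
                sub(/\[.*/, "", part)   # strip the "[1]" role suffix
                if (part ~ ("^" good "[0-9]+$")) {
                    num = substr(part, length(good) + 1)
                    printf "mdadm --manage /dev/%s --add /dev/%s%s\n", $1, new, num
                }
            }
        }
    '
}

# Hypothetical degraded state after sda has been removed from every array.
plan_rebuild sdb sda <<'EOF'
md0 : active raid1 sdb1[1]
md1 : active raid1 sdb2[1]
md2 : active raid1 sdb3[1]
EOF
```

Once the partitions have been added, the resync progress shows up in /proc/mdstat.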
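However, the common saying "RAID is not a backup" still applies: it would have been prudent to back up the array's contents well before a disk failure became imminent, although that isn't particularly helpful to you right now.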