Tonight I received this message generated by mdadm on my server:
This is an automatically generated mail message from mdadm
A DegradedArray event had been detected on md device /dev/md3.
Faithfully yours, etc.
P.S. The /proc/mdstat file currently contains the following:
Personalities : [raid1]
md4 : active raid1 sdb4[0] sda4[1]
474335104 blocks [2/2] [UU]
md3 : active raid1 sdb3[2](F) sda3[1]
10000384 blocks [2/1] [_U]
md2 : active (auto-read-only) raid1 sdb2[0] sda2[1]
4000064 blocks [2/2] [UU]
md1 : active raid1 sdb1[0] sda1[1]
48064 blocks [2/2] [UU]
I removed /dev/sdb3 from /dev/md3 and re-added it. It rebuilt for a while and then became a spare, so now I have this status:
cat /proc/mdstat
Personalities : [raid1]
md4 : active raid1 sdb4[0] sda4[1]
474335104 blocks [2/2] [UU]
md3 : active raid1 sdb3[2](S) sda3[1]
10000384 blocks [2/1] [_U]
md2 : active (auto-read-only) raid1 sdb2[0] sda2[1]
4000064 blocks [2/2] [UU]
md1 : active raid1 sdb1[0] sda1[1]
48064 blocks [2/2] [UU]
and
mdadm -D /dev/md3
/dev/md3:
Version : 0.90
Creation Time : Sat Jun 28 14:47:58 2008
Raid Level : raid1
Array Size : 10000384 (9.54 GiB 10.24 GB)
Used Dev Size : 10000384 (9.54 GiB 10.24 GB)
Raid Devices : 2
Total Devices : 2
Preferred Minor : 3
Persistence : Superblock is persistent
Update Time : Sun Sep 4 16:30:46 2011
State : clean, degraded
Active Devices : 1
Working Devices : 2
Failed Devices : 0
Spare Devices : 1
UUID : 1c32c34a:52d09232:fc218793:7801d094
Events : 0.7172118
Number Major Minor RaidDevice State
0 0 0 0 removed
1 8 3 1 active sync /dev/sda3
2 8 19 - spare /dev/sdb3
Here are the latest entries from /var/log/messages:
Sep 4 16:15:45 ogw2 kernel: [1314646.950806] md: unbind<sdb3>
Sep 4 16:15:45 ogw2 kernel: [1314646.950820] md: export_rdev(sdb3)
Sep 4 16:17:00 ogw2 kernel: [1314721.977950] md: bind<sdb3>
Sep 4 16:17:00 ogw2 kernel: [1314722.011058] RAID1 conf printout:
Sep 4 16:17:00 ogw2 kernel: [1314722.011064] --- wd:1 rd:2
Sep 4 16:17:00 ogw2 kernel: [1314722.011070] disk 0, wo:1, o:1, dev:sdb3
Sep 4 16:17:00 ogw2 kernel: [1314722.011073] disk 1, wo:0, o:1, dev:sda3
Sep 4 16:17:00 ogw2 kernel: [1314722.012667] md: recovery of RAID array md3
Sep 4 16:17:00 ogw2 kernel: [1314722.012673] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
Sep 4 16:17:00 ogw2 kernel: [1314722.012677] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
Sep 4 16:17:00 ogw2 kernel: [1314722.012684] md: using 128k window, over a total of 10000384 blocks.
Sep 4 16:20:25 ogw2 kernel: [1314927.480582] md: md3: recovery done.
Sep 4 16:20:27 ogw2 kernel: [1314929.252395] ata2.00: configured for UDMA/133
Sep 4 16:20:27 ogw2 kernel: [1314929.260419] ata2.01: configured for UDMA/133
Sep 4 16:20:27 ogw2 kernel: [1314929.260437] ata2: EH complete
Sep 4 16:20:29 ogw2 kernel: [1314931.068402] ata2.00: configured for UDMA/133
Sep 4 16:20:29 ogw2 kernel: [1314931.076418] ata2.01: configured for UDMA/133
Sep 4 16:20:29 ogw2 kernel: [1314931.076436] ata2: EH complete
Sep 4 16:20:30 ogw2 kernel: [1314932.884390] ata2.00: configured for UDMA/133
Sep 4 16:20:30 ogw2 kernel: [1314932.892419] ata2.01: configured for UDMA/133
Sep 4 16:20:30 ogw2 kernel: [1314932.892436] ata2: EH complete
Sep 4 16:20:32 ogw2 kernel: [1314934.828390] ata2.00: configured for UDMA/133
Sep 4 16:20:32 ogw2 kernel: [1314934.836397] ata2.01: configured for UDMA/133
Sep 4 16:20:32 ogw2 kernel: [1314934.836413] ata2: EH complete
Sep 4 16:20:34 ogw2 kernel: [1314936.776392] ata2.00: configured for UDMA/133
Sep 4 16:20:34 ogw2 kernel: [1314936.784403] ata2.01: configured for UDMA/133
Sep 4 16:20:34 ogw2 kernel: [1314936.784419] ata2: EH complete
Sep 4 16:20:36 ogw2 kernel: [1314938.760392] ata2.00: configured for UDMA/133
Sep 4 16:20:36 ogw2 kernel: [1314938.768395] ata2.01: configured for UDMA/133
Sep 4 16:20:36 ogw2 kernel: [1314938.768422] sd 1:0:0:0: [sda] Unhandled sense code
Sep 4 16:20:36 ogw2 kernel: [1314938.768426] sd 1:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Sep 4 16:20:36 ogw2 kernel: [1314938.768431] sd 1:0:0:0: [sda] Sense Key : Medium Error [current] [descriptor]
Sep 4 16:20:36 ogw2 kernel: [1314938.768438] Descriptor sense data with sense descriptors (in hex):
Sep 4 16:20:36 ogw2 kernel: [1314938.768441] 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
Sep 4 16:20:36 ogw2 kernel: [1314938.768454] 01 ac b6 4a
Sep 4 16:20:36 ogw2 kernel: [1314938.768459] sd 1:0:0:0: [sda] Add. Sense: Unrecovered read error - auto reallocate failed
Sep 4 16:20:36 ogw2 kernel: [1314938.768468] sd 1:0:0:0: [sda] CDB: Read(10): 28 00 01 ac b5 f8 00 03 80 00
Sep 4 16:20:36 ogw2 kernel: [1314938.768527] ata2: EH complete
Sep 4 16:20:38 ogw2 kernel: [1314940.788406] ata2.00: configured for UDMA/133
Sep 4 16:20:38 ogw2 kernel: [1314940.796394] ata2.01: configured for UDMA/133
Sep 4 16:20:38 ogw2 kernel: [1314940.796415] ata2: EH complete
Sep 4 16:20:40 ogw2 kernel: [1314942.728391] ata2.00: configured for UDMA/133
Sep 4 16:20:40 ogw2 kernel: [1314942.736395] ata2.01: configured for UDMA/133
Sep 4 16:20:40 ogw2 kernel: [1314942.736413] ata2: EH complete
Sep 4 16:20:42 ogw2 kernel: [1314944.548391] ata2.00: configured for UDMA/133
Sep 4 16:20:42 ogw2 kernel: [1314944.556393] ata2.01: configured for UDMA/133
Sep 4 16:20:42 ogw2 kernel: [1314944.556414] ata2: EH complete
Sep 4 16:20:44 ogw2 kernel: [1314946.372392] ata2.00: configured for UDMA/133
Sep 4 16:20:44 ogw2 kernel: [1314946.380392] ata2.01: configured for UDMA/133
Sep 4 16:20:44 ogw2 kernel: [1314946.380411] ata2: EH complete
Sep 4 16:20:46 ogw2 kernel: [1314948.196391] ata2.00: configured for UDMA/133
Sep 4 16:20:46 ogw2 kernel: [1314948.204391] ata2.01: configured for UDMA/133
Sep 4 16:20:46 ogw2 kernel: [1314948.204411] ata2: EH complete
Sep 4 16:20:48 ogw2 kernel: [1314950.144390] ata2.00: configured for UDMA/133
Sep 4 16:20:48 ogw2 kernel: [1314950.152392] ata2.01: configured for UDMA/133
Sep 4 16:20:48 ogw2 kernel: [1314950.152416] sd 1:0:0:0: [sda] Unhandled sense code
Sep 4 16:20:48 ogw2 kernel: [1314950.152419] sd 1:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Sep 4 16:20:48 ogw2 kernel: [1314950.152424] sd 1:0:0:0: [sda] Sense Key : Medium Error [current] [descriptor]
Sep 4 16:20:48 ogw2 kernel: [1314950.152431] Descriptor sense data with sense descriptors (in hex):
Sep 4 16:20:48 ogw2 kernel: [1314950.152434] 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
Sep 4 16:20:48 ogw2 kernel: [1314950.152447] 01 ac b6 4a
Sep 4 16:20:48 ogw2 kernel: [1314950.152452] sd 1:0:0:0: [sda] Add. Sense: Unrecovered read error - auto reallocate failed
Sep 4 16:20:48 ogw2 kernel: [1314950.152461] sd 1:0:0:0: [sda] CDB: Read(10): 28 00 01 ac b6 48 00 00 08 00
Sep 4 16:20:48 ogw2 kernel: [1314950.152523] ata2: EH complete
Sep 4 16:20:48 ogw2 kernel: [1314950.575325] RAID1 conf printout:
Sep 4 16:20:48 ogw2 kernel: [1314950.575332] --- wd:1 rd:2
Sep 4 16:20:48 ogw2 kernel: [1314950.575337] disk 0, wo:1, o:1, dev:sdb3
Sep 4 16:20:48 ogw2 kernel: [1314950.575341] disk 1, wo:0, o:1, dev:sda3
Sep 4 16:20:48 ogw2 kernel: [1314950.575344] RAID1 conf printout:
Sep 4 16:20:48 ogw2 kernel: [1314950.575347] --- wd:1 rd:2
Sep 4 16:20:48 ogw2 kernel: [1314950.575350] disk 1, wo:0, o:1, dev:sda3
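So I don't understand why this device (sdb3) became a spare and why the RAID is not in sync...
Can anyone tell me what to do?
UPDATE: I forgot to say that /dev/md3 is mounted as the / (root) partition and contains all system directories except /boot.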
Answer 1
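It looks like MD kept the wrong device. sda is having problems: it threw an unrecoverable read error while a block was being read from it to resync sdb.
Has the data on sda changed since sdb was removed? If not, you may be in luck: even after the failed resync, the filesystem on sdb may still be in a consistent state, so get MD to assemble the array using sdb.
That is a bit of a long shot, though. More likely, you will have a nice opportunity to find out how well your backup strategy works.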
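For what it's worth, the failing sector can be read straight out of the kernel log quoted in the question. The sketch below assumes the trailing bytes of the descriptor sense data (`01 ac b6 4a`) are the big-endian LBA of the unreadable sector, which is how that field is normally laid out. `hex_bytes_to_dec` is just a helper name made up for this example.

```shell
# Convert space-separated big-endian hex bytes (as printed in the kernel log)
# into a decimal number. Assumption: the bytes form a single unsigned integer.
hex_bytes_to_dec() {
    joined=$(printf '%s' "$1" | tr -d ' ')   # "01 ac b6 4a" -> "01acb64a"
    printf '%d\n' "0x$joined"                # printf parses the 0x prefix as hex
}

hex_bytes_to_dec "01 ac b6 4a"   # prints 28096074
```

The second failed Read(10) in the log starts at LBA 0x01acb648 and covers 8 sectors, so that sector falls inside the request that failed.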
Answer 2
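Note that all of your MD arrays are at risk, not just the one that is "officially" degraded, because they are all built on the same two physical devices: sda and sdb. I sincerely hope you have proper backups and/or system recovery procedures in place in case things really go south. As Shane Madden noted, the resync log shows a worrying error, which may indicate that sda itself is not too healthy.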
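To see how far the exposure reaches, here is a rough sketch that lists every array with a member partition on a given disk. It assumes the /proc/mdstat layout shown in the question (array lines of the form `mdN : active raid1 sdb4[0] sda4[1]`), and `arrays_on_disk` is a name invented for this example. The heredoc just replays the question's output; on a live system you would pass /proc/mdstat as the second argument.

```shell
# Print the name of every md array that has a member partition on the given disk.
# Reads mdstat-formatted text from the named file(s), or from stdin if none are given.
arrays_on_disk() {
    disk=$1; shift
    awk -v disk="$disk" '
        /^md[0-9]+ *:/ {
            # Member fields look like "sda4[1]" or "sdb3[2](S)".
            for (i = 2; i <= NF; i++)
                if ($i ~ ("^" disk "[0-9]+\\[")) { print $1; break }
        }
    ' "$@"
}

arrays_on_disk sda <<'EOF'
Personalities : [raid1]
md4 : active raid1 sdb4[0] sda4[1]
474335104 blocks [2/2] [UU]
md3 : active raid1 sdb3[2](S) sda3[1]
10000384 blocks [2/1] [_U]
md2 : active (auto-read-only) raid1 sdb2[0] sda2[1]
4000064 blocks [2/2] [UU]
md1 : active raid1 sdb1[0] sda1[1]
48064 blocks [2/2] [UU]
EOF
# prints md4, md3, md2 and md1, one per line
```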
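The best thing to do is to pull sdb and replace it immediately. If you don't have a replacement on hand, order one as soon as possible (and perhaps use the time to make one last full backup of all the arrays while they are still intact!). Your replacement drive will need to be partitioned appropriately, and the partitions then added to your four arrays accordingly. Hopefully all goes well and all the arrays resync successfully.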
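As a rough sketch of the partition-and-add step, the function below only prints the commands so they can be reviewed before anything is run. It assumes the replacement shows up under the same name as the old disk, that it is at least as large as the surviving disk, and that mdN is built from partition N on each disk, as in the question's /proc/mdstat. `plan_replacement` is a made-up helper name. Double-check the direction of the sfdisk copy, since it overwrites the partition table on the target.

```shell
# Print (do not run) the commands for cloning the partition layout from the
# surviving disk to the replacement and adding the new partitions to the arrays.
# Assumption: md1..md4 use partitions 1..4, as in the question.
plan_replacement() {
    good=$1   # surviving disk, e.g. /dev/sda
    new=$2    # blank replacement disk, e.g. /dev/sdb
    echo "sfdisk -d $good | sfdisk $new"
    for n in 1 2 3 4; do
        echo "mdadm --manage /dev/md$n --add $new$n"
    done
}

plan_replacement /dev/sda /dev/sdb
```

Once the commands have actually been run, progress of each rebuild shows up in /proc/mdstat.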
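However, if Shane is right and further errors from the failing sda prevent a proper reassembly/resync, the next thing to try is to pull sda, replace it with the old (possibly still good) sdb, and see whether the combination of your old sdb and the new replacement drive reassembles and resyncs successfully.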
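Finally, if none of the above works, the last thing to try (before completely rebuilding and restoring the system) is replacing the drive controller. I have seen drive controllers go flaky and cause problems for otherwise healthy arrays. One way to test whether the controller is the cause of the MD errors is to put one of the "failed" drives into another Linux machine with a known-good controller and the mdadm tools. Since all the arrays are RAID1, the array on any single drive should be able to assemble into a usable (albeit degraded) state, after which you can check the filesystems, make backups, and so on.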
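A sketch of what that single-drive check could look like on the other machine, again printing the commands rather than running them. The device names are hypothetical (the drive will probably not be sda or sdb there), `plan_single_member_check` is a made-up helper name, and `fsck -n` is assumed to be a no-change check, which holds for ext-family filesystems.

```shell
# Print (do not run) commands for inspecting one RAID1 member on a rescue machine.
plan_single_member_check() {
    member=$1   # hypothetical member partition, e.g. /dev/sdc3
    array=$2    # md device to assemble it as, e.g. /dev/md3
    echo "mdadm --examine $member"                           # show the member's superblock
    echo "mdadm --assemble --run --readonly $array $member"  # start degraded, read-only
    echo "fsck -n $array"                                    # check without making changes
}

plan_single_member_check /dev/sdc3 /dev/md3
```

Starting the array read-only keeps the drive's contents untouched while you look at the filesystem and pull a backup off it.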