Yesterday, my hosting provider replaced the SATA cable of one of my hard drives. When my server came back up, cat /proc/mdstat
showed the following:
Personalities : [raid1]
md124 : active raid1 sda1[0]
4193268 blocks super 1.2 [2/1] [U_]
md125 : active (auto-read-only) raid1 sda2[0]
524276 blocks super 1.2 [2/1] [U_]
md126 : active (auto-read-only) raid1 sda3[0]
268434296 blocks super 1.2 [2/1] [U_]
md127 : active raid1 sda4[0]
2657109311 blocks super 1.2 [2/1] [U_]
md3 : active (auto-read-only) raid1 sdb4[1]
2657109311 blocks super 1.2 [2/1] [_U]
md2 : active raid1 sdb3[1]
268434296 blocks super 1.2 [2/1] [_U]
md1 : active (auto-read-only) raid1 sdb2[1]
524276 blocks super 1.2 [2/1] [_U]
md0 : active (auto-read-only) raid1 sdb1[1]
4193268 blocks super 1.2 [2/1] [_U]
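The mirrors have split into two sets here: md124 through md127 were assembled from the sda halves and md0 through md3 from the sdb halves, which diverged while the disk was offline. One way to see which half is newer is to compare the superblock event counters; a minimal sketch:
mdadm --examine /dev/sda4 /dev/sdb4 | grep -E '/dev/|Events|Update Time'
# the member with the higher event count holds the more recent data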
I booted into the rescue console and found that all arrays were degraded:
md3 : active (auto-read-only) raid1 sdb4[1]
2657109311 blocks super 1.2 [2/1] [_U]
md2 : active raid1 sdb3[1]
268434296 blocks super 1.2 [2/1] [_U]
md1 : active (auto-read-only) raid1 sdb2[1]
524276 blocks super 1.2 [2/1] [_U]
md0 : active (auto-read-only) raid1 sdb1[1]
4193268 blocks super 1.2 [2/1] [_U]
Then I added the missing drive to each array:
mdadm /dev/md0 -a /dev/sda1
mdadm /dev/md1 -a /dev/sda2
mdadm /dev/md2 -a /dev/sda3
mdadm /dev/md3 -a /dev/sda4
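While the arrays resync, the progress can be watched from /proc/mdstat or queried per array; a minimal sketch:
watch -n 5 cat /proc/mdstat   # live rebuild status for all arrays
mdadm --detail /dev/md3       # per-array view, including rebuild percentage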
The arrays then started recovering. Once that finished, I rebooted into the normal system, and the recovery started again. This time /dev/sdb
was marked as missing:
Personalities : [raid1]
md3 : active raid1 sda4[2] sdb4[3]
2657109311 blocks super 1.2 [2/1] [U_]
[===>.................] recovery = 17.1% (456317824/2657109311) finish=288.2min speed=127254K/sec
The recovery stopped after 3 hours, and now the drive is marked as a spare:
md3 : active raid1 sda4[2] sdb4[3](S)
2657109311 blocks super 1.2 [2/1] [U_]
md2 : active raid1 sda3[2] sdb3[1]
268434296 blocks super 1.2 [2/2] [UU]
md1 : active raid1 sda2[2] sdb2[1]
524276 blocks super 1.2 [2/2] [UU]
md0 : active raid1 sda1[2] sdb1[1]
4193268 blocks super 1.2 [2/2] [UU]
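The (S) suffix in /proc/mdstat is what marks sdb4 as a spare; mdadm reports the same thing in more detail. A quick check might look like:
mdadm --detail /dev/md3
# the device table at the end should list sdb4 with the state 'spare'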
So far I have not lost any data: I checked my own email account, and every mail I received before the server went down when the drive failed three days ago is still there.
How can I add the spare disk back into my RAID array /dev/md3
?
I found another question/answer similar to my problem here. Would doing this be safe, or could I lose data?:
mdadm --grow /dev/md3 --raid-devices=3
mdadm /dev/md3 --fail /dev/{failed drive}
mdadm /dev/md3 --remove /dev/{failed drive}
mdadm --grow /dev/md3 --raid-devices=2
Of course I have backups, but if I can avoid having to use them, I would rather do that.
EDIT: I just found a read error in dmesg
that may have occurred before the drive failed and was marked as a spare:
[17699.328298] ata1.00: irq_stat 0x40000008
[17699.328324] ata1.00: failed command: READ FPDMA QUEUED
[17699.328356] ata1.00: cmd 60/08:00:80:d8:05/00:00:ff:00:00/40 tag 0 ncq 4096 in
[17699.328358] res 51/40:08:80:d8:05/00:00:ff:00:00/40 Emask 0x409 (media error) <F>
[17699.328446] ata1.00: status: { DRDY ERR }
[17699.328471] ata1.00: error: { UNC }
[17699.332240] ata1.00: configured for UDMA/133
[17699.332281] sd 0:0:0:0: [sda] Unhandled sense code
[17699.332308] sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[17699.332342] sd 0:0:0:0: [sda] Sense Key : Medium Error [current] [descriptor]
[17699.332384] Descriptor sense data with sense descriptors (in hex):
[17699.332415] 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
[17699.332491] ff 05 d8 80
[17699.332528] sd 0:0:0:0: [sda] Add. Sense: Unrecovered read error - auto reallocate failed
[17699.332581] sd 0:0:0:0: [sda] CDB: Read(10): 28 00 ff 05 d8 80 00 00 08 00
[17699.332648] end_request: I/O error, dev sda, sector 4278573184
[17699.332689] ata1: EH complete
[17699.332737] raid1: sda: unrecoverable I/O read error for block 3732258944
[17699.377132] md: md3: recovery done.
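The end_request line names the failing LBA on sda. To check whether that sector is still unreadable, it can be probed directly; a sketch, assuming hdparm is available:
hdparm --read-sector 4278573184 /dev/sda
# an I/O error here confirms the medium error; a successful read
# suggests the drive has since reallocated the sector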
I had previously tested the drives with smartctl
:
smartctl -l selftest /dev/sda
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 3444 -
smartctl -l selftest /dev/sdb
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 3444 -
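A short offline self-test only reads a small portion of the disk, which would explain why it passed despite the medium error. An extended test scans the whole surface; a sketch:
smartctl -t long /dev/sda     # start an extended offline self-test (takes hours)
smartctl -l selftest /dev/sda # inspect the result once the test has finished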
But munin
reports a smartctl
exit code of 64, and smartctl -l error /dev/sda
shows:
=== START OF READ SMART DATA SECTION ===
SMART Error Log Version: 1
ATA Error Count: 552 (device log contains only the most recent five errors)
......
Error 552 occurred at disk power-on lifetime: 3444 hours (143 days + 12 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 08 80 d8 05 0f
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 08 00 80 d8 05 40 00 20:56:57.342 READ FPDMA QUEUED
ef 10 02 00 00 00 a0 00 20:56:57.342 SET FEATURES [Reserved for Serial ATA]
27 00 00 00 00 00 e0 00 20:56:57.342 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 a0 00 20:56:57.340 IDENTIFY DEVICE
ef 03 46 00 00 00 a0 00 20:56:57.340 SET FEATURES [Set transfer mode]
Error 551 occurred at disk power-on lifetime: 3444 hours (143 days + 12 hours)
When the command that caused the error occurred, the device was active or idle.
....
EDIT #2:
mdadm --examine /dev/sdb4
/dev/sdb4:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x0
Array UUID : 38dec3bf:770fb774:6e9a28d0:ff3eac4a
Name : rescue:3
Creation Time : Tue Feb 26 21:21:56 2013
Raid Level : raid1
Raid Devices : 2
Avail Dev Size : 5314218895 (2534.02 GiB 2720.88 GB)
Array Size : 5314218622 (2534.02 GiB 2720.88 GB)
Used Dev Size : 5314218622 (2534.02 GiB 2720.88 GB)
Data Offset : 2048 sectors
Super Offset : 8 sectors
State : clean
Device UUID : 83caa70a:6fe627f8:5a9a22d4:54a457f8
Update Time : Tue Jul 9 23:08:37 2013
Checksum : 7a729887 - correct
Events : 3478472
Device Role : spare
Array State : A. ('A' == active, '.' == missing)
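Before choosing between a --re-add and recreating the array, it can help to compare both members' superblocks; a minimal sketch:
mdadm --examine /dev/sda4 | grep -E 'Events|Update Time|Device Role'
mdadm --examine /dev/sdb4 | grep -E 'Events|Update Time|Device Role'
# 'Device Role : spare' is why sdb4 no longer joins the array as a
# full member; widely diverged event counts rule out a clean --re-add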
My hard drive has just been replaced:
Personalities : [raid1]
md2 : active raid1 sdb3[1]
268434296 blocks super 1.2 [2/1] [_U]
md1 : active raid1 sdb2[1]
524276 blocks super 1.2 [2/1] [_U]
md0 : active (auto-read-only) raid1 sdb1[1]
4193268 blocks super 1.2 [2/1] [_U]
I did not use a recovery tool, because I was fairly sure the data on /dev/sdb
was up to date until my server rebooted and the array broke. So I simply copied the partition table from /dev/sdb
to /dev/sda
and recreated the array:
# copy the partition table from /dev/sdb to /dev/sda
sgdisk -R /dev/sda /dev/sdb
# randomize the GUIDs so the two disks do not clash
sgdisk -G /dev/sda
# recreate the array from the up-to-date half, then add the new disk
mdadm --create /dev/md3 --level=1 --raid-devices=2 /dev/sdb4 missing
mdadm /dev/md3 -a /dev/sda4
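Recreating an array on top of a member that still carries data only preserves that data if the new superblock uses the same data offset as the old one, so it seems worth verifying the metadata and the filesystem before adding the second disk; a cautious sketch:
mdadm --examine /dev/sdb4 | grep -E 'Data Offset|Array Size'  # should match the values recorded above
fsck -n /dev/md3                                              # read-only filesystem check before the resync starts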
Well, I hope this rebuild completes this time.
Answer 1
I would hesitate to grow the array. You do not want a bigger array, so it is the wrong operation. It may be a roundabout way of achieving the same goal, but I find that sticking to the intended operations is a good philosophy unless there is no other way.
Try:
sudo mdadm --manage /dev/md3 --remove /dev/sdb4
sudo mdadm --manage /dev/md3 --re-add /dev/sdb4
and watch dmesg
for read/write errors on /dev/sda or /dev/sdb while it rebuilds.
It looks like /dev/sda
has bad sectors in /dev/sda4
. You should replace that drive. If the SMART status of /dev/sdb
looks good, the easiest way is to:
- get a new drive (I assume it will show up as /dev/sdc
)
- partition it the same way as /dev/sda
- then fail each /dev/sdaX
one by one and replace it with the matching /dev/sdcX
- let the arrays md0
through md2
rebuild from /dev/sdb
(see the sketch after this list)
md3
will be special, since mdadm
currently does not consider /dev/sdb4
part of that array.
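A sketch of that disk swap, assuming the replacement really does show up as /dev/sdc:
sgdisk -R /dev/sdc /dev/sda   # replicate sda's partition table onto the new disk
sgdisk -G /dev/sdc            # give the new disk and its partitions fresh GUIDs
mdadm /dev/md0 --fail /dev/sda1 --remove /dev/sda1
mdadm /dev/md0 --add /dev/sdc1
# repeat for md1 (sda2 -> sdc2) and md2 (sda3 -> sdc3), letting each
# resync finish before failing the next partition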
You can try using gddrescue
to copy /dev/sda4
onto /dev/sdc4
, and once that is done, try to assemble /dev/md3
:
sudo mdadm --assemble /dev/md3 /dev/sdc4 /dev/sdb4
and see whether it starts. If it does, fsck
the filesystem to check for errors, then remove and re-add sdb4 to start the resync. You will have some corrupted/missing/damaged files and will need to restore those from backup.
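The gddrescue copy mentioned above might look like the following; note the binary is called ddrescue, and the mapfile location is an assumption:
ddrescue -f /dev/sda4 /dev/sdc4 /root/sda4.map      # first pass, skipping unreadable areas quickly
ddrescue -f -r3 /dev/sda4 /dev/sdc4 /root/sda4.map  # retry the bad sectors up to three times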
If you cannot get a good copy of /dev/sda4
onto /dev/sdc4
, simply create a new array from /dev/sdc4
and /dev/sdb4
and restore everything from backup.