I have a four-drive RAID 10 array that just had a drive fail. I've never practiced recovering from a failure (I'm a programmer and only keep this server as a hobby), so I'm having to learn this the hard way.

Through Google and this site (thank you!) I managed to figure out how to fail, remove, add, and resync the new drive, but the resync keeps failing and the new disk just ends up marked as a spare.

More Googling and more time on the command line revealed that the remaining "good" drive actually has some bad sectors that throw read errors during the sync, which is why mdadm keeps aborting and marking the new disk as a spare.

I used badblocks to confirm the bad sectors exist (there seem to be quite a few), but I don't know whether those sectors are actually in use (so far I haven't found any corrupted data). I've also read that fsck can repair this, but I've read just as often that it can thoroughly wreck the whole drive, so I haven't tried it yet.
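For reference, the bad-sector check can be repeated non-destructively; this is a sketch using the device names from the layout below (smartctl comes from the smartmontools package, which may need installing):

```shell
# Read-only badblocks scan of the surviving mirror half (safe: no writes).
badblocks -sv /dev/sda1 > /root/sda1-badblocks.txt

# Cross-check against the drive's own SMART counters for pending and
# reallocated sectors.
smartctl -A /dev/sda | grep -E 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable'
```

A read-only scan reports the same failing sectors the resync trips over without risking further damage to the data.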
I tried using mdadm's --force flag to ignore these errors during the resync, but it doesn't seem to help.
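For completeness, the fail/remove/add cycle described above is usually spelled like this (a sketch using the device names from this setup; run as root, and double-check which drive is which before issuing anything):

```shell
mdadm /dev/md126 --fail /dev/sdb1     # mark the member as failed
mdadm /dev/md126 --remove /dev/sdb1   # detach it from the array
# ...swap the hardware and partition the new disk to match sda...
mdadm /dev/md126 --add /dev/sdb1      # add it back; resync starts automatically
watch cat /proc/mdstat                # follow the resync progress
```

The resync itself needs no separate command; it begins as soon as the new member is added.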
I have backups of all the critical data, but there's a lot of non-critical data that I'd really rather not lose if I can avoid it (it's all replaceable, but replacing it would take a long time). Also, my critical backups live in the cloud, so even though restoring them is straightforward, it would be very time-consuming.

On top of that, I have a brand-new, unused replacement drive on hand if one is needed.

Below is everything I know about the system; let me know if you need more! How do I fully rebuild this array?
Drive layout:

sda + sdb = RAID1A (md126)
sdc + sdd = RAID1B (md127)
md126 + md127 = RAID10 (md125)

The problem array is md126, the new out-of-sync drive is sdb, and the problem drive is sda.
root@vault:~# cat /proc/mdstat
Personalities : [raid1] [raid0] [linear] [multipath] [raid6] [raid5] [raid4] [raid10]
md125 : active raid0 md126p1[1] md127p1[0]
5860528128 blocks super 1.2 512k chunks
md126 : active raid1 sda1[1] sdb1[2](S)
2930265390 blocks super 1.2 [2/1] [U_]
md127 : active raid1 sdc1[1] sdd1[0]
2930265390 blocks super 1.2 [2/2] [UU]
unused devices: <none>
root@vault:~# parted -l
Model: ATA ST3000DM001-9YN1 (scsi)
Disk /dev/sda: 3001GB
Sector size (logical/physical): 512B/4096B
Partition Table: gpt
Disk Flags:
Number Start End Size File system Name Flags
1 17.4kB 3001GB 3001GB RAID: RAID1A raid
Model: ATA ST3000DM001-9YN1 (scsi)
Disk /dev/sdb: 3001GB
Sector size (logical/physical): 512B/4096B
Partition Table: gpt
Disk Flags:
Number Start End Size File system Name Flags
1 17.4kB 3001GB 3001GB RAID: RAID1A raid
Model: ATA ST3000DM001-1CH1 (scsi)
Disk /dev/sdc: 3001GB
Sector size (logical/physical): 512B/4096B
Partition Table: gpt
Disk Flags:
Number Start End Size File system Name Flags
1 17.4kB 3001GB 3001GB RAID: RAID1B raid
Model: ATA ST3000DM001-9YN1 (scsi)
Disk /dev/sdd: 3001GB
Sector size (logical/physical): 512B/4096B
Partition Table: gpt
Disk Flags:
Number Start End Size File system Name Flags
1 17.4kB 3001GB 3001GB RAID: RAID1B raid
root@vault:~# sudo mdadm --detail /dev/md126
/dev/md126:
Version : 1.2
Creation Time : Thu Nov 29 19:09:32 2012
Raid Level : raid1
Array Size : 2930265390 (2794.52 GiB 3000.59 GB)
Used Dev Size : 2930265390 (2794.52 GiB 3000.59 GB)
Raid Devices : 2
Total Devices : 2
Persistence : Superblock is persistent
Update Time : Thu Jun 2 11:53:44 2016
State : clean, degraded
Active Devices : 1
Working Devices : 2
Failed Devices : 0
Spare Devices : 1
Name : :RAID1A
UUID : 49293460:3199d164:65a039d6:a212a25e
Events : 5200173
Number Major Minor RaidDevice State
1 8 1 0 active sync /dev/sda1
2 0 0 2 removed
2 8 17 - spare /dev/sdb1
Edit: here is what the kernel log shows during the failed recovery.
root@vault:~# mdadm --assemble --update=resync --force /dev/md126 /dev/sda1 /dev/sdb1
root@vault:~# tail -f /var/log/kern.log
Jun 5 12:37:57 vault kernel: [151562.172914] RAID1 conf printout:
Jun 5 12:37:57 vault kernel: [151562.172917] --- wd:1 rd:2
Jun 5 12:37:57 vault kernel: [151562.172919] disk 0, wo:0, o:1, dev:sda1
Jun 5 12:37:57 vault kernel: [151562.172921] disk 1, wo:1, o:1, dev:sdb1
Jun 5 12:37:57 vault kernel: [151562.173858] md: recovery of RAID array md126
Jun 5 12:37:57 vault kernel: [151562.173861] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
Jun 5 12:37:57 vault kernel: [151562.173863] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
Jun 5 12:37:57 vault kernel: [151562.173865] md: using 128k window, over a total of 2930265390k.
Jun 5 12:37:57 vault kernel: [151562.248457] md126: p1
Jun 5 12:37:58 vault kernel: [151562.376906] md: bind<md126p1>
Jun 5 13:21:52 vault kernel: [154196.675777] ata3.00: exception Emask 0x0 SAct 0xffe00 SErr 0x0 action 0x0
Jun 5 13:21:52 vault kernel: [154196.675782] ata3.00: irq_stat 0x40000008
Jun 5 13:21:52 vault kernel: [154196.675785] ata3.00: failed command: READ FPDMA QUEUED
Jun 5 13:21:52 vault kernel: [154196.675791] ata3.00: cmd 60/00:48:a2:a4:e0/05:00:38:00:00/40 tag 9 ncq 655360 in
Jun 5 13:21:52 vault kernel: [154196.675791] res 41/40:00:90:a7:e0/00:05:38:00:00/00 Emask 0x409 (media error) <F>
Jun 5 13:21:52 vault kernel: [154196.675794] ata3.00: status: { DRDY ERR }
Jun 5 13:21:52 vault kernel: [154196.675797] ata3.00: error: { UNC }
Jun 5 13:21:52 vault kernel: [154196.695048] ata3.00: configured for UDMA/133
Jun 5 13:21:52 vault kernel: [154196.695077] sd 2:0:0:0: [sda] tag#9 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jun 5 13:21:52 vault kernel: [154196.695081] sd 2:0:0:0: [sda] tag#9 Sense Key : Medium Error [current] [descriptor]
Jun 5 13:21:52 vault kernel: [154196.695085] sd 2:0:0:0: [sda] tag#9 Add. Sense: Unrecovered read error - auto reallocate failed
Jun 5 13:21:52 vault kernel: [154196.695090] sd 2:0:0:0: [sda] tag#9 CDB: Read(16) 88 00 00 00 00 00 38 e0 a4 a2 00 00 05 00 00 00
Jun 5 13:21:52 vault kernel: [154196.695092] blk_update_request: I/O error, dev sda, sector 954247056
Jun 5 13:21:52 vault kernel: [154196.695111] ata3: EH complete
Jun 5 13:21:55 vault kernel: [154199.675248] ata3.00: exception Emask 0x0 SAct 0x1000000 SErr 0x0 action 0x0
Jun 5 13:21:55 vault kernel: [154199.675252] ata3.00: irq_stat 0x40000008
Jun 5 13:21:55 vault kernel: [154199.675255] ata3.00: failed command: READ FPDMA QUEUED
Jun 5 13:21:55 vault kernel: [154199.675261] ata3.00: cmd 60/08:c0:8a:a7:e0/00:00:38:00:00/40 tag 24 ncq 4096 in
Jun 5 13:21:55 vault kernel: [154199.675261] res 41/40:08:90:a7:e0/00:00:38:00:00/00 Emask 0x409 (media error) <F>
Jun 5 13:21:55 vault kernel: [154199.675264] ata3.00: status: { DRDY ERR }
Jun 5 13:21:55 vault kernel: [154199.675266] ata3.00: error: { UNC }
Jun 5 13:21:55 vault kernel: [154199.676454] ata3.00: configured for UDMA/133
Jun 5 13:21:55 vault kernel: [154199.676463] sd 2:0:0:0: [sda] tag#24 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jun 5 13:21:55 vault kernel: [154199.676467] sd 2:0:0:0: [sda] tag#24 Sense Key : Medium Error [current] [descriptor]
Jun 5 13:21:55 vault kernel: [154199.676471] sd 2:0:0:0: [sda] tag#24 Add. Sense: Unrecovered read error - auto reallocate failed
Jun 5 13:21:55 vault kernel: [154199.676474] sd 2:0:0:0: [sda] tag#24 CDB: Read(16) 88 00 00 00 00 00 38 e0 a7 8a 00 00 00 08 00 00
Jun 5 13:21:55 vault kernel: [154199.676477] blk_update_request: I/O error, dev sda, sector 954247056
Jun 5 13:21:55 vault kernel: [154199.676485] md/raid1:md126: sda: unrecoverable I/O read error for block 954244864
Jun 5 13:21:55 vault kernel: [154199.676488] ata3: EH complete
Jun 5 13:21:55 vault kernel: [154199.676597] md: md126: recovery interrupted.
Jun 5 13:21:55 vault kernel: [154199.855992] RAID1 conf printout:
Jun 5 13:21:55 vault kernel: [154199.855995] --- wd:1 rd:2
Jun 5 13:21:55 vault kernel: [154199.855998] disk 0, wo:0, o:1, dev:sda1
Jun 5 13:21:55 vault kernel: [154199.856000] disk 1, wo:1, o:1, dev:sdb1
Jun 5 13:21:55 vault kernel: [154199.872013] RAID1 conf printout:
Jun 5 13:21:55 vault kernel: [154199.872016] --- wd:1 rd:2
Jun 5 13:21:55 vault kernel: [154199.872018] disk 0, wo:0, o:1, dev:sda1
Answer 1
The key lines are:
Jun 5 13:21:55 vault kernel: [154199.676477] blk_update_request: I/O error, dev sda, sector 954247056
Jun 5 13:21:55 vault kernel: [154199.676485] md/raid1:md126: sda: unrecoverable I/O read error for block 954244864
Jun 5 13:21:55 vault kernel: [154199.676488] ata3: EH complete
Jun 5 13:21:55 vault kernel: [154199.676597] md: md126: recovery interrupted.
To recover md126, the kernel needs to copy sda1 over to sdb1 - but it is hitting read errors on sda, which is the only surviving half of that mirror. This array is toast. Time to take the shotgun to /dev/sda and restore from backup (or, if you don't have backups, to save as much as you can off the existing array before the shotgun).
Edit: if you want to pull data off the failing drive, the tool safecopy may be useful (disclaimer: I have no connection to the author or the project).
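As a sketch of how safecopy is typically staged (the image path is an assumption, not from the original post; each stage retries harder on the areas the previous one could not read):

```shell
# Rescue the failing mirror half to an image file, not to another array
# member, so repeated read attempts never touch the original data layout.
safecopy --stage1 /dev/sda1 /mnt/rescue/sda1.img   # fast pass, skip bad areas
safecopy --stage2 /dev/sda1 /mnt/rescue/sda1.img   # retry the skipped areas
safecopy --stage3 /dev/sda1 /mnt/rescue/sda1.img   # aggressive last attempt
```

The resulting image can then be loop-mounted read-only to copy files out.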
Answer 2
If you're brave (or desperate) enough, you could try this guy's approach: http://matrafox.info/how-to-get-md-raid-array-to-rebuild-even-if-read-errors.html

His hack forces the bad blocks on the "good" disk to be reallocated by writing junk over them. This shouldn't matter for your data, since the affected sectors are dead anyway and the disk controller will remap them to spare sectors.

Use at your own risk!
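A minimal sketch of that approach, using the sector number from the kernel log above. Per the parted output the drive is 512e (512 B logical / 4096 B physical sectors), so the whole 4 KiB physical sector is usually rewritten; hdparm addresses the raw disk, so the kernel's sector number can be used directly. The loop is destructive (it zeroes those 512-byte sectors) and is deliberately left commented out:

```shell
# Sector reported by the kernel:
# "blk_update_request: I/O error, dev sda, sector 954247056"
BAD=954247056
START=$(( BAD / 8 * 8 ))   # align down to the 4 KiB physical sector boundary
END=$(( START + 7 ))       # last 512 B logical sector in that physical sector
echo "would rewrite sectors $START..$END"

# Destructive part - uncomment only once you are sure of the device:
# for s in $(seq "$START" "$END"); do
#     hdparm --write-sector "$s" --yes-i-know-what-i-am-doing /dev/sda
# done
```

After the rewrite, restart the resync and watch the kernel log; if another sector fails, repeat with the newly reported number.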