I'm getting errors accessing part of a RAID array and need help sorting out the problem.
History: several RAID partitions live on 4 disks. Four days ago the workstation started making some ticking noises and the Ubuntu GUI Disk Utility reported a few bad sectors, but everything else looked fine. Yesterday (Thursday, April 18) we had a power failure and a hard reboot. After the hard reboot the system came up and mounted most of the RAID partitions, but one large, critical partition (containing /home) now gives an input/output error.
bpbrown@eguzki:/$ ls home
ls: cannot access home: Input/output error
bpbrown@eguzki:/$
We're on Ubuntu 12.04, and with /home gone we're limited to the command line.
After the reboot, mdadm showed the array resyncing; that appears to have finished, but /home is still inaccessible. Here is the output from mdadm:
bpbrown@eguzki:/$ sudo mdadm -D /dev/md10
/dev/md10:
Version : 0.90
Creation Time : Thu Feb 4 16:49:43 2010
Raid Level : raid5
Array Size : 2868879360 (2735.98 GiB 2937.73 GB)
Used Dev Size : 956293120 (911.99 GiB 979.24 GB)
Raid Devices : 4
Total Devices : 4
Preferred Minor : 10
Persistence : Superblock is persistent
Update Time : Fri Apr 19 10:03:46 2013
State : clean
Active Devices : 4
Working Devices : 4
Failed Devices : 0
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 64K
UUID : 317df11d:4e2edc70:fa3efedc:498284d3
Events : 0.2121101
Number Major Minor RaidDevice State
0 8 10 0 active sync /dev/sda10
1 8 26 1 active sync /dev/sdb10
2 8 42 2 active sync /dev/sdc10
3 8 58 3 active sync /dev/sdd10
bpbrown@eguzki:/$
And here is mdstat:
bpbrown@eguzki:/$ cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid10]
md1 : active raid1 sda1[0] sdb1[1] sdc1[2] sdd1[3]
497856 blocks [4/4] [UUUU]
md8 : active raid5 sda8[0] sdb8[1] sdc8[2] sdd8[3]
5301120 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
md6 : active raid5 sda6[0] sdb6[1] sdc6[2] sdd6[3]
20530752 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
md7 : active raid5 sda7[0] sdc7[2] sdd7[3] sdb7[1]
5301120 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
md5 : active raid5 sda5[0] sdd5[3] sdc5[2] sdb5[1]
5301120 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
md10 : active raid5 sda10[0] sdc10[2] sdd10[3] sdb10[1]
2868879360 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
unused devices: <none>
bpbrown@eguzki:/$
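Given the ticking noises and bad-sector reports from before the power failure, it may also be worth checking the SMART health of each member disk before doing anything destructive. A minimal sketch, assuming smartmontools is installed (the attribute names grepped for are the usual suspects, not output I captured):

for d in /dev/sda /dev/sdb /dev/sdc /dev/sdd; do
    echo "== $d =="
    sudo smartctl -H "$d"                                              # overall health verdict
    sudo smartctl -A "$d" | grep -iE 'reallocated|pending|uncorrect'   # key failure-related attributes
done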
Unmounting and remounting /dev/md10 doesn't seem to help, though I may be missing a step needed to mount the RAID array properly.
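In case it matters, the remount attempts were roughly along these lines (a sketch, using the device and mount point from the fstab below; I may well be missing an option):

sudo umount /home               # or: sudo umount /dev/md10
sudo mount /dev/md10 /home      # mount the array directly on /home
sudo mount -a                   # or let /etc/fstab drive the mount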
In case it helps, here are the contents of /etc/fstab:
bpbrown@eguzki:/$ more /etc/fstab
# /etc/fstab: static file system information.
#
# <file system> <mount point> <type> <options> <dump> <pass>
proc /proc proc defaults 0 0
/dev/md5 / reiserfs relatime 0 1
/dev/md1 /boot reiserfs notail,relatime 0 2
/dev/md10 /home xfs relatime 0 2
/dev/md8 /tmp reiserfs relatime 0 2
/dev/md6 /usr reiserfs relatime 0 2
/dev/md7 /var reiserfs relatime 0 2
/dev/sda9 none swap pri=1 0 0
/dev/sdb9 none swap pri=1 0 0
/dev/sdc9 none swap pri=1 0 0
/dev/sdd9 none swap pri=1 0 0
/dev/scd0 /media/cdrom0 udf,iso9660 user,noauto,exec,utf8 0 0
bpbrown@eguzki:/$
Update, April 23: tried mounting the filesystem directly again and got an error message that may be useful. Here is a shortened version, with some of the call trace omitted:
bpbrown@eguzki:/$ dmesg | tail
[ 788.335968] XFS (md10): Mounting Filesystem
[ 788.516845] XFS (md10): Starting recovery (logdev: internal)
[ 790.082900] XFS: Internal error XFS_WANT_CORRUPTED_GOTO at line 1503 of file /build/buildd/linux-3.2.0/fs/xfs/xfs_alloc.c. Caller 0xffffffffa0226837
[ 790.082905]
[ 790.083004] Pid: 3211, comm: mount Tainted: P O 3.2.0-38-generic #61-Ubuntu
[ 790.083010] Call Trace:
<omitted for brevity>
[ 790.084139] XFS (md10): xfs_do_force_shutdown(0x8) called from line 3729 of file /build/buildd/linux-3.2.0/fs/xfs/xfs_bmap.c. Return address = 0xffffffffa0236e52
[ 790.217602] XFS (md10): Corruption of in-memory data detected. Shutting down filesystem
[ 790.217654] XFS (md10): Please umount the filesystem and rectify the problem(s)
[ 790.217761] XFS (md10): xfs_imap_to_bp: xfs_trans_read_buf() returned error 5.
[ 790.217775] XFS (md10): xlog_recover_clear_agi_bucket: failed to clear agi 5. Continuing.
<last 2 lines repeat 8 times>
[ 790.388209] XFS (md10): Ending recovery (logdev: internal)
bpbrown@eguzki:/$
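Since the kernel log asks for the filesystem to be unmounted and the problem rectified, a read-only check of the XFS filesystem looks like the obvious next step. A sketch, assuming /dev/md10 is not mounted:

sudo xfs_check /dev/md10        # read-only consistency check from xfsprogs
sudo xfs_repair -n /dev/md10    # no-modify mode: report problems without writing anything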
Thanks in advance for any suggestions on how to proceed,
--Ben
Answer 1
It turns out the root problem here really was corruption of the XFS filesystem during the unclean power loss. Worse, the XFS filesystem had a dirty (unreplayed) log, which produced the following warning:
bpbrown@eguzki:/$ sudo xfs_check /dev/md10
ERROR: The filesystem has valuable metadata changes in a log which needs to
be replayed. Mount the filesystem to replay the log, and unmount it before
re-running xfs_repair. If you are unable to mount the filesystem, then use
the -L option to destroy the log and attempt a repair.
Note that destroying the log may cause corruption -- please attempt a mount
of the filesystem before doing this.
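In other words, the recommended order is: mount the filesystem so the log gets replayed, unmount it, and run a normal xfs_repair; only if the mount itself fails should the log be destroyed with -L. Roughly (a sketch using the device and mount point from the question):

sudo mount /dev/md10 /home && sudo umount /home    # preferred: replay the log via a successful mount
sudo xfs_repair /dev/md10                          # then repair offline
sudo xfs_repair -L /dev/md10                       # last resort: zero the log and repair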
Mounting still failed, so we went ahead with xfs_repair -L. It worked quickly (under 5 minutes), and despite the scary warnings the /home partition could be mounted and read immediately afterwards.
bpbrown@eguzki:/$ sudo xfs_repair -L /dev/md10
Phase 1 - find and verify superblock...
Phase 2 - using internal log
- scan filesystem freespace and inode maps...
agi unlinked bucket 34 is 50978 in ag 1 (inode=536921890)
<...>
Phase 7 - verify and correct link counts...
resetting inode 97329 nlinks from 2 to 3
resetting inode 536921890 nlinks from 0 to 2
done
bpbrown@eguzki:/$
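Remounting and spot-checking the partition afterwards was uneventful; something like the following was enough to confirm it (a sketch):

sudo mount /dev/md10 /home      # or: sudo mount /home, via the fstab entry
df -h /home                     # filesystem mounts with the expected size
ls /home                        # and the directory listing works again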
As far as we can tell, the system is running normally and hasn't suffered any critical data loss.
Cray turned out to have some useful documentation on xfs_check and xfs_repair for newcomers like me, so I'm including the link in case anyone else hits these problems for the first time:
http://docs.cray.com/books/S-2377-22/html-S-2377-22/z1029470303.html
Cheers, and thanks to everyone who read this and offered ideas,
--Ben