I recently added a new disk to a RAID5 array and started growing it. In a moment of absent-mindedness I rebooted the server during the reshape, because another program had hung and was blocking some ports. In hindsight that hang may itself have been caused by the array hanging, but I can't be sure.
I started the grow with the following command:
$ mdadm --grow --raid-devices=4 /dev/md0
After the reboot, the reshape is frozen at 28%. I can no longer mount the array, stop it, or do anything else with it; it appears to be completely stuck.
Here is some information about the array:
# mdadm -D /dev/md0
/dev/md0:
        Version : 1.2
  Creation Time : Sat Mar 28 17:31:15 2015
     Raid Level : raid5
     Array Size : 5860063744 (5588.59 GiB 6000.71 GB)
  Used Dev Size : 2930031872 (2794.30 GiB 3000.35 GB)
   Raid Devices : 4
  Total Devices : 4
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Sun Jun  7 11:04:28 2015
          State : clean, reshaping
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 256K

 Reshape Status : 28% complete
  Delta Devices : 1, (3->4)

           Name : ocular:0  (local to host ocular)
           UUID : e1f7a83b:2e43c552:84d09d04:b1416cb2
         Events : 344582

    Number   Major   Minor   RaidDevice State
       4       8       17        0      active sync   /dev/sdb1
       1       8       49        1      active sync   /dev/sdd1
       3       8       65        2      active sync   /dev/sde1
       5       8       33        3      active sync   /dev/sdc1
and
# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 sdb1[4] sdc1[5] sde1[3] sdd1[1]
5860063744 blocks super 1.2 level 5, 256k chunk, algorithm 2 [4/4] [UUUU]
[=====>...............] reshape = 28.6% (840259584/2930031872) finish=524064.9min speed=66K/sec
bitmap: 3/22 pages [12KB], 65536KB chunk
unused devices: <none>
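For scale, the absurd finish estimate in that mdstat line follows directly from the remaining blocks and the reported speed; a quick back-of-the-envelope check (plain Python, numbers taken from the output above):

```python
# Numbers from the /proc/mdstat line above (positions in 1K blocks, speed in K/sec).
done, total = 840259584, 2930031872   # reshape position / per-device size
speed = 66                            # K/sec -- effectively stalled

remaining_min = (total - done) / speed / 60
print(f"{remaining_min:.0f} minutes remaining")  # ~528000 min, i.e. about a year
```

This lands in the same ballpark as the `finish=524064.9min` the kernel prints (it averages the speed slightly differently), which is just another way of saying the reshape is making essentially no progress.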
Trying to mount the array just hangs:
# mount /dev/md0 /mnt/storage/
The same thing happens if I try to stop the array:
# mdadm -S /dev/md0
I also tried reshaping it back to 3 devices, but it is busy with the ongoing reshape:
# mdadm --grow /dev/md0 --raid-devices=3
mdadm: /dev/md0 is performing resync/recovery and cannot be reshaped
I tried marking the new drive as failed to see whether the reshape would stop, but to no avail. Marking it as failed succeeded, but nothing else happened.
I also tried running a check instead of the reshape (since I read somewhere that this fixed a similar problem), but the device is busy:
# echo check>/sys/block/md0/md/sync_action
-bash: echo: write error: Device or resource busy
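That write error is the kernel answering the write to `sync_action` with EBUSY — md refuses to start a new sync action while the reshape is still registered as in flight. The message text maps to the standard errno (a trivial check, not from the original post):

```python
import errno
import os

# "Device or resource busy" is the standard strerror text for EBUSY.
print(errno.EBUSY, os.strerror(errno.EBUSY))
```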
What does this mean? I'm in a really precarious situation now and don't know what to do, so any help is greatly appreciated.
EDIT
I'm fairly sure the reboot is not the cause of the problem. It looks like something went wrong during the reshape itself, causing the array to hang. I get the following error in dmesg:
[ 360.625322] INFO: task md0_reshape:126 blocked for more than 120 seconds.
[ 360.625351] Not tainted 4.0.4-2-ARCH #1
[ 360.625367] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 360.625394] md0_reshape D ffff88040af57a58 0 126 2 0x00000000
[ 360.625397] ffff88040af57a58 ffff88040cf58000 ffff8800da535b20 00000001642a9888
[ 360.625399] ffff88040af57fd8 ffff8800da429000 ffff8800da429008 ffff8800da429208
[ 360.625401] 0000000096400e00 ffff88040af57a78 ffffffff81576707 ffff8800da429000
[ 360.625403] Call Trace:
[ 360.625410] [<ffffffff81576707>] schedule+0x37/0x90
[ 360.625428] [<ffffffffa0120de9>] get_active_stripe+0x5c9/0x760 [raid456]
[ 360.625432] [<ffffffff810b6c70>] ? wake_atomic_t_function+0x60/0x60
[ 360.625436] [<ffffffffa01246e0>] reshape_request+0x5b0/0x980 [raid456]
[ 360.625439] [<ffffffff81579053>] ? schedule_timeout+0x123/0x250
[ 360.625443] [<ffffffffa011743f>] sync_request+0x28f/0x400 [raid456]
[ 360.625449] [<ffffffffa00da486>] ? is_mddev_idle+0x136/0x170 [md_mod]
[ 360.625454] [<ffffffffa00de4ba>] md_do_sync+0x8ba/0xe70 [md_mod]
[ 360.625457] [<ffffffff81576002>] ? __schedule+0x362/0xa30
[ 360.625462] [<ffffffffa00d9e54>] md_thread+0x144/0x150 [md_mod]
[ 360.625464] [<ffffffff810b6c70>] ? wake_atomic_t_function+0x60/0x60
[ 360.625468] [<ffffffffa00d9d10>] ? md_start_sync+0xf0/0xf0 [md_mod]
[ 360.625471] [<ffffffff81093418>] kthread+0xd8/0xf0
[ 360.625473] [<ffffffff81093340>] ? kthread_worker_fn+0x170/0x170
[ 360.625476] [<ffffffff8157a398>] ret_from_fork+0x58/0x90
[ 360.625478] [<ffffffff81093340>] ? kthread_worker_fn+0x170/0x170
Also, looking at CPU usage, something seems wrong with md0_raid5:
PID USER PR NI VIRT RES %CPU %MEM TIME+ S COMMAND
125 root 20 0 0.0m 0.0m 100.0 0.0 35:57.44 R `- md0_raid5
126 root 20 0 0.0m 0.0m 0.0 0.0 0:00.06 D `- md0_reshape
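A note on the `D` in that snapshot: it is the kernel's uninterruptible-sleep state, which is why signals (and hence stop requests) never reach md0_reshape. Purely as an illustration (these helper names are my own, not from the post), D-state tasks can be enumerated by scanning /proc:

```python
import os

def task_state(stat_line):
    """Parse pid, comm, and state from a /proc/<pid>/stat line.
    comm may itself contain spaces or parentheses, so locate it by the
    first '(' and the last ')'."""
    i, j = stat_line.index('('), stat_line.rindex(')')
    pid = int(stat_line[:i])
    comm = stat_line[i + 1:j]
    state = stat_line[j + 1:].split()[0]
    return pid, comm, state

def d_state_tasks():
    """(pid, comm) of every task in uninterruptible sleep ('D'),
    the state md0_reshape is stuck in above."""
    hung = []
    for pid in filter(str.isdigit, os.listdir('/proc')):
        try:
            with open(f'/proc/{pid}/stat') as f:
                p, comm, state = task_state(f.read())
        except OSError:
            continue  # task exited while we were scanning
        if state == 'D':
            hung.append((p, comm))
    return hung

print(d_state_tasks())
```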
Could this be the reason the reshape has stopped?
Is it possible to go back to using 3 drives without losing data?