I have a raid5 array that gets a consistency check once a month. It is configured so that the check starts at 01:00, runs for 6 hours, and then stops. The following night it resumes the check for another 6 hours, and so on until it completes.
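For reference, this kind of schedule is normally driven by the systemd timer units that ship with mdadm; the unit names below are the upstream ones and the 6-hour duration mirrors this setup, so treat the exact names and values as assumptions to verify on your own system:

systemctl list-timers 'mdcheck*'                    # see which mdcheck timers are installed
sudo systemctl enable --now mdcheck_start.timer     # starts the monthly check (01:00 here)
sudo systemctl enable --now mdcheck_continue.timer  # resumes an interrupted check the next night
# Underneath, the service runs the script with a bounded duration, roughly:
# /usr/share/mdadm/mdcheck --duration "6 hours"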
The problem I'm having is that sometimes, when mdcheck tries to stop the running check, it hangs. Once that has happened, the array can still be read, but any attempt to write to it causes the writing process to hang.
The array status looks like this:
md0 : active raid5 sdb1[4] sdc1[2] sdd1[5] sde1[1]
8790398976 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] [UUUU]
[========>............] check = 44.2% (1296999956/2930132992) finish=216065.8min speed=125K/sec
bitmap: 0/6 pages [0KB], 262144KB chunk
The check = 44.2% (1296999956/2930132992) counter never advances, and the check never stops.
Looking at the script /usr/share/mdadm/mdcheck, every 2 minutes until the end time it reads /sys/block/md0/md/sync_completed and saves the position in a file stored under /var/lib/mdcheck/. Looking in that directory, the file is there, dated 2 minutes before the stop time, with the value 2588437040. The current sync_completed value is 2593999912, which shows everything was still progressing 2 minutes before the stop time.
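For context, the part of the script that matters here reduces to a loop like the following (a simplified paraphrase with hypothetical variable names, not the verbatim script):

# Simplified paraphrase of mdcheck's checkpoint loop:
while [ "$(date +%s)" -lt "$end_time" ]; do
    sleep 120
    # sync_completed reads like "2593999912 / 5860265984" (sectors done / total)
    awk '{print $1}' /sys/block/md0/md/sync_completed > "$checkpoint_file"
done
echo idle > /sys/block/md0/md/sync_action   # the stop request - this is the write that hangs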
Running lsof on the mdcheck process reveals the following:
mdcheck 23887 root 1w REG 0,21 4096 43388 /sys/devices/virtual/block/md0/md/sync_action
This seems to indicate that the mdcheck process hung while trying to stop the check after the 6 hours. I confirmed this by running the following in a terminal:
sudo echo idle >/sys/devices/virtual/block/md0/md/sync_action
This hung as well.
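To see where such a write is blocked inside the kernel, a few standard diagnostics can help (PID 23887 is the mdcheck process from the lsof output above; these are ordinary kernel facilities, not mdcheck-specific):

ps -o pid,stat,wchan:30,cmd -p 23887    # STAT 'D' means uninterruptible sleep in the kernel
sudo cat /proc/23887/stack              # kernel stack showing where the process is stuck
echo w | sudo tee /proc/sysrq-trigger   # ask the kernel to log all blocked tasks
sudo dmesg | tail -n 60                 # read the resulting backtraces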
The only way I have found to stop the check is to attempt a reboot, which also hangs, and then power the machine off.
How can I stop/unhang mdcheck (and the array) without rebooting, and how can I find out what is causing this (and fix it)?
Additional information:
OS: OpenSUSE Leap 15.2
Kernel: 5.3.18-lp152.57-default
Running the consistency check continuously, without interruption, completes successfully (a full check started by hand is sketched after this list).
Running extended self-tests on the disks completes successfully.
Replacing all the SATA cables had no effect.
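For completeness, a full uninterrupted check can be started and watched by hand through the standard md sysfs interface, bypassing mdcheck's time window entirely:

echo check | sudo tee /sys/block/md0/md/sync_action   # start a full data-check
watch -n 60 cat /proc/mdstat                          # run this way, it completes without hanging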
Relevant dmesg entries:
[ 5.565328] md/raid:md0: device sdb1 operational as raid disk 3
[ 5.565330] md/raid:md0: device sdc1 operational as raid disk 2
[ 5.565331] md/raid:md0: device sdd1 operational as raid disk 0
[ 5.565332] md/raid:md0: device sde1 operational as raid disk 1
[ 5.575520] md/raid:md0: raid level 5 active with 4 out of 4 devices, algorithm 2
[ 5.640309] md0: detected capacity change from 0 to 9001368551424
[53004.024693] md: data-check of RAID array md0
[74605.665890] md: md0: data-check interrupted.
[139404.408605] md: data-check of RAID array md0
[146718.260616] md: md0: data-check done.
[1867115.595820] md: data-check of RAID array md0
Output of mdadm --detail /dev/md0:
Version : 1.2
Creation Time : Sat Nov 7 09:48:15 2020
Raid Level : raid5
Array Size : 8790398976 (8.19 TiB 9.00 TB)
Used Dev Size : 2930132992 (2.73 TiB 3.00 TB)
Raid Devices : 4
Total Devices : 4
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Tue Feb 2 06:59:55 2021
State : active, checking
Active Devices : 4
Working Devices : 4
Failed Devices : 0
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 512K
Consistency Policy : bitmap
Check Status : 44% complete
Name : neptune:0 (local to host neptune)
UUID : 5dd490df:79bf70fa:b4b530bc:47b30419
Events : 28109
Number Major Minor RaidDevice State
5 8 49 0 active sync /dev/sdd1
1 8 65 1 active sync /dev/sde1
2 8 33 2 active sync /dev/sdc1
4 8 17 3 active sync /dev/sdb1
Output of mdadm --examine /dev/sdb1 (all disks are essentially identical):
/dev/sdb1:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x1
Array UUID : 5dd490df:79bf70fa:b4b530bc:47b30419
Name : neptune:0 (local to host neptune)
Creation Time : Sat Nov 7 09:48:15 2020
Raid Level : raid5
Raid Devices : 4
Avail Dev Size : 5860266895 sectors (2.73 TiB 3.00 TB)
Array Size : 8790398976 KiB (8.19 TiB 9.00 TB)
Used Dev Size : 5860265984 sectors (2.73 TiB 3.00 TB)
Data Offset : 264192 sectors
Super Offset : 8 sectors
Unused Space : before=264112 sectors, after=911 sectors
State : clean
Device UUID : a40bb655:70a88240:06dfad1d:f7fcbdca
Internal Bitmap : 8 sectors from superblock
Update Time : Tue Feb 2 06:59:55 2021
Bad Block Log : 512 entries available at offset 16 sectors
Checksum : 42b3d6 - correct
Events : 28109
Layout : left-symmetric
Chunk Size : 512K
Device Role : Active device 3
Array State : AAAA ('A' == active, '.' == missing, 'R' == replacing)
Answer 1
This is possibly a known bug in the md driver.
If this is indeed your problem, then you can try the following workaround (replacing md1 with md0/md2/etc. as appropriate):
echo active | sudo tee /sys/block/md1/md/array_state
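A hedged note based on reports of that bug rather than on this exact kernel: the array tends to be stuck with a pending superblock write, so array_state can be used to confirm the diagnosis and verify that the workaround took effect:

cat /sys/block/md0/md/array_state   # often reports 'write-pending' while the bug is in effect
# ...apply the workaround above, then:
cat /sys/block/md0/md/array_state   # should return to 'active' or 'active-idle'
cat /proc/mdstat                    # blocked writers and the pending 'echo idle' should now complete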