How can I diagnose a "frozen" Linux software RAID device?


I have a server running Linux 3.2.12 (32-bit, i686) with 13 drives: one boot drive and three RAID5 devices of four drives each.

/proc/mdstat shows:

Personalities : [raid1] [raid10] [raid6] [raid5] [raid4] 
md2 : active raid5 sdd1[3] sdc1[2] sdb1[1] sda1[0]
    5860535808 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]

md1 : active raid5 sdk1[3] sdj1[2] sdi1[1] sdh1[0]
    4395407808 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]

md3 : active raid5 sdl1[0] sdm1[1] sdf1[3] sde1[2]
    5860535808 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]

unused devices: <none>

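My problem is that, for the second time in three days, one of the RAID devices has caused every process that tries to read from it to lock up. No signal can kill these processes, and I have to reboot to get things working again. After the reboot, however, the drives appear fine, the RAID status appears fine, and the kernel log contains no useful error messages other than the hung-task reports.

I have run smartctl on all of the drives in question, and they look fine.

What else can I check to try to diagnose this problem?

Below are excerpts from the kernel log that look interesting. Note, however, that the "sending ioctl 1261 to a partition!" messages have always been there, and searching suggests they are a harmless warning.

Every 900 seconds: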

...
Aug 20 18:34:01 [kernel] [  931.249505] mdadm: sending ioctl 1261 to a partition!
Aug 20 18:49:01 [kernel] [ 1831.302297] scsi_verify_blk_ioctl: 2 callbacks suppressed
Aug 20 18:49:01 [kernel] [ 1831.302300] mdadm: sending ioctl 1261 to a partition!
Aug 20 18:49:01 [kernel] [ 1831.302302] mdadm: sending ioctl 1261 to a partition!
Aug 20 18:49:01 [kernel] [ 1831.302774] mdadm: sending ioctl 1261 to a partition!
Aug 20 18:49:01 [kernel] [ 1831.302776] mdadm: sending ioctl 1261 to a partition!
Aug 20 18:49:02 [kernel] [ 1831.333538] mdadm: sending ioctl 1261 to a partition!
Aug 20 18:49:02 [kernel] [ 1831.333540] mdadm: sending ioctl 1261 to a partition!
Aug 20 18:49:02 [kernel] [ 1831.358068] mdadm: sending ioctl 1261 to a partition!
Aug 20 18:49:02 [kernel] [ 1831.358071] mdadm: sending ioctl 1261 to a partition!
Aug 20 18:49:02 [kernel] [ 1831.414331] mdadm: sending ioctl 1261 to a partition!
Aug 20 18:49:02 [kernel] [ 1831.414334] mdadm: sending ioctl 1261 to a partition!
Aug 20 19:04:01 [kernel] [ 2731.070794] scsi_verify_blk_ioctl: 2 callbacks suppressed
...

When the problem occurred:

Aug 21 13:38:32 [kernel] [69601.312055] INFO: task kjournald:26008 blocked for more than 600 seconds.
Aug 21 13:38:32 [kernel] [69601.312057] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 21 13:38:32 [kernel] [69601.312059] kjournald       D 00000000     0 26008      2 0x00000000
Aug 21 13:38:32 [kernel] [69601.312063]  eb5ccc80 00000046 00000000 00000000 00000000 e81e0070 e81e020c f6205900
Aug 21 13:38:32 [kernel] [69601.312068]  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Aug 21 13:38:32 [kernel] [69601.312072]  00000000 00000000 00000000 00000000 00000000 00000001 c0b66230 e81e0280
Aug 21 13:38:32 [kernel] [69601.312077] Call Trace:
Aug 21 13:38:32 [kernel] [69601.312083]  [<c013cbe5>] ? prepare_to_wait+0x15/0x55
Aug 21 13:38:32 [kernel] [69601.312088]  [<c0217df5>] ? journal_commit_transaction+0xdb/0xca6
Aug 21 13:38:32 [kernel] [69601.312090]  [<c013ca68>] ? wake_up_bit+0x16/0x16
Aug 21 13:38:32 [kernel] [69601.312093]  [<c0132c3d>] ? lock_timer_base+0x19/0x35
Aug 21 13:38:32 [kernel] [69601.312095]  [<c0132cb8>] ? try_to_del_timer_sync+0x5f/0x65
Aug 21 13:38:32 [kernel] [69601.312098]  [<c021ade6>] ? kjournald+0xa6/0x1a2
Aug 21 13:38:32 [kernel] [69601.312101]  [<c013ca68>] ? wake_up_bit+0x16/0x16
Aug 21 13:38:32 [kernel] [69601.312103]  [<c021ad40>] ? journal_grab_journal_head+0x31/0x31
Aug 21 13:38:32 [kernel] [69601.312106]  [<c013c778>] ? kthread+0x65/0x6a
Aug 21 13:38:32 [kernel] [69601.312108]  [<c013c713>] ? kthread_stop+0x47/0x47
Aug 21 13:38:32 [kernel] [69601.312111]  [<c0830b36>] ? kernel_thread_helper+0x6/0xd

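Answer 1

First, upgrade your kernel. That kernel contains a bug that causes various ioctls to print these warnings (and possibly fail) in some mdraid and LVM configurations.

If fixing the kernel does not resolve the problem, run an extended self-test on all drives. Note that each drive's self-test may take several hours and slightly degrades performance while it runs, so schedule it for a time when the system is lightly loaded. For example, to schedule the self-tests to start at 11 pm: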

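# Quote the here-document delimiter so $drive is expanded when the job runs, not when it is queued.
at 11 pm <<'JOB'
for drive in /dev/sd?
do
    smartctl -t long "$drive" || :
done
JOB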

Later the next day, check the test results:

for drive in /dev/sd?
do
    echo "Test results for drive $drive"
    smartctl -l selftest "$drive" || :
done

If the kernel update did not fix the problem, you will probably find a drive that fails its self-test.
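
With 13 drives, the self-test logs can be filtered rather than read by eye. The following is a minimal sketch, not part of the original answer: `failed_selftests` and `report_failed_selftests` are hypothetical helper names, and the filter assumes the usual ATA self-test log layout, in which each entry line starts with `#` followed by the entry number. It prints every entry whose status is anything other than "Completed without error", so tests that are still in progress or were aborted are listed too.

```shell
# Read `smartctl -l selftest` output on stdin and print each log entry
# whose status is not "Completed without error".
# Assumption: entry lines look like "# 1  Extended offline  <status> ...".
failed_selftests() {
    awk '/^#[ ]*[0-9]+/ && !/Completed without error/ { print }'
}

# Hypothetical wrapper: apply the filter to every /dev/sd? drive (needs root).
report_failed_selftests() {
    for drive in /dev/sd?
    do
        result=$(smartctl -l selftest "$drive" 2>/dev/null | failed_selftests)
        if [ -n "$result" ]; then
            echo "Self-test entries needing attention on $drive:"
            echo "$result"
        fi
    done
}
```

Calling `report_failed_selftests` as root after the tests have finished prints nothing for drives whose logged tests all completed cleanly.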

Whether or not you find a drive that fails its self-test, check the drive attributes anyway.

for drive in /dev/sd?
do
    echo "Attributes for drive $drive"
    smartctl -A "$drive" || :
done

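Note that some of these attributes can indicate a problem even when they are not flagged as failing, so have an expert look at them, or attach them to your question.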
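
As a rough first pass before asking someone to review the full attribute tables, a few attributes that commonly point to media or cabling trouble can be pulled out automatically. This is only a sketch under stated assumptions: `flag_smart_attrs` and `check_all_drives` are hypothetical helper names, the attribute list is my own selection rather than anything from the answer, attribute names vary between drive vendors, and the filter assumes the standard `smartctl -A` table with the raw value in the tenth column. A nonzero raw value here is a reason to look closer, not proof of a failing drive.

```shell
# Read `smartctl -A` output on stdin and print NAME=RAW_VALUE for a few
# watched attributes whose raw value (assumed to be column 10) is nonzero.
flag_smart_attrs() {
    awk '$2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|UDMA_CRC_Error_Count)$/ && $10 + 0 > 0 {
        print $2 "=" $10
    }'
}

# Hypothetical wrapper: apply the filter to every /dev/sd? drive (needs root).
check_all_drives() {
    for drive in /dev/sd?
    do
        flagged=$(smartctl -A "$drive" 2>/dev/null | flag_smart_attrs)
        if [ -n "$flagged" ]; then
            echo "Attributes worth a closer look on $drive:"
            echo "$flagged"
        fi
    done
}
```

Running `check_all_drives` as root prints only the drives that have at least one nonzero watched attribute; the full `smartctl -A` output is still what an expert would want to see.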
