KVM guest on LVM... kernel: Buffer I/O error on device. Failing drive?

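I'm currently setting up a small KVM host to run a few VMs for a small business. The server has 2 drives in software md RAID 1, which I then set up as a PV in LVM. The guests and the host are all CentOS 6.4 64-bit.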

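The KVM guests' / partitions are disk images, but for one particular guest with higher I/O requirements I added a second HDD to the VM, which is a logical volume from the host's storage pool.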

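This evening I was running some very intensive I/O on that LV from within the guest, extracting a 60 GB multi-volume 7z archive. 7z errored out with E_FAIL about 1/5 of the way through. I then tried moving some files around on this LV-backed disk and was greeted with "cannot move ... Read-only file system". All devices had been mounted rw. I looked in /var/log/messages and saw the following errors...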

Nov 22 21:47:52 mail kernel: Buffer I/O error on device vdb1, logical block 47307631
Nov 22 21:47:52 mail kernel: lost page write due to I/O error on vdb1
Nov 22 21:47:52 mail kernel: Buffer I/O error on device vdb1, logical block 47307632
Nov 22 21:47:52 mail kernel: lost page write due to I/O error on vdb1
Nov 22 21:47:52 mail kernel: Buffer I/O error on device vdb1, logical block 47307633
Nov 22 21:47:55 mail kernel: end_request: I/O error, dev vdb, sector 378473448
Nov 22 21:47:55 mail kernel: end_request: I/O error, dev vdb, sector 378474456
Nov 22 21:47:55 mail kernel: end_request: I/O error, dev vdb, sector 378475464
Nov 22 21:47:55 mail kernel: JBD: Detected IO errors while flushing file data on vdb1
Nov 22 21:47:55 mail kernel: end_request: I/O error, dev vdb, sector 255779688
Nov 22 21:47:55 mail kernel: Aborting journal on device vdb1.
Nov 22 21:47:55 mail kernel: end_request: I/O error, dev vdb, sector 255596560
Nov 22 21:47:55 mail kernel: JBD: I/O error detected when updating journal superblock for vdb1.
Nov 22 21:48:06 mail kernel: __ratelimit: 20 callbacks suppressed
Nov 22 21:48:06 mail kernel: __ratelimit: 2295 callbacks suppressed
Nov 22 21:48:06 mail kernel: Buffer I/O error on device vdb1, logical block 47270479
Nov 22 21:48:06 mail kernel: lost page write due to I/O error on vdb1
Nov 22 21:48:06 mail kernel: Buffer I/O error on device vdb1, logical block 47271504
Nov 22 21:48:06 mail kernel: end_request: I/O error, dev vdb, sector 378116680
Nov 22 21:48:06 mail kernel: end_request: I/O error, dev vdb, sector 378157680
Nov 22 21:48:06 mail kernel: end_request: I/O error, dev vdb, sector 378432440
Nov 22 21:51:25 mail kernel: EXT3-fs (vdb1): error: ext3_journal_start_sb: Detected aborted journal
Nov 22 21:51:25 mail kernel: EXT3-fs (vdb1): error: remounting filesystem read-only
Nov 22 21:51:55 mail kernel: __ratelimit: 35 callbacks suppressed
Nov 22 21:51:55 mail kernel: __ratelimit: 35 callbacks suppressed
Nov 22 21:51:55 mail kernel: Buffer I/O error on device vdb1, logical block 64003824
Nov 22 21:51:55 mail kernel: Buffer I/O error on device vdb1, logical block 64003839
Nov 22 21:51:55 mail kernel: Buffer I/O error on device vdb1, logical block 256
Nov 22 21:51:55 mail kernel: Buffer I/O error on device vdb1, logical block 32
Nov 22 21:51:55 mail kernel: Buffer I/O error on device vdb1, logical block 64
Nov 22 21:51:55 mail kernel: end_request: I/O error, dev vdb, sector 6144
Nov 22 21:55:06 mail yum[19139]: Installed: lsof-4.82-4.el6.x86_64
Nov 22 21:59:47 mail kernel: __ratelimit: 1 callbacks suppressed
Nov 22 22:00:01 mail kernel: __ratelimit: 1 callbacks suppressed
Nov 22 22:00:01 mail kernel: Buffer I/O error on device vdb1, logical block 64003824
Nov 22 22:00:01 mail kernel: Buffer I/O error on device vdb1, logical block 512

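There is much more; the full excerpt is at http://pastebin.com/vH8SDrCg
Note the I/O error "when updating journal superblock", after which the volume was remounted read-only because the journal had been aborted.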
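For anyone reading along, the excerpt is easier to take in as a tally. The following is only a minimal sketch: it counts the two message shapes seen above per device and reports the byte range spanned by the failed requests. The sample lines are copied from the excerpt; the names `PATTERNS` and `summarize` are made up for illustration, and the patterns are not meant to cover every kernel message.

```python
import re
from collections import Counter

# A few lines copied from the guest's /var/log/messages excerpt above.
SAMPLE = """\
Nov 22 21:47:52 mail kernel: Buffer I/O error on device vdb1, logical block 47307631
Nov 22 21:47:52 mail kernel: lost page write due to I/O error on vdb1
Nov 22 21:47:55 mail kernel: end_request: I/O error, dev vdb, sector 378473448
Nov 22 21:47:55 mail kernel: Aborting journal on device vdb1.
Nov 22 21:51:25 mail kernel: EXT3-fs (vdb1): error: remounting filesystem read-only
Nov 22 21:51:55 mail kernel: end_request: I/O error, dev vdb, sector 6144
"""

# Illustrative patterns only; they match the message shapes seen in this excerpt.
PATTERNS = {
    "buffer_io": re.compile(r"Buffer I/O error on device (\w+), logical block (\d+)"),
    "end_request": re.compile(r"end_request: I/O error, dev (\w+), sector (\d+)"),
}


def summarize(text):
    """Count matching messages per (kind, device) and collect failed sectors."""
    counts = Counter()
    sectors = []
    for line in text.splitlines():
        for kind, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                counts[(kind, match.group(1))] += 1
                if kind == "end_request":
                    sectors.append(int(match.group(2)))
    return counts, sectors


counts, sectors = summarize(SAMPLE)
for (kind, dev), n in sorted(counts.items()):
    print(f"{kind:12s} {dev:5s} {n}")
if sectors:
    # end_request sectors are in 512-byte units.
    print("failed requests span bytes", min(sectors) * 512, "to", max(sectors) * 512)
```

Feeding it the full log instead of `SAMPLE` shows whether the failures cluster in one region or are spread over the whole of vdb. In the excerpt they range from sector 6144 to beyond sector 378 million.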

Now for a look at the host.

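  • cat /proc/mdstat returns UU for both RAID 1 arrays (boot and the main PV).

  • mdadm --detail shows state: clean and state: active for the two arrays respectively.

  • The usual LVM commands pvs, vgs and lvs return the following errors: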

  /dev/VolGroup00/lv_mail: read failed after 0 of 4096 at 262160711680: Input/output error
  /dev/VolGroup00/lv_mail: read failed after 0 of 4096 at 262160769024: Input/output error
  /dev/VolGroup00/lv_mail: read failed after 0 of 4096 at 0: Input/output error
  /dev/VolGroup00/lv_mail: read failed after 0 of 4096 at 4096: Input/output error
  VG         #PV #LV #SN Attr   VSize   VFree  
  VolGroup00   1   4   1 wz--n- 930.75g 656.38g
  • /var/log/messages on the host shows the following:
Nov 22 21:47:53 localhost kernel: device-mapper: snapshots: Invalidating snapshot: Unable to allocate exception.
Nov 22 22:11:04 localhost kernel: Buffer I/O error on device dm-3, logical block 0
Nov 22 22:11:04 localhost kernel: Buffer I/O error on device dm-3, logical block 1
Nov 22 22:11:04 localhost kernel: Buffer I/O error on device dm-3, logical block 2
Nov 22 22:11:04 localhost kernel: Buffer I/O error on device dm-3, logical block 3
Nov 22 22:11:04 localhost kernel: Buffer I/O error on device dm-3, logical block 0
Nov 22 22:11:04 localhost kernel: Buffer I/O error on device dm-3, logical block 64004095
Nov 22 22:11:04 localhost kernel: Buffer I/O error on device dm-3, logical block 64004095
Nov 22 22:11:04 localhost kernel: Buffer I/O error on device dm-3, logical block 0
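  • A short smartctl self-test found no problems on either physical disk. There are no worrying error counters in the SMART data either; most are 0 apart from power-on hours, spin-up time and temperature. Even the power-on hours are relatively low, around 150 days or so. I am currently running the long self-test.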
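To put the host-side numbers next to each other, here is a rough sketch that converts the failed LVM read offsets into 4096-byte block numbers and measures the gap between the first guest error and the snapshot message on the host. It rests on assumptions the logs do not confirm: that dm-3 is the device behind lv_mail, that block 64004095 is its last block, and that the guest and host clocks agree.

```python
from datetime import datetime

BLOCK_SIZE = 4096  # assumption: matches the 4096-byte reads in the LVM errors

# Byte offsets of the failed reads reported by the LVM tools on the host.
read_offsets = [262160711680, 262160769024, 0, 4096]

# Highest logical block reported for dm-3 in the host log.
last_block_dm3 = 64004095

# Assumption: dm-3 backs lv_mail and 64004095 is its final block.
device_bytes = (last_block_dm3 + 1) * BLOCK_SIZE

for offset in read_offsets:
    block = offset // BLOCK_SIZE
    from_end = device_bytes - offset
    print(f"offset {offset:>13d} -> block {block:>9d}, {from_end} bytes before the assumed end")

# Times copied from the two logs (both on Nov 22); assumes synchronized clocks.
FMT = "%H:%M:%S"
guest_first_error = datetime.strptime("21:47:52", FMT)
host_snapshot_invalidated = datetime.strptime("21:47:53", FMT)
gap = (host_snapshot_invalidated - guest_first_error).total_seconds()
print(f"snapshot invalidation logged {gap:.0f}s after the first guest error")
```

Under those assumptions the four failed reads fall on the first two blocks and within the last 64 KiB of the device, and the "Invalidating snapshot" message on the host appears within a second of the first guest error. Whether that timing matters is an interpretation; the logs themselves do not say.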

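So, based on all of this, how likely is it that this is the start of a drive failure? Is it worth running fsck or badblocks on the host? I don't want to cause a full kernel panic at this stage. I would have expected mdstat to show a failed array member by now, about 1 hour after the event.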

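This box is a dedicated server, so I have no physical access. I'll check the console via DRAC shortly, but I expect to see a bunch of I/O errors there. I don't have virtual media access, so I can't load systemrescuecd to carry out repairs, which makes me a bit wary of rebooting at this point.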
