来自 LVM 和服务器挂起的奇怪日志消息

来自 LVM 和服务器挂起的奇怪日志消息

我们的一台 OpenSUSE (11.3) 虚拟化服务器最近出现了一些稳定性问题。该服务器负载相当重,最多有 100 个虚拟域,全部在 LVM 快照上运行。我们的服务器突然挂起(仍然响应 ping,无法登录)。

服务器将大量来自 LVM 的消息写入日志,但我不确定这些消息是否正常。这是服务器挂起前的日志示例:

Oct 28 13:26:35 xen lvm[3384]: Another thread is handling an event. Waiting...
Oct 28 13:26:35 xen lvm[3384]: Another thread is handling an event. Waiting...
Oct 28 13:27:15 xen lvm[3384]: Another thread is handling an event. Waiting...
Oct 28 13:27:15 xen lvm[3384]: dm_task_run failed, errno = 22, Invalid argument
Oct 28 13:28:25 xen lvm[3384]: Another thread is handling an event. Waiting...
Oct 28 13:30:45 xen lvm[3384]: Another thread is handling an event. Waiting...
Oct 28 13:30:55 xen lvm[3384]: Another thread is handling an event. Waiting...
Oct 28 13:30:55 xen lvm[3384]: Another thread is handling an event. Waiting...
Oct 28 13:31:05 xen lvm[3384]: Another thread is handling an event. Waiting...
Oct 28 13:31:05 xen lvm[3384]: Another thread is handling an event. Waiting...
Oct 28 13:31:05 xen lvm[3384]: Another thread is handling an event. Waiting...
Oct 28 13:31:15 xen lvm[3384]: Another thread is handling an event. Waiting...
Oct 28 13:31:21 xen kernel: [76866.784142] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Oct 28 13:31:21 xen kernel: [76866.784158] sr 2:0:0:0: [sr0] CDB: Get event status notification: 4a 01 00 00 10 00 00 00 08 00
Oct 28 13:31:21 xen kernel: [76866.784183] ata3.00: cmd a0/00:00:00:08:00/00:00:00:00:00/a0 tag 0 pio 16392 in
Oct 28 13:31:21 xen kernel: [76866.784186]          res 40/00:02:00:08:00/00:00:00:00:00/a0 Emask 0x4 (timeout)
Oct 28 13:31:21 xen kernel: [76866.784192] ata3.00: status: { DRDY }
Oct 28 13:31:25 xen lvm[3384]: Another thread is handling an event. Waiting...
Oct 28 13:31:25 xen lvm[3384]: Another thread is handling an event. Waiting...
Oct 28 13:31:26 xen kernel: [76871.824061] ata3: link is slow to respond, please be patient (ready=0)
Oct 28 13:31:31 xen kernel: [76876.808057] ata3: device not ready (errno=-16), forcing hardreset
Oct 28 13:31:31 xen kernel: [76876.808080] ata3: soft resetting link
Oct 28 13:31:31 xen kernel: [76876.972455] ata3: nv_mode_filter: 0x739f&0x739f->0x739f, BIOS=0x7000 (0xc0000000) ACPI=0x701f (60:900:0x11)
Oct 28 13:31:31 xen kernel: [76876.988343] ata3.00: configured for UDMA/33
Oct 28 13:31:31 xen kernel: [76876.988971] ata3: EH complete
Oct 28 13:31:35 xen lvm[3384]: Another thread is handling an event. Waiting...
Oct 28 13:31:45 xen lvm[3384]: Another thread is handling an event. Waiting...
Oct 28 13:32:05 xen lvm[3384]: Another thread is handling an event. Waiting...
Oct 28 13:32:35 xen lvm[3384]: Another thread is handling an event. Waiting...
Oct 28 13:35:55 xen lvm[3384]: Another thread is handling an event. Waiting...
Oct 28 13:36:05 xen lvm[3384]: Another thread is handling an event. Waiting...
Oct 28 13:36:25 xen lvm[3384]: Another thread is handling an event. Waiting...

此时我有四个问题:

  1. 这些日志消息对于 LVM 来说正常吗?
  2. 为什么 dm_task_run 会在日志中打印有关无效参数的错误?
  3. 什么可能导致 SATA 链接重置?
  4. 这些是否暗示了什么可能导致服务器挂起?

编辑:下次挂断时我得到了更多信息:

Oct 30 01:59:25 xen lvm[3384]: Another thread is handling an event. Waiting...
Oct 30 02:01:15 xen kernel: [208260.160037] BUG: soft lockup - CPU#3 stuck for 67s! [qemu-dm:12987]
Oct 30 02:01:15 xen kernel: [208260.160053] BUG: soft lockup - CPU#4 stuck for 67s! [qemu-dm:26429]
Oct 30 02:01:15 xen kernel: [208260.160070] Modules linked in:
Oct 30 02:01:15 xen kernel: [208260.160074] Modules linked in: bridge stp llc bridge usbbk stp llc gntdev usbbk gntdev netbk netbk blkbk blkbk blkback hcd usbcore sd_mod dm_snapshot xenblk ehci_hcd cdrom xennet usbcore edd dm_mod dm_snapshot fan pata_amd xenblk sata_nv libata cdrom scsi_mod thermal xennet processor thermal_sys edd hwmon
Oct 30 02:01:15 xen kernel: [208260.160276]  dm_modCPU 4
Oct 30 02:01:15 xen kernel: [208260.160283] Modules linked in: fan bridge stp pata_amd llc usbbk gntdev netbk sata_nv blkbk blkback_pagemap libata blktap domctl xenbus_be scsi_mod evtchn nfs lockd thermal fscache nfs_acl auth_rpcgss processor sunrpc fuse loop thermal_sys tg3 pcspkr i2c_nforce2 hwmon i2c_core ghes 8250_pnp
Oct 30 02:01:15 xen kernel: [208260.160342] CPU 3  8250
Oct 30 02:01:15 xen kernel: [208260.160346] Modules linked in: shpchp bridge stp llc k10temp usbbk gntdev netbk hed blkbk blkback_pagemap blktap domctl sr_mod xenbus_be evtchn nfs forcedeth pci_hotplug lockd serial_core serio_raw sg fscache nfs_acl button auth_rpcgss sunrpc fuse usbhid hid linear ohci_hcd loop sd_mod ehci_hcd tg3 pcspkr i2c_nforce2 usbcore i2c_core ghes 8250_pnp dm_snapshot 8250 shpchp k10temp xenblk hed sr_mod cdrom forcedeth xennet edd dm_mod pci_hotplug serial_core fan serio_raw sg pata_amd button usbhid hid sata_nv libata linear scsi_mod ohci_hcd sd_mod thermal ehci_hcd processor thermal_sys hwmon usbcore dm_snapshot
Oct 30 02:01:15 xen kernel: [208260.160471]  xenblk
Oct 30 02:01:15 xen kernel: [208260.160477]  cdrom

答案1

我唯一能给出明确答案的问题是第一个问题:这条信息对于我的逻辑卷管理器。

谷歌搜索这条消息,除了此主题,其中某人的 LVM 因驱动器故障而挂起,显然 lvm 无法检测到驱动器因某种原因而发生故障(该主题中的后续帖子讨论了该原因,但我对多路径或 SAN 不够熟悉,无法理解发生了什么)。如果您不使用多路径,它可能不会对您有太大帮助。

根据我的经验,驱动器重置是驱动器故障的标志,但通常在内核决定尝试重置之前,故障驱动器会记录许多超时错误。尝试读取相关驱动器的 SMART 信息,看看驱动器是否认为它有问题?这也可能是驱动器或控制器固件问题,因为第一个日志似乎暗示驱动器突然停止响应,但也许超时错误在 lvm 消息流之前的某个地方丢失了。如果 ata3 不是新驱动器,则固件问题的可能性较小。有趣的是,驱动器在重置后以 UDMA/33 模式(旧 PATA 速度)模式注册,这意味着驱动器未正确重置,这可以解释为什么 LVM 保持挂起状态。

相关内容