我们必须对 VMWare 在我们的 Netapp 上使用的 NFS 数据存储执行实时卷大小调整。调整大小后,我们所有的 Windows VM 都运行正常。但是,我们的一些 Linux VM 出现了问题。
一些 Linux VM 停止响应。重新启动这些 VM 后,我在日志中找不到任何表明存在问题的信息。
然而,我确实在某些虚拟机上发现了此类日志消息:
May 29 14:56:02 rhel6-server-1314 kernel: INFO: task jbd2/dm-0-8:382 blocked for more than 120 seconds.
May 29 14:56:02 rhel6-server-1314 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 29 14:56:02 rhel6-server-1314 kernel: jbd2/dm-0-8 D 0000000000000000 0 382 2 0x00000000
May 29 14:56:02 rhel6-server-1314 kernel: ffff880037ce9c20 0000000000000046 ffff880037ce9be0 ffffffffa00041fc
May 29 14:56:02 rhel6-server-1314 kernel: ffff880037ce9b90 ffffffff81012b59 ffff880037ce9bd0 ffffffff8109b809
May 29 14:56:02 rhel6-server-1314 kernel: ffff880037ce1af8 ffff880037ce9fd8 000000000000f4e8 ffff880037ce1af8
May 29 14:56:02 rhel6-server-1314 kernel: Call Trace:
May 29 14:56:02 rhel6-server-1314 kernel: [<ffffffffa00041fc>] ? dm_table_unplug_all+0x5c/0x100 [dm_mod]
... rhel6-server-1314
May 29 14:56:02 rhel6-server-1314 kernel: [<ffffffff8100c140>] ? child_rip+0x0/0x20
May 29 14:56:02 rhel6-server-1314 kernel: INFO: task master:1674 blocked for more than 120 seconds.
May 29 14:56:02 rhel6-server-1314 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 29 14:56:02 rhel6-server-1314 kernel: master D 0000000000000000 0 1674 1 0x00000080
May 29 14:56:02 rhel6-server-1314 kernel: ffff88003d669958 0000000000000086 ffff88003d669918 ffffffffa00041fc
May 29 14:56:02 rhel6-server-1314 kernel: 0000000000000000 ffff880002216028 ffff880002215fc0 ffff88003fac2b78
May 29 14:56:02 rhel6-server-1314 kernel: ffff88003fac30f8 ffff88003d669fd8 000000000000f4e8 ffff88003fac30f8
May 29 14:56:02 rhel6-server-1314 kernel: Call Trace:
May 29 14:56:02 rhel6-server-1314 kernel: [<ffffffffa00041fc>] ? dm_table_unplug_all+0x5c/0x100 [dm_mod]
... rhel6-server-1314
May 29 14:56:02 rhel6-server-1314 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
May 29 14:56:02 rhel6-server-1314 kernel: INFO: task pickup:6197 blocked for more than 120 seconds.
May 29 14:56:02 rhel6-server-1314 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 29 14:56:02 rhel6-server-1314 kernel: pickup D 0000000000000000 0 6197 1674 0x00000080
May 29 14:56:02 rhel6-server-1314 kernel: ffff88003da95968 0000000000000086 ffff88003da95928 ffffffffa00041fc
May 29 14:56:02 rhel6-server-1314 kernel: ffff88003da95938 ffff8800022128a0 ffff88003da95908 ffffffff81127ed0
May 29 14:56:02 rhel6-server-1314 kernel: ffff88003d90da78 ffff88003da95fd8 000000000000f4e8 ffff88003d90da78
May 29 14:56:02 rhel6-server-1314 kernel: Call Trace:
May 29 14:56:02 rhel6-server-1314 kernel: [<ffffffffa00041fc>] ? dm_table_unplug_all+0x5c/0x100 [dm_mod]
... rhel6-server-1314
May 29 14:56:02 rhel6-server-1314 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
May 29 14:56:02 rhel6-server-1314 kernel: mptscsih: ioc0: attempting task abort! (sc=ffff880037bfd280)
May 29 14:56:02 rhel6-server-1314 kernel: sd 2:0:0:0: [sda] CDB: Write(10): 2a 00 03 14 e8 d0 00 00 18 00
May 29 14:56:02 rhel6-server-1314 kernel: mptscsih: ioc0: WARNING - Issuing Reset from mptscsih_IssueTaskMgmt!! doorbell=0x24000000
May 29 14:56:02 rhel6-server-1314 kernel: mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=ffff880037bfd280)
May 29 14:56:02 rhel6-server-1314 kernel: scsi target2:0:0: Beginning Domain Validation
May 29 14:56:02 rhel6-server-1314 kernel: scsi target2:0:0: Domain Validation skipping write tests
May 29 14:56:02 rhel6-server-1314 kernel: scsi target2:0:0: Ending Domain Validation
May 29 14:56:02 rhel6-server-1314 kernel: scsi target2:0:0: FAST-40 WIDE SCSI 80.0 MB/s ST (25 ns, offset 127)
我的问题:
- 有人知道这是什么原因造成的吗?
- 如果没有,我们还应该在哪里寻找线索?
- 最后,有人知道下次我们调整卷大小时如何缓解这种情况吗?
答案1
我认为,这只是一个 I/O 超时。
我在远程 NFS 数据存储上的 Linux VM 上遇到了这样的问题。NFS 太慢了,我们的一些 Linux VM 将其磁盘切换为只读模式(因此停止响应)。可能是在调整大小期间,您的 NFS 数据存储超载,这导致了问题。重启后 Linux VM 可以正常工作吗?
为了避免此类问题,并稍微提高 Linux 客户机的 I/O 性能,您可以尝试将所有客户的 I/O 调度程序切换为“noop”或“deadline”:
就我而言,即使使用“调度程序修复”,在负载最重的 Linux 客户机上,我们每周都会遇到一次这样的超时问题。为了解决这个问题,我们从 NFS 切换到 iSCSI(您也可以尝试优化 NFS 设置,如“rsize”、“wsize”、MTU 等,但就我而言这还不够),并尽可能减少客户机上的 I/O 操作。
答案2
如果这是 NetApp (或任何其他 NFS 服务器),确保ESXi 主机配置的 NFS 最佳实践已经到位。
对于 NFS 部署,我总是会对 NFS 心跳和超时设置进行一些调整。这可能适用于您的情况。请咨询您的存储工程师,了解针对您的设备的具体建议。