Guest OS sometimes hangs for several minutes when using a vSphere host with NFS storage

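I have a vSphere host with two datastores assigned: local storage and a 200 GB NFS datastore. The guest is Linux with a 3.10 kernel, 2 CPU cores, 4 GB of memory, and a 10 GB disk stored on the NFS datastore. The file system is ext4.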

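The network between the vSphere host and the NFS server is unstable and sometimes drops for several minutes. While the NFS network is down, the guest may hang and stop responding to any command: even a single ps command produces no output, and pressing Enter does not give a new line. The hang lasts from a few seconds to several minutes, and the kernel prints CPU stall messages. One of them is: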

[10952.770359] INFO: rcu_preempt self-detected stall on CPU { 0}  (t=6086 jiffies g=222282 c=222281 q=1695)
[10952.770360] sending NMI to all CPUs:
[10952.770367] NMI backtrace for cpu 0
[10952.770370] CPU: 0 PID: 73 Comm: scsi_eh_0 Tainted: G           O 3.10.20-rt14+ #7
[10952.770371] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 09/30/2014
[10952.770372] task: ffff88006e762170 ti: ffff88006e7626e0 task.ti: ffff88006e7626e0
[10952.770378] RIP: 0010:[<ffffffff8102eb5d>]  [<ffffffff8102eb5d>] native_apic_mem_write+0xc/0xe
[10952.770379] RSP: 0000:ffff88007fc03d88  EFLAGS: 00000046
[10952.770379] RAX: 0000000000000000 RBX: 0000000000000046 RCX: 0000000000000000
[10952.770380] RDX: 0000000000000200 RSI: 0000000000000c00 RDI: 0000000000000300
[10952.770380] RBP: ffff88007fc03d88 R08: 0000000000000028 R09: 0000000000000000
[10952.770381] R10: 0000000000008f40 R11: 00000000000007a3 R12: 0000000000000002
[10952.770381] R13: 0000000000000c00 R14: 0000000000000003 R15: ffff88007fc0bc10
[10952.770382] FS:  0000000000000000(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
[10952.770383] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[10952.770384] CR2: 00007f5f995b8be0 CR3: 000000000155b000 CR4: 00000000000407f0
[10952.770386] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[10952.770388] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[10952.770389] Stack:
[10952.770390]  ffff88007fc03db8 ffffffff8102ecaa 0000000000000002 ffff88007fc0c1e0
[10952.770391]  0000000000000000 ffffffff81951f40 ffff88007fc03dc8 ffffffff8102ed13
[10952.770392]  ffff88007fc03de8 ffffffff8102ef4b 0000000000002710 ffff88007fc0c1e0
[10952.770393] Call Trace:
[10952.770395]  <IRQ> 
[10952.770396]  [<ffffffff8102ecaa>] _flat_send_IPI_mask+0x74/0x9e
[10952.770398]  [<ffffffff8102ed13>] flat_send_IPI_mask+0x11/0x13
[10952.770399]  [<ffffffff8102ef4b>] flat_send_IPI_all+0x24/0x51
[10952.770403]  [<ffffffff8102c49e>] arch_trigger_all_cpu_backtrace+0x4f/0x79
[10952.770406]  [<ffffffff810bfc67>] rcu_check_callbacks+0x1e9/0x552
[10952.770409]  [<ffffffff81059471>] update_process_times+0x86/0xae
[10952.770411]  [<ffffffff8108f9eb>] ? tick_sched_do_timer+0x2f/0x2f
[10952.770413]  [<ffffffff8108f438>] tick_sched_handle+0x4d/0x5c
[10952.770414]  [<ffffffff8108fa25>] tick_sched_timer+0x3a/0x58
[10952.770416]  [<ffffffff8106f2c4>] __run_hrtimer+0x86/0x144
[10952.770418]  [<ffffffff8106fd9c>] hrtimer_interrupt+0x119/0x20f
[10952.770422]  [<ffffffff8154d84c>] smp_apic_timer_interrupt+0x77/0x8a
[10952.770424]  [<ffffffff8154c6af>] apic_timer_interrupt+0x6f/0x80
[10952.770425]  <EOI> 
[10952.770427]  [<ffffffff8154adea>] ? retint_restore_args+0x13/0x13
[10952.770440]  [<ffffffff81344b62>] ? io_serial_in+0x1b/0x20
[10952.770441]  [<ffffffff813449c5>] serial_port_in+0xd/0xf
[10952.770445]  [<ffffffff81346864>] serial8250_poll+0x2c/0xec
[10952.770446]  [<ffffffff81342956>] uartdrv_console_write+0x1ec/0x284
[10952.770450]  [<ffffffff81049bb6>] call_console_drivers.constprop.21+0xcd/0x121
[10952.770452]  [<ffffffff8104a8ff>] console_unlock+0x26e/0x314
[10952.770454]  [<ffffffff8104ae04>] vprintk_emit+0x45f/0x4f4
[10952.770458]  [<ffffffff815413de>] printk+0x54/0x56
[10952.770459]  [<ffffffff815493c6>] ? preempt_schedule+0x3c/0x61
[10952.770469]  [<ffffffff813ee471>] ata_dev_printk+0x65/0x67
[10952.770470]  [<ffffffff815497a7>] ? rt_spin_lock_slowlock+0x48/0x261
[10952.770473]  [<ffffffff813f99b4>] ata_eh_report+0x18a/0x8fe
[10952.770475]  [<ffffffff8109426b>] ? lock_release+0x176/0x1c2
[10952.770477]  [<ffffffff81093e7b>] ? lock_acquire+0xb5/0x112
[10952.770478]  [<ffffffff815497a7>] ? rt_spin_lock_slowlock+0x48/0x261
[10952.770480]  [<ffffffff813f7ab4>] ? speed_down_verdict_cb+0x2b/0x3f
[10952.770481]  [<ffffffff813f8484>] ? ata_ering_map+0x3f/0x5f
[10952.770482]  [<ffffffff813f9670>] ? ata_eh_link_autopsy+0x566/0x62d
[10952.770486]  [<ffffffff813fdd4e>] ? ata_sff_dev_classify+0xcc/0xcc
[10952.770488]  [<ffffffff813fdd4e>] ? ata_sff_dev_classify+0xcc/0xcc
[10952.770489]  [<ffffffff813fc27e>] ata_do_eh+0x30/0x98
[10952.770491]  [<ffffffff81405a58>] ? pci_write_config_word+0x19/0x19
[10952.770492]  [<ffffffff813fdd4e>] ? ata_sff_dev_classify+0xcc/0xcc
[10952.770494]  [<ffffffff813fdeb2>] ? ata_sff_softreset+0x164/0x164
[10952.770495]  [<ffffffff813fe08e>] ata_sff_error_handler+0xe6/0xef
[10952.770497]  [<ffffffff813fe42f>] ata_bmdma_error_handler+0xf0/0xf7
[10952.770498]  [<ffffffff813fbe2f>] ata_scsi_port_error_handler+0x25d/0x5b5
[10952.770499]  [<ffffffff813fc225>] ata_scsi_error+0x9e/0xc7
[10952.770507]  [<ffffffff8138077d>] scsi_error_handler+0xa0/0x3e3
[10952.770511]  [<ffffffff81075f8e>] ? need_resched+0x31/0x3d
[10952.770513]  [<ffffffff815490c7>] ? __schedule+0x45d/0x4a4
[10952.770514]  [<ffffffff813806dd>] ? scsi_eh_get_sense+0xa7/0xa7
[10952.770519]  [<ffffffff8106bfc6>] kthread+0xa2/0xaa
[10952.770521]  [<ffffffff8106bf24>] ? __kthread_parkme+0x65/0x65
[10952.770523]  [<ffffffff8154b962>] ret_from_fork+0x72/0xa0
[10952.770525]  [<ffffffff8106bf24>] ? __kthread_parkme+0x65/0x65
[10952.770536] Code: 45 00 31 c0 5b 41 5c 41 5d 41 5e 5d c3 83 c8 ff c3 90 55 48 89 e5 57 9d 0f 1f 44 00 00 5d c3 55 89 ff 48 89 e5 89 b7 00 a0 5f ff <5d> c3 55 89 ff 8b 87 00 a0 5f ff 48 89 e5 5d c3 55 48 8b 05 23 
[10952.770537] NMI backtrace for cpu 1
[10952.770540] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G           O 3.10.20-rt14+ #7
[10952.770540] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 09/30/2014
[10952.770541] task: ffff880076a10000 ti: ffff880076a10570 task.ti: ffff880076a10570
[10952.770550] RIP: 0010:[<ffffffff8103220b>]  [<ffffffff8103220b>] native_safe_halt+0x6/0x8
[10952.770551] RSP: 0000:ffff8800769fdee0  EFLAGS: 00000246
[10952.770552] RAX: 00000000ffffffed RBX: ffff880076a10570 RCX: 00000000ffffffff
[10952.770552] RDX: 0100000000000000 RSI: 0000000000000001 RDI: ffffffff810162e7
[10952.770553] RBP: ffff8800769fdee0 R08: 0000000000000000 R09: 0000000000000000
[10952.770554] R10: 0000000000000001 R11: 000000000000b44f R12: ffff880076a10570
[10952.770554] R13: ffff880076a10570 R14: ffff880076a10570 R15: 0000000000000000
[10952.770555] FS:  0000000000000000(0000) GS:ffff88007fd00000(0000) knlGS:0000000000000000
[10952.770556] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[10952.770556] CR2: 0000000001b6ab80 CR3: 00000000112c6000 CR4: 00000000000407f0
[10952.770560] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[10952.770561] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[10952.770562] Stack:
[10952.770563]  ffff8800769fdef0 ffffffff810162ec ffff8800769fdf00 ffffffff81016a5b
[10952.770564]  ffff8800769fdf30 ffffffff810872db 0000000000000000 0000000000000000
[10952.770565]  0000000000000000 0000000000000000 ffff8800769fdf48 ffffffff8153aca9
[10952.770566] Call Trace:
[10952.770573]  [<ffffffff810162ec>] default_idle+0x25/0x39
[10952.770577]  [<ffffffff81016a5b>] arch_cpu_idle+0x18/0x26
[10952.770580]  [<ffffffff810872db>] cpu_startup_entry+0x123/0x180
[10952.770585]  [<ffffffff8153aca9>] start_secondary+0x246/0x248
[10952.770596] Code: 48 89 e5 0f 09 5d c3 55 48 89 e5 9c 58 5d c3 55 48 89 e5 57 9d 5d c3 55 48 89 e5 fa 5d c3 55 48 89 e5 fb 5d c3 55 48 89 e5 fb f4 <5d> c3 55 48 89 e5 f4 5d c3 55 49 89 ca 49 89 d1 8b 07 48 89 e5 

And this one:

[10056.112139] ata1: link is slow to respond, please be patient (ready=0)
[10061.126051] ata1: device not ready (errno=-16), forcing hardreset
[10061.127471] ata1: soft resetting link
[10061.686537] ata1.00: configured for PIO0
[10061.687470] ata1.00: device reported invalid CHS sector 0
[10061.688712] ata1: EH complete
[10092.060204] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[10092.061797] ata1.00: failed command: WRITE MULTIPLE
[10092.062901] ata1.00: cmd c5/00:20:f0:b4:1f/00:00:00:00:00/e0 tag 0 pio 16384 out
[10092.062901]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
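
As a rough aid for reading the ata1 log above, the following sketch measures the time between "EH complete" and the next "exception" line. The helper names (`parse`, `gap_after_eh`) and the `SAMPLE` list are hypothetical and exist only for illustration; the two sample lines are copied from the log. The resulting gap of about 30 seconds is close to the 30-second default command timeout of the Linux SCSI disk layer, but that connection is an assumption and has not been confirmed on this guest.

```python
import re

# Matches the "[seconds.micros]" prefix the kernel puts on each log line.
STAMP = re.compile(r"^\[\s*(\d+\.\d+)\]\s*(.*)$")


def parse(lines):
    """Return (timestamp_seconds, message) pairs for lines with a kernel timestamp."""
    out = []
    for line in lines:
        m = STAMP.match(line.strip())
        if m:
            out.append((float(m.group(1)), m.group(2)))
    return out


def gap_after_eh(lines):
    """Seconds between the last 'EH complete' and the next 'exception' line, or None."""
    eh_done = None
    for ts, msg in parse(lines):
        if "EH complete" in msg:
            eh_done = ts
        elif "exception" in msg and eh_done is not None:
            return ts - eh_done
    return None


# Two lines copied from the ata1 log in this post.
SAMPLE = [
    "[10061.688712] ata1: EH complete",
    "[10092.060204] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen",
]

print(f"{gap_after_eh(SAMPLE):.1f} s")  # prints "30.4 s"
```

The same function can be run over a full dmesg capture to see whether every timeout in the log follows the same interval.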

I need the guest to keep responding to commands while the NFS network is disconnected. Do you have any suggestions?
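
For reference, one setting that can be inspected from inside the guest is the per-disk command timeout that Linux exposes in sysfs at /sys/block/<device>/device/timeout. The sketch below only reads those values. The function name `disk_timeouts` is hypothetical, and whether changing this timeout would keep the guest responsive during an NFS outage is an open assumption, not something verified here.

```python
from pathlib import Path


def disk_timeouts(sys_block="/sys/block"):
    """Map block device name -> command timeout in seconds, where sysfs exposes one."""
    result = {}
    root = Path(sys_block)
    if not root.is_dir():
        # Not a Linux system, or sysfs is not mounted.
        return result
    for dev in sorted(root.iterdir()):
        timeout_file = dev / "device" / "timeout"
        try:
            result[dev.name] = int(timeout_file.read_text().strip())
        except (OSError, ValueError):
            # Devices such as loop or ram disks have no timeout attribute.
            continue
    return result


if __name__ == "__main__":
    for name, secs in disk_timeouts().items():
        print(f"{name}: {secs} s")
```

Running it in the guest lists each disk with its current timeout, which can then be compared with how long the NFS outages last.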
