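I have a vSphere host with two datastores assigned: a local one and a 200 GB NFS one. The guest runs Linux with a 3.10 kernel, 2 cores, 4 GB of RAM, and a 10 GB disk stored on the NFS datastore, formatted as ext4.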
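The network between the vSphere host and the NFS server is unstable and sometimes drops for a few minutes. While the NFS network is down, the guest can hang and stop responding to any command: even a simple command such as ps gets no response, and pressing Enter does not produce a new line. This lasts from a few seconds to a few minutes, and the kernel prints CPU stall messages. One of them is: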
[10952.770359] INFO: rcu_preempt self-detected stall on CPU { 0} (t=6086 jiffies g=222282 c=222281 q=1695)
[10952.770360] sending NMI to all CPUs:
[10952.770367] NMI backtrace for cpu 0
[10952.770370] CPU: 0 PID: 73 Comm: scsi_eh_0 Tainted: G O 3.10.20-rt14+ #7
[10952.770371] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 09/30/2014
[10952.770372] task: ffff88006e762170 ti: ffff88006e7626e0 task.ti: ffff88006e7626e0
[10952.770378] RIP: 0010:[<ffffffff8102eb5d>] [<ffffffff8102eb5d>] native_apic_mem_write+0xc/0xe
[10952.770379] RSP: 0000:ffff88007fc03d88 EFLAGS: 00000046
[10952.770379] RAX: 0000000000000000 RBX: 0000000000000046 RCX: 0000000000000000
[10952.770380] RDX: 0000000000000200 RSI: 0000000000000c00 RDI: 0000000000000300
[10952.770380] RBP: ffff88007fc03d88 R08: 0000000000000028 R09: 0000000000000000
[10952.770381] R10: 0000000000008f40 R11: 00000000000007a3 R12: 0000000000000002
[10952.770381] R13: 0000000000000c00 R14: 0000000000000003 R15: ffff88007fc0bc10
[10952.770382] FS: 0000000000000000(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
[10952.770383] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[10952.770384] CR2: 00007f5f995b8be0 CR3: 000000000155b000 CR4: 00000000000407f0
[10952.770386] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[10952.770388] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[10952.770389] Stack:
[10952.770390] ffff88007fc03db8 ffffffff8102ecaa 0000000000000002 ffff88007fc0c1e0
[10952.770391] 0000000000000000 ffffffff81951f40 ffff88007fc03dc8 ffffffff8102ed13
[10952.770392] ffff88007fc03de8 ffffffff8102ef4b 0000000000002710 ffff88007fc0c1e0
[10952.770393] Call Trace:
[10952.770395] <IRQ>
[10952.770396] [<ffffffff8102ecaa>] _flat_send_IPI_mask+0x74/0x9e
[10952.770398] [<ffffffff8102ed13>] flat_send_IPI_mask+0x11/0x13
[10952.770399] [<ffffffff8102ef4b>] flat_send_IPI_all+0x24/0x51
[10952.770403] [<ffffffff8102c49e>] arch_trigger_all_cpu_backtrace+0x4f/0x79
[10952.770406] [<ffffffff810bfc67>] rcu_check_callbacks+0x1e9/0x552
[10952.770409] [<ffffffff81059471>] update_process_times+0x86/0xae
[10952.770411] [<ffffffff8108f9eb>] ? tick_sched_do_timer+0x2f/0x2f
[10952.770413] [<ffffffff8108f438>] tick_sched_handle+0x4d/0x5c
[10952.770414] [<ffffffff8108fa25>] tick_sched_timer+0x3a/0x58
[10952.770416] [<ffffffff8106f2c4>] __run_hrtimer+0x86/0x144
[10952.770418] [<ffffffff8106fd9c>] hrtimer_interrupt+0x119/0x20f
[10952.770422] [<ffffffff8154d84c>] smp_apic_timer_interrupt+0x77/0x8a
[10952.770424] [<ffffffff8154c6af>] apic_timer_interrupt+0x6f/0x80
[10952.770425] <EOI>
[10952.770427] [<ffffffff8154adea>] ? retint_restore_args+0x13/0x13
[10952.770440] [<ffffffff81344b62>] ? io_serial_in+0x1b/0x20
[10952.770441] [<ffffffff813449c5>] serial_port_in+0xd/0xf
[10952.770445] [<ffffffff81346864>] serial8250_poll+0x2c/0xec
[10952.770446] [<ffffffff81342956>] uartdrv_console_write+0x1ec/0x284
[10952.770450] [<ffffffff81049bb6>] call_console_drivers.constprop.21+0xcd/0x121
[10952.770452] [<ffffffff8104a8ff>] console_unlock+0x26e/0x314
[10952.770454] [<ffffffff8104ae04>] vprintk_emit+0x45f/0x4f4
[10952.770458] [<ffffffff815413de>] printk+0x54/0x56
[10952.770459] [<ffffffff815493c6>] ? preempt_schedule+0x3c/0x61
[10952.770469] [<ffffffff813ee471>] ata_dev_printk+0x65/0x67
[10952.770470] [<ffffffff815497a7>] ? rt_spin_lock_slowlock+0x48/0x261
[10952.770473] [<ffffffff813f99b4>] ata_eh_report+0x18a/0x8fe
[10952.770475] [<ffffffff8109426b>] ? lock_release+0x176/0x1c2
[10952.770477] [<ffffffff81093e7b>] ? lock_acquire+0xb5/0x112
[10952.770478] [<ffffffff815497a7>] ? rt_spin_lock_slowlock+0x48/0x261
[10952.770480] [<ffffffff813f7ab4>] ? speed_down_verdict_cb+0x2b/0x3f
[10952.770481] [<ffffffff813f8484>] ? ata_ering_map+0x3f/0x5f
[10952.770482] [<ffffffff813f9670>] ? ata_eh_link_autopsy+0x566/0x62d
[10952.770486] [<ffffffff813fdd4e>] ? ata_sff_dev_classify+0xcc/0xcc
[10952.770488] [<ffffffff813fdd4e>] ? ata_sff_dev_classify+0xcc/0xcc
[10952.770489] [<ffffffff813fc27e>] ata_do_eh+0x30/0x98
[10952.770491] [<ffffffff81405a58>] ? pci_write_config_word+0x19/0x19
[10952.770492] [<ffffffff813fdd4e>] ? ata_sff_dev_classify+0xcc/0xcc
[10952.770494] [<ffffffff813fdeb2>] ? ata_sff_softreset+0x164/0x164
[10952.770495] [<ffffffff813fe08e>] ata_sff_error_handler+0xe6/0xef
[10952.770497] [<ffffffff813fe42f>] ata_bmdma_error_handler+0xf0/0xf7
[10952.770498] [<ffffffff813fbe2f>] ata_scsi_port_error_handler+0x25d/0x5b5
[10952.770499] [<ffffffff813fc225>] ata_scsi_error+0x9e/0xc7
[10952.770507] [<ffffffff8138077d>] scsi_error_handler+0xa0/0x3e3
[10952.770511] [<ffffffff81075f8e>] ? need_resched+0x31/0x3d
[10952.770513] [<ffffffff815490c7>] ? __schedule+0x45d/0x4a4
[10952.770514] [<ffffffff813806dd>] ? scsi_eh_get_sense+0xa7/0xa7
[10952.770519] [<ffffffff8106bfc6>] kthread+0xa2/0xaa
[10952.770521] [<ffffffff8106bf24>] ? __kthread_parkme+0x65/0x65
[10952.770523] [<ffffffff8154b962>] ret_from_fork+0x72/0xa0
[10952.770525] [<ffffffff8106bf24>] ? __kthread_parkme+0x65/0x65
[10952.770536] Code: 45 00 31 c0 5b 41 5c 41 5d 41 5e 5d c3 83 c8 ff c3 90 55 48 89 e5 57 9d 0f 1f 44 00 00 5d c3 55 89 ff 48 89 e5 89 b7 00 a0 5f ff <5d> c3 55 89 ff 8b 87 00 a0 5f ff 48 89 e5 5d c3 55 48 8b 05 23
[10952.770537] NMI backtrace for cpu 1
[10952.770540] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G O 3.10.20-rt14+ #7
[10952.770540] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 09/30/2014
[10952.770541] task: ffff880076a10000 ti: ffff880076a10570 task.ti: ffff880076a10570
[10952.770550] RIP: 0010:[<ffffffff8103220b>] [<ffffffff8103220b>] native_safe_halt+0x6/0x8
[10952.770551] RSP: 0000:ffff8800769fdee0 EFLAGS: 00000246
[10952.770552] RAX: 00000000ffffffed RBX: ffff880076a10570 RCX: 00000000ffffffff
[10952.770552] RDX: 0100000000000000 RSI: 0000000000000001 RDI: ffffffff810162e7
[10952.770553] RBP: ffff8800769fdee0 R08: 0000000000000000 R09: 0000000000000000
[10952.770554] R10: 0000000000000001 R11: 000000000000b44f R12: ffff880076a10570
[10952.770554] R13: ffff880076a10570 R14: ffff880076a10570 R15: 0000000000000000
[10952.770555] FS: 0000000000000000(0000) GS:ffff88007fd00000(0000) knlGS:0000000000000000
[10952.770556] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[10952.770556] CR2: 0000000001b6ab80 CR3: 00000000112c6000 CR4: 00000000000407f0
[10952.770560] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[10952.770561] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[10952.770562] Stack:
[10952.770563] ffff8800769fdef0 ffffffff810162ec ffff8800769fdf00 ffffffff81016a5b
[10952.770564] ffff8800769fdf30 ffffffff810872db 0000000000000000 0000000000000000
[10952.770565] 0000000000000000 0000000000000000 ffff8800769fdf48 ffffffff8153aca9
[10952.770566] Call Trace:
[10952.770573] [<ffffffff810162ec>] default_idle+0x25/0x39
[10952.770577] [<ffffffff81016a5b>] arch_cpu_idle+0x18/0x26
[10952.770580] [<ffffffff810872db>] cpu_startup_entry+0x123/0x180
[10952.770585] [<ffffffff8153aca9>] start_secondary+0x246/0x248
[10952.770596] Code: 48 89 e5 0f 09 5d c3 55 48 89 e5 9c 58 5d c3 55 48 89 e5 57 9d 5d c3 55 48 89 e5 fa 5d c3 55 48 89 e5 fb 5d c3 55 48 89 e5 fb f4 <5d> c3 55 48 89 e5 f4 5d c3 55 49 89 ca 49 89 d1 8b 07 48 89 e5
And this one:
[10056.112139] ata1: link is slow to respond, please be patient (ready=0)
[10061.126051] ata1: device not ready (errno=-16), forcing hardreset
[10061.127471] ata1: soft resetting link
[10061.686537] ata1.00: configured for PIO0
[10061.687470] ata1.00: device reported invalid CHS sector 0
[10061.688712] ata1: EH complete
[10092.060204] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[10092.061797] ata1.00: failed command: WRITE MULTIPLE
[10092.062901] ata1.00: cmd c5/00:20:f0:b4:1f/00:00:00:00:00/e0 tag 0 pio 16384 out
[10092.062901] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
I need the guest to keep responding to commands while the NFS network is down. Do you have any suggestions?
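In case it helps to line these messages up against the outage windows, here is a minimal Python sketch that pulls the RCU stall, ATA timeout, and link reset lines out of saved dmesg output. The function name `find_events`, the labels, and the matched substrings are my own choices based only on the excerpts above, so other kernels may word these messages differently.

```python
import re

# Matches the dmesg timestamp prefix, e.g. "[10092.060204] rest of line".
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")

# Substrings copied from the log excerpts in this post.
# The labels on the left are hypothetical names, not kernel terminology.
PATTERNS = {
    "rcu_stall": "self-detected stall",
    "ata_timeout": "(timeout)",
    "ata_reset": "resetting link",
}


def find_events(lines):
    """Return (timestamp_seconds, label, message) for each matching log line."""
    events = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip lines without a dmesg timestamp
        ts, msg = float(m.group(1)), m.group(2)
        for label, needle in PATTERNS.items():
            if needle in msg:
                events.append((ts, label, msg))
                break  # one label per line is enough
    return events


if __name__ == "__main__":
    sample = [
        "[10061.127471] ata1: soft resetting link",
        "[10092.062901] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)",
        "[10952.770359] INFO: rcu_preempt self-detected stall on CPU { 0} "
        "(t=6086 jiffies g=222282 c=222281 q=1695)",
    ]
    for ts, label, msg in find_events(sample):
        print(f"{ts:12.3f}  {label:12s}  {msg}")
```

The timestamps are seconds since boot, so they would still need to be mapped to wall-clock time before comparing them with the NFS outages.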