我们在 Windows 7 机器上的 VirtualBox 中运行 CentOS 版本 6.3 时遇到了问题。症状如下:
几个小时甚至几天内一切都正常。然后某物发生破坏系统的情况。
发生此事后我们还能做什么:
- 访问 Web 服务器
- 使用现有的 SSH 会话运行 top 和 free
什么不起作用:
- 启动新的 SSH 会话(输入用户名和密码后挂起)
- 在现有 SSH 会话中运行 ls(挂起)
- SSI 包括从我们的 Web 服务器获取远程机器的数据
- 可能更多
当发生这种情况时,我们在服务器上看到以下内容:
- 平均负载从基本为零上升到 3 左右
- CPU 使用率仍然很低 (5%)
- 磁盘活动低(运行 iostat)
- 大量可用内存
- 有足够的可用磁盘空间
在 /var/log/messages 中我们得到以下内容:
Jun 14 01:10:48 devvm kernel: e1000 0000:00:03.0: eth0: Detected Tx Unit Hang
Jun 14 01:10:48 devvm kernel: Tx Queue <0>
Jun 14 01:10:48 devvm kernel: TDH <2e>
Jun 14 01:10:48 devvm kernel: TDT <30>
Jun 14 01:10:48 devvm kernel: next_to_use <30>
Jun 14 01:10:48 devvm kernel: next_to_clean <2e>
Jun 14 01:10:48 devvm kernel: buffer_info[next_to_clean]
Jun 14 01:10:48 devvm kernel: time_stamp <1038284db>
Jun 14 01:10:48 devvm kernel: next_to_watch <2f>
Jun 14 01:10:48 devvm kernel: jiffies <103828b42>
Jun 14 01:10:48 devvm kernel: next_to_watch.status <0>
Jun 14 01:10:50 devvm kernel: e1000 0000:00:03.0: eth0: Detected Tx Unit Hang
Jun 14 01:10:50 devvm kernel: Tx Queue <0>
Jun 14 01:10:50 devvm kernel: TDH <2e>
Jun 14 01:10:50 devvm kernel: TDT <30>
Jun 14 01:10:50 devvm kernel: next_to_use <30>
Jun 14 01:10:50 devvm kernel: next_to_clean <2e>
Jun 14 01:10:50 devvm kernel: buffer_info[next_to_clean]
Jun 14 01:10:50 devvm kernel: time_stamp <1038284db>
Jun 14 01:10:50 devvm kernel: next_to_watch <2f>
Jun 14 01:10:50 devvm kernel: jiffies <103829312>
Jun 14 01:10:50 devvm kernel: next_to_watch.status <0>
Jun 14 01:10:52 devvm kernel: ------------[ cut here ]------------
Jun 14 01:10:52 devvm kernel: WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x26d/0x280() (Not tainted)
Jun 14 01:10:52 devvm kernel: Hardware name: VirtualBox
Jun 14 01:10:52 devvm kernel: NETDEV WATCHDOG: eth0 (e1000): transmit queue 0 timed out
Jun 14 01:10:52 devvm kernel: Modules linked in: vboxsf(U) ipv6 ppdev parport_pc parport microcode sg vboxguest(U) i2c_piix4 i2c_core e1000 snd_intel8x0 snd_ac97_codec ac97_bus snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc pcnet32 mii ext4 mbcache jbd2 sd_mod crc_t10dif ahci dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Jun 14 01:10:52 devvm kernel: Pid: 0, comm: swapper Not tainted 2.6.32-279.el6.x86_64 #1
Jun 14 01:10:52 devvm kernel: Call Trace:
Jun 14 01:10:52 devvm kernel: <IRQ> [<ffffffff8106b747>] ? warn_slowpath_common+0x87/0xc0
Jun 14 01:10:52 devvm kernel: [<ffffffff8106b836>] ? warn_slowpath_fmt+0x46/0x50
Jun 14 01:10:52 devvm kernel: [<ffffffff814595fd>] ? dev_watchdog+0x26d/0x280
Jun 14 01:10:52 devvm kernel: [<ffffffff81099138>] ? sched_clock_cpu+0xb8/0x110
Jun 14 01:10:52 devvm kernel: [<ffffffff81459390>] ? dev_watchdog+0x0/0x280
Jun 14 01:10:52 devvm kernel: [<ffffffff8107e897>] ? run_timer_softirq+0x197/0x340
Jun 14 01:10:52 devvm kernel: [<ffffffff810a21c0>] ? tick_sched_timer+0x0/0xc0
Jun 14 01:10:52 devvm kernel: [<ffffffff8102b40d>] ? lapic_next_event+0x1d/0x30
Jun 14 01:10:52 devvm kernel: [<ffffffff81073ec1>] ? __do_softirq+0xc1/0x1e0
Jun 14 01:10:52 devvm kernel: [<ffffffff81096c50>] ? hrtimer_interrupt+0x140/0x250
Jun 14 01:10:52 devvm kernel: [<ffffffff8100c24c>] ? call_softirq+0x1c/0x30
Jun 14 01:10:52 devvm kernel: [<ffffffff8100de85>] ? do_softirq+0x65/0xa0
Jun 14 01:10:52 devvm kernel: [<ffffffff81073ca5>] ? irq_exit+0x85/0x90
Jun 14 01:10:52 devvm kernel: [<ffffffff81505be0>] ? smp_apic_timer_interrupt+0x70/0x9b
Jun 14 01:10:52 devvm kernel: [<ffffffff8100bc13>] ? apic_timer_interrupt+0x13/0x20
Jun 14 01:10:52 devvm kernel: <EOI> [<ffffffff810387cb>] ? native_safe_halt+0xb/0x10
Jun 14 01:10:52 devvm kernel: [<ffffffff810149cd>] ? default_idle+0x4d/0xb0
Jun 14 01:10:52 devvm kernel: [<ffffffff81009e06>] ? cpu_idle+0xb6/0x110
Jun 14 01:10:52 devvm kernel: [<ffffffff814e433a>] ? rest_init+0x7a/0x80
Jun 14 01:10:52 devvm kernel: [<ffffffff81c21f7b>] ? start_kernel+0x424/0x430
Jun 14 01:10:52 devvm kernel: [<ffffffff81c2133a>] ? x86_64_start_reservations+0x125/0x129
Jun 14 01:10:52 devvm kernel: [<ffffffff81c21438>] ? x86_64_start_kernel+0xfa/0x109
Jun 14 01:10:52 devvm kernel: ---[ end trace 2c7bb984812cf120 ]---
Jun 14 01:10:52 devvm kernel: e1000 0000:00:03.0: eth0: Reset adapter
Jun 14 01:10:53 devvm abrtd: Directory 'oops-2013-06-14-01:10:53-1537-0' creation detected
Jun 14 01:10:53 devvm abrt-dump-oops: Reported 1 kernel oopses to Abrt
Jun 14 01:10:53 devvm abrtd: Can't open file '/var/spool/abrt/oops-2013-06-14-01:10:53-1537-0/uid': No such file or directory
Jun 14 01:10:55 devvm kernel: Bridge firewalling registered
此后,我们每隔两分钟就会看到一段时间:
Jun 14 01:14:22 devvm kernel: INFO: task events/0:19 blocked for more than 120 seconds.
Jun 14 01:14:22 devvm kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 14 01:14:22 devvm kernel: events/0 D 0000000000000000 0 19 2 0x00000000
Jun 14 01:14:22 devvm kernel: ffff880116c4fb90 0000000000000046 00000000ffffffff 0000000000000008
Jun 14 01:14:22 devvm kernel: 0000000000016680 0000000000016680 ffff880028210400 0000000000016680
Jun 14 01:14:22 devvm kernel: ffff880116c4daf8 ffff880116c4ffd8 000000000000fb88 ffff880116c4daf8
Jun 14 01:14:22 devvm kernel: Call Trace:
Jun 14 01:14:22 devvm kernel: [<ffffffff8105b483>] ? perf_event_task_sched_out+0x33/0x80
Jun 14 01:14:22 devvm kernel: [<ffffffff814fe6a5>] schedule_timeout+0x215/0x2e0
Jun 14 01:14:22 devvm kernel: [<ffffffff8100975d>] ? __switch_to+0x13d/0x320
Jun 14 01:14:22 devvm kernel: [<ffffffff814fe323>] wait_for_common+0x123/0x180
Jun 14 01:14:22 devvm kernel: [<ffffffff81060250>] ? default_wake_function+0x0/0x20
Jun 14 01:14:22 devvm kernel: [<ffffffff814fe43d>] wait_for_completion+0x1d/0x20
Jun 14 01:14:22 devvm kernel: [<ffffffff8108d093>] __cancel_work_timer+0x1b3/0x1e0
Jun 14 01:14:22 devvm kernel: [<ffffffff8108cbe0>] ? wq_barrier_func+0x0/0x20
Jun 14 01:14:22 devvm kernel: [<ffffffff8108d0f0>] cancel_work_sync+0x10/0x20
Jun 14 01:14:22 devvm kernel: [<ffffffffa01c5ca5>] e1000_down_and_stop+0x25/0x50 [e1000]
Jun 14 01:14:22 devvm kernel: [<ffffffffa01cb695>] e1000_down+0x155/0x200 [e1000]
Jun 14 01:14:22 devvm kernel: [<ffffffffa01cbcb0>] ? e1000_reset_task+0x0/0xe0 [e1000]
Jun 14 01:14:22 devvm kernel: [<ffffffffa01cbd1e>] e1000_reset_task+0x6e/0xe0 [e1000]
Jun 14 01:14:22 devvm kernel: [<ffffffff8108c760>] worker_thread+0x170/0x2a0
Jun 14 01:14:22 devvm kernel: [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
Jun 14 01:14:22 devvm kernel: [<ffffffff8108c5f0>] ? worker_thread+0x0/0x2a0
Jun 14 01:14:22 devvm kernel: [<ffffffff81091d66>] kthread+0x96/0xa0
Jun 14 01:14:22 devvm kernel: [<ffffffff8100c14a>] child_rip+0xa/0x20
Jun 14 01:14:22 devvm kernel: [<ffffffff81091cd0>] ? kthread+0x0/0xa0
Jun 14 01:14:22 devvm kernel: [<ffffffff8100c140>] ? child_rip+0x0/0x20
Jun 14 01:14:22 devvm kernel: INFO: task parted:8069 blocked for more than 120 seconds.
Jun 14 01:14:22 devvm kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 14 01:14:22 devvm kernel: parted D 0000000000000003 0 8069 7994 0x00000080
Jun 14 01:14:22 devvm kernel: ffff8800908b3bb8 0000000000000082 0000000000000000 ffff88010ab50080
Jun 14 01:14:22 devvm kernel: ffff880116c7d500 0000000000000001 0000000000000000 0000000000000000
Jun 14 01:14:22 devvm kernel: ffff88010ab50638 ffff8800908b3fd8 000000000000fb88 ffff88010ab50638
Jun 14 01:14:22 devvm kernel: Call Trace:
Jun 14 01:14:22 devvm kernel: [<ffffffff814fe6a5>] schedule_timeout+0x215/0x2e0
Jun 14 01:14:22 devvm kernel: [<ffffffff814fe323>] wait_for_common+0x123/0x180
Jun 14 01:14:22 devvm kernel: [<ffffffff81060250>] ? default_wake_function+0x0/0x20
Jun 14 01:14:22 devvm kernel: [<ffffffff8112b6d0>] ? lru_add_drain_per_cpu+0x0/0x10
Jun 14 01:14:22 devvm kernel: [<ffffffff814fe43d>] wait_for_completion+0x1d/0x20
Jun 14 01:14:22 devvm kernel: [<ffffffff8108d177>] flush_work+0x77/0xc0
Jun 14 01:14:22 devvm kernel: [<ffffffff8108cbe0>] ? wq_barrier_func+0x0/0x20
Jun 14 01:14:22 devvm kernel: [<ffffffff8108d2f3>] schedule_on_each_cpu+0x133/0x180
Jun 14 01:14:22 devvm kernel: [<ffffffff811ad440>] ? invalidate_bh_lru+0x0/0x50
Jun 14 01:14:22 devvm kernel: [<ffffffff8112ae35>] lru_add_drain_all+0x15/0x20
Jun 14 01:14:22 devvm kernel: [<ffffffff811adf6a>] invalidate_bdev+0x2a/0x50
Jun 14 01:14:22 devvm kernel: [<ffffffff8125e9a4>] blkdev_ioctl+0x3b4/0x6e0
Jun 14 01:14:22 devvm kernel: [<ffffffff811b381c>] block_ioctl+0x3c/0x40
Jun 14 01:14:22 devvm kernel: [<ffffffff8118dec2>] vfs_ioctl+0x22/0xa0
Jun 14 01:14:22 devvm kernel: [<ffffffff8118e064>] do_vfs_ioctl+0x84/0x580
Jun 14 01:14:22 devvm kernel: [<ffffffff8118e5e1>] sys_ioctl+0x81/0xa0
Jun 14 01:14:22 devvm kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
在/var/spool/abrt/oops-2013-06-14-01:10:53-1537-0中我们可以看到以下信息:
在回溯中:
WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x26d/0x280() (Not tainted)
Hardware name: VirtualBox
NETDEV WATCHDOG: eth0 (e1000): transmit queue 0 timed out
Modules linked in: vboxsf(U) ipv6 ppdev parport_pc parport microcode sg vboxguest(U) i2c_piix4 i2c_core e1000 snd_intel8x0 snd_ac97_codec ac97_bus snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc pcnet32 mii ext4 mbcache jbd2 sd_mod crc_t10dif ahci dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Pid: 0, comm: swapper Not tainted 2.6.32-279.el6.x86_64 #1
Call Trace:
<IRQ> [<ffffffff8106b747>] ? warn_slowpath_common+0x87/0xc0
[<ffffffff8106b836>] ? warn_slowpath_fmt+0x46/0x50
[<ffffffff814595fd>] ? dev_watchdog+0x26d/0x280
[<ffffffff81099138>] ? sched_clock_cpu+0xb8/0x110
[<ffffffff81459390>] ? dev_watchdog+0x0/0x280
[<ffffffff8107e897>] ? run_timer_softirq+0x197/0x340
[<ffffffff810a21c0>] ? tick_sched_timer+0x0/0xc0
[<ffffffff8102b40d>] ? lapic_next_event+0x1d/0x30
[<ffffffff81073ec1>] ? __do_softirq+0xc1/0x1e0
[<ffffffff81096c50>] ? hrtimer_interrupt+0x140/0x250
[<ffffffff8100c24c>] ? call_softirq+0x1c/0x30
[<ffffffff8100de85>] ? do_softirq+0x65/0xa0
[<ffffffff81073ca5>] ? irq_exit+0x85/0x90
[<ffffffff81505be0>] ? smp_apic_timer_interrupt+0x70/0x9b
[<ffffffff8100bc13>] ? apic_timer_interrupt+0x13/0x20
<EOI> [<ffffffff810387cb>] ? native_safe_halt+0xb/0x10
[<ffffffff810149cd>] ? default_idle+0x4d/0xb0
[<ffffffff81009e06>] ? cpu_idle+0xb6/0x110
[<ffffffff814e433a>] ? rest_init+0x7a/0x80
[<ffffffff81c21f7b>] ? start_kernel+0x424/0x430
[<ffffffff81c2133a>] ? x86_64_start_reservations+0x125/0x129
[<ffffffff81c21438>] ? x86_64_start_kernel+0xfa/0x109
在命令行中:
ro root=/dev/mapper/vg_01-lv_root rd_NO_LUKS LANG=en_US.UTF-8 KEYBOARDTYPE=pc KEYTABLE=sv-latin1 rd_NO_MD SYSFONT=latarcyrheb-sun16 rd_LVM_LV=vg_01/lv_root crashkernel=129M@0M rhgb quiet rd_LVM_LV=vg_01/lv_swap rd_NO_DM rhgb quie
附加信息:
# uname -a
Linux devvm 2.6.32-279.el6.x86_64 #1 SMP Fri Jun 22 12:19:21 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
# cat /etc/redhat-release
CentOS release 6.3 (Final)
VirtualBox 版本 4.2.6。
任何有关如何继续排除故障的见解都将不胜感激。如果您需要更多信息,请告诉我。
答案1
尝试切换 NIC 类型。可能是某个驱动程序存在错误。