我们在云网络上有一台由第三方提供的服务器。我们正在运行 Ubuntu 10.04 服务器版。
问题似乎总是随机发生,每天大约一到三次。top 中的平均负载通常约为 2,服务器运行良好,但在这些随机时间,平均负载会飙升至 30-35 左右,一切都会陷入停顿。无法访问我们的网站,无法在服务器上执行命令,无法执行任何操作。如果您尚未登录,甚至无法登录。
我们能够看到高平均负载的唯一方法是不断运行 top,这样当问题发生时 top 已经在运行了。似乎如果它已经在运行,它将继续正常工作,但如果它没有运行,你就无法启动它。当它进入这种状态时无法运行任何命令,这让我们很难诊断问题……而且我们并不认为自己是服务器专家。
我觉得奇怪的是,平均负载峰值如此之高,但处理器却处于闲置状态,而且有足够的可用内存。再说一遍,我根本不是专家,但我的基本理解是,如果内存可用且处理器没有达到最大容量,那么就不应该有进程在等待(很可能我搞错了)。
当我输入这段代码时,我抓住了它,因为它开始飙升,并在一切都锁定之前设法运行了一些命令。输出如下:
uname -a
Linux <server name> 2.6.32-308-ec2 #16-Ubuntu SMP Thu Sep 16 14:28:38 UTC 2010 i686 GNU/Linux
顶部
top - 10:55:08 up 15:28, 4 users, load average: 12.29, 7.01, 3.89
Tasks: 313 total, 3 running, 308 sleeping, 0 stopped, 2 zombie
Cpu(s): 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 4210704k total, 2163024k used, 2047680k free, 162320k buffers
Swap: 2096440k total, 0k used, 2096440k free, 1690464k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1 root 20 0 2804 1644 1204 S 0 0.0 0:00.18 init
2 root 20 0 0 0 0 S 0 0.0 0:00.00 kthreadd
3 root RT 0 0 0 0 R 0 0.0 0:00.08 migration/0
4 root 20 0 0 0 0 S 0 0.0 0:00.01 ksoftirqd/0
5 root RT 0 0 0 0 R 0 0.0 0:00.01 watchdog/0
6 root 20 0 0 0 0 S 0 0.0 0:00.06 events/0
ps 压缩文件
PID TTY STAT TIME COMMAND
2 ? S 0:00 [kthreadd]
3 ? R 0:00 \_ [migration/0]
4 ? S 0:00 \_ [ksoftirqd/0]
5 ? R 0:00 \_ [watchdog/0]
6 ? S 0:00 \_ [events/0]
7 ? S 0:00 \_ [cpuset]
8 ? S 0:00 \_ [khelper]
<removed a bunch of processes to save space in the post, all had status S>
1 ? Ss 0:00 /sbin/init
245 ? S 0:00 upstart-udev-bridge --daemon
251 ? S /var/run/motd.new
25554 ? S 0:00 \_ run-parts --lsbsysinit /etc/update-motd.d
25558 ? S 0:00 \_ /bin/sh /etc/update-motd.d/10-help-text
25560 ? D 0:00 \_ /bin/sh /etc/update-motd.d/10-help-text
852 ? Ss 0:00 cron
1374 ? S 0:00 \_ CRON
1377 ? Ss 0:00 | \_ /bin/sh -c /var/www/secure/caddy2_prod/scripts/main.pl
1379 ? S 0:02 | \_ /usr/bin/perl /var/www/secure/caddy2_prod/scripts/main.pl
1385 ? Z 0:00 | \_ [check.pl]
1375 ? S 0:00 \_ CRON
1376 ? Ss 0:00 \_ /bin/sh -c /var/www/secure/caddy2_test/scripts/main.pl
1378 ? S 0:00 \_ /usr/bin/perl /var/www/secure/caddy2_test/scripts/main.pl
1384 ? Z 0:00 \_ [check.pl]
855 ? Ss 0:00 atd
868 ? Ssl 6:36 /usr/sbin/mysqld
890 ? S 0:00 /bin/bash /usr/sbin/xe-daemon -p /var/run/xe-daemon.pid
25563 ? S 0:00 \_ /bin/sh /usr/sbin/xe-update-guest-attrs --memory
25564 ? D 0:00 \_ /bin/sh /usr/sbin/xe-update-guest-attrs --memory
1161 ? Ss 0:00 /usr/lib/postfix/master
3102 ? S 0:00 \_ qmgr -l -t fifo -u
22013 ? S 0:00 \_ pickup -l -t fifo -u -c
1181 ? Ssl 3:17 /usr/sbin/asterisk -p -U asterisk
1182 ? S 0:00 \_ astcanary /var/run/asterisk/alt.asterisk.canary.tweet.tweet.tweet
1222 ? Ss 0:00 /usr/sbin/apache2 -k start
31682 ? S 0:01 \_ /usr/sbin/apache2 -k start
31716 ? S 0:01 \_ /usr/sbin/apache2 -k start
13548 ? S 0:00 \_ /usr/sbin/apache2 -k start
25593 ? S 0:00 | \_ /usr/bin/perl -w /usr/lib/cgi-bin/caddy2/patch.pl
25594 ? D 0:00 | \_ /usr/bin/perl -w /usr/lib/cgi-bin/caddy2/patch.pl
13637 ? S 0:00 \_ /usr/sbin/apache2 -k start
16061 ? S 0:00 \_ /usr/sbin/apache2 -k start
23116 ? S 0:00 \_ /usr/sbin/apache2 -k start
25565 ? D 0:00 | \_ /usr/sbin/apache2 -k start
23117 ? S 0:00 \_ /usr/sbin/apache2 -k start
23118 ? S 0:00 \_ /usr/sbin/apache2 -k start
23119 ? S 0:00 \_ /usr/sbin/apache2 -k start
23121 ? S 0:00 \_ /usr/sbin/apache2 -k start
1268 tty1 Ss+ 0:00 /sbin/getty -8 38400 tty1
1396 ? S 0:00 /usr/local/caddy2/servers/test/caddy2serverd localhost caddy2test 1981
<removed a bunch of processes like the one above to save space in the post, there were about 100, all with status S>
25590 ? S 0:00 \_ /usr/local/caddy2/servers/prod/caddy2serverd localhost caddy2prod 1991
25538 ? D 0:00 /bin/bash ./impsys-snap.sh nohup
25596 ? Ss 0:00 /sbin/getty -L hvc0 9600 linux
我确实注意到有几个进程处于 D 状态,我认为这表明它是一个僵尸进程。我不知道这些是否是问题的原因,也不知道 D 状态的进程与 Z 状态的进程之间有什么区别。
如果我们认为这些是导致问题的原因,我该怎么办?我不知道是什么原因导致进程进入 D 状态,因此也不知道如何防止这种情况发生。
非常感谢您的帮助。谢谢!
更新:
我查看了我们的 kern.log,发现它充斥着如下消息:
<removed to clean up post, further detail added below>
其中一些的时间戳似乎与服务器锁定的时间一致,所以我认为这与此有关。我们也已将此信息传递给我们的服务器提供商,但这是否表明了有用的信息?如果是,这是否表明我这边存在问题,还是我的服务器提供商这边存在问题?
更新2:
以下是整个相关时间的 kern.log。在 10:52:24,平均负载开始上升。我得到了ps 压缩文件大约在 10:54:02。一两分钟后(如果 10:56:02 表示的话,可能正好是两分钟),系统变得无响应,我无法执行命令。以下是日志:
Mar 25 08:08:57 cloud kernel: [45483.026983] INFO: task apache2:9642 blocked for more than 120 seconds.
Mar 25 08:08:57 cloud kernel: [45483.026986] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 25 08:08:57 cloud kernel: [45483.026989] apache2 D ea63de60 0 9642 9068 0x00000000
Mar 25 08:08:57 cloud kernel: [45483.026992] ea63de78 00000282 ea63ded4 ea63de60 c01d9096 00000000 00000000 00000000
Mar 25 08:08:57 cloud kernel: [45483.026996] c06f61c0 ea1a6fac c06f61c0 c06f61c0 c06f61c0 ea1a6fac c06f61c0 c06f61c0
Mar 25 08:08:57 cloud kernel: [45483.027000] c1305740 0000296f ea1a6d00 ec844d00 9a352c09 0003c456 ea63dea0 c0106c51
Mar 25 08:08:57 cloud kernel: [45483.027003] Call Trace:
Mar 25 08:08:57 cloud kernel: [45483.027006] [] ? __link_path_walk+0x626/0xc20
Mar 25 08:08:57 cloud kernel: [45483.027010] [] ? sched_clock+0x21/0x80
Mar 25 08:08:57 cloud kernel: [45483.027013] [] schedule_timeout+0x175/0x250
Mar 25 08:08:57 cloud kernel: [45483.027018] [] ? sched_clock_cpu+0x14d/0x190
Mar 25 08:08:57 cloud kernel: [45483.027021] [] ? find_idlest_group+0xa8/0x1b0
Mar 25 08:08:57 cloud kernel: [45483.027023] [] wait_for_common+0xc6/0x180
Mar 25 08:08:57 cloud kernel: [45483.027026] [] ? default_wake_function+0x0/0x10
Mar 25 08:08:57 cloud kernel: [45483.027028] [] wait_for_completion+0x12/0x20
Mar 25 08:08:57 cloud kernel: [45483.027031] [] sched_migrate_task+0xe4/0x100
Mar 25 08:08:57 cloud kernel: [45483.027033] [] sched_exec+0x3b/0x50
Mar 25 08:08:57 cloud kernel: [45483.027036] [] do_execve+0xc4/0x360
Mar 25 08:08:57 cloud kernel: [45483.027038] [] sys_execve+0x28/0x60
Mar 25 08:08:57 cloud kernel: [45483.027041] [] syscall_call+0x7/0xb
Mar 25 09:27:03 cloud kernel: [50344.466167] nf_conntrack version 0.5.0 (16384 buckets, 65536 max)
Mar 25 09:27:03 cloud kernel: [50344.466452] CONFIG_NF_CT_ACCT is deprecated and will be removed soon. Please use
Mar 25 09:27:03 cloud kernel: [50344.466454] nf_conntrack.acct=1 kernel parameter, acct=1 nf_conntrack module option or
Mar 25 09:27:03 cloud kernel: [50344.466455] sysctl net.netfilter.nf_conntrack_acct=1 to enable it.
Mar 25 10:52:24 cloud kernel: [55167.785176] BUG: soft lockup - CPU#0 stuck for 61s! [swapper:0]
Mar 25 10:52:24 cloud kernel: [55167.785202] Modules linked in: nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ipv6 scsi_mod lp parport
Mar 25 10:52:24 cloud kernel: [55167.785217]
Mar 25 10:52:24 cloud kernel: [55167.785221] Pid: 0, comm: swapper Not tainted (2.6.32-308-ec2 #16-Ubuntu)
Mar 25 10:52:24 cloud kernel: [55167.785224] EIP: 0061:[] EFLAGS: 00000246 CPU: 0
Mar 25 10:52:24 cloud kernel: [55167.785228] EIP is at 0xc01013a7
Mar 25 10:52:24 cloud kernel: [55167.785230] EAX: 00000000 EBX: 00000001 ECX: 00000000 EDX: c0689f58
Mar 25 10:52:24 cloud kernel: [55167.785232] ESI: c06beb08 EDI: a8b0d3bc EBP: c0689f78 ESP: c0689f70
Mar 25 10:52:24 cloud kernel: [55167.785235] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0069
Mar 25 10:52:24 cloud kernel: [55167.785242] CR0: 8005003b CR2: b6620000 CR3: 2a65e000 CR4: 00002660
Mar 25 10:52:24 cloud kernel: [55167.785247] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
Mar 25 10:52:24 cloud kernel: [55167.785250] DR6: ffff0ff0 DR7: 00000400
Mar 25 10:52:24 cloud kernel: [55167.785252] Call Trace:
Mar 25 10:52:24 cloud kernel: [55167.785259] [] ? xen_safe_halt+0x15/0x40
Mar 25 10:52:24 cloud kernel: [55167.785264] [] xen_idle+0x29/0x80
Mar 25 10:52:24 cloud kernel: [55167.785267] [] cpu_idle+0x8f/0xc0
Mar 25 10:52:24 cloud kernel: [55167.785272] [] rest_init+0x53/0x60
Mar 25 10:52:24 cloud kernel: [55167.785278] [] start_kernel+0x379/0x37f
Mar 25 10:52:24 cloud kernel: [55167.785282] [] ? unknown_bootoption+0x0/0x1a0
Mar 25 10:52:24 cloud kernel: [55167.785286] [] i386_start_kernel+0x67/0x6e
Mar 25 10:53:30 cloud kernel: [55233.281412] BUG: soft lockup - CPU#0 stuck for 61s! [swapper:0]
Mar 25 10:53:30 cloud kernel: [55233.281421] Modules linked in: nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ipv6 scsi_mod lp parport
Mar 25 10:53:30 cloud kernel: [55233.281444]
Mar 25 10:53:30 cloud kernel: [55233.281449] Pid: 0, comm: swapper Not tainted (2.6.32-308-ec2 #16-Ubuntu)
Mar 25 10:53:30 cloud kernel: [55233.281453] EIP: 0061:[] EFLAGS: 00000246 CPU: 0
Mar 25 10:53:30 cloud kernel: [55233.281457] EIP is at 0xc01013a7
Mar 25 10:53:30 cloud kernel: [55233.281460] EAX: 00000000 EBX: 00000001 ECX: 00000000 EDX: c0689f58
Mar 25 10:53:30 cloud kernel: [55233.281463] ESI: c06beb08 EDI: a8b0d3bc EBP: c0689f78 ESP: c0689f70
Mar 25 10:53:30 cloud kernel: [55233.281466] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0069
Mar 25 10:53:30 cloud kernel: [55233.281474] CR0: 8005003b CR2: 09827024 CR3: 013c4000 CR4: 00002660
Mar 25 10:53:30 cloud kernel: [55233.281480] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
Mar 25 10:53:30 cloud kernel: [55233.281484] DR6: ffff0ff0 DR7: 00000400
Mar 25 10:53:30 cloud kernel: [55233.281487] Call Trace:
Mar 25 10:53:30 cloud kernel: [55233.281498] [] ? xen_safe_halt+0x15/0x40
Mar 25 10:53:30 cloud kernel: [55233.281504] [] xen_idle+0x29/0x80
Mar 25 10:53:30 cloud kernel: [55233.281509] [] cpu_idle+0x8f/0xc0
Mar 25 10:53:30 cloud kernel: [55233.281516] [] rest_init+0x53/0x60
Mar 25 10:53:30 cloud kernel: [55233.281524] [] start_kernel+0x379/0x37f
Mar 25 10:53:30 cloud kernel: [55233.281529] [] ? unknown_bootoption+0x0/0x1a0
Mar 25 10:53:30 cloud kernel: [55233.281535] [] i386_start_kernel+0x67/0x6e
Mar 25 10:54:36 cloud kernel: [55298.785478] BUG: soft lockup - CPU#0 stuck for 61s! [swapper:0]
Mar 25 10:54:36 cloud kernel: [55298.785538] Modules linked in: nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ipv6 scsi_mod lp parport
Mar 25 10:54:36 cloud kernel: [55298.785551]
Mar 25 10:54:36 cloud kernel: [55298.785554] Pid: 0, comm: swapper Not tainted (2.6.32-308-ec2 #16-Ubuntu)
Mar 25 10:54:36 cloud kernel: [55298.785556] EIP: 0061:[] EFLAGS: 00000246 CPU: 0
Mar 25 10:54:36 cloud kernel: [55298.785560] EIP is at 0xc01013a7
Mar 25 10:54:36 cloud kernel: [55298.785561] EAX: 00000000 EBX: 00000001 ECX: 00000000 EDX: c0689f58
Mar 25 10:54:36 cloud kernel: [55298.785563] ESI: c06beb08 EDI: a8b0d3bc EBP: c0689f78 ESP: c0689f70
Mar 25 10:54:36 cloud kernel: [55298.785565] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0069
Mar 25 10:54:36 cloud kernel: [55298.785570] CR0: 8005003b CR2: 08ddb00c CR3: 28e7a000 CR4: 00002660
Mar 25 10:54:36 cloud kernel: [55298.785573] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
Mar 25 10:54:36 cloud kernel: [55298.785575] DR6: ffff0ff0 DR7: 00000400
Mar 25 10:54:36 cloud kernel: [55298.785576] Call Trace:
Mar 25 10:54:36 cloud kernel: [55298.785584] [] ? xen_safe_halt+0x15/0x40
Mar 25 10:54:36 cloud kernel: [55298.785588] [] xen_idle+0x29/0x80
Mar 25 10:54:36 cloud kernel: [55298.785591] [] cpu_idle+0x8f/0xc0
Mar 25 10:54:36 cloud kernel: [55298.785596] [] rest_init+0x53/0x60
Mar 25 10:54:36 cloud kernel: [55298.785602] [] start_kernel+0x379/0x37f
Mar 25 10:54:36 cloud kernel: [55298.785605] [] ? unknown_bootoption+0x0/0x1a0
Mar 25 10:54:36 cloud kernel: [55298.785608] [] i386_start_kernel+0x67/0x6e
Mar 25 10:54:57 cloud kernel: [55318.911014] INFO: task impsys-snap.sh:25538 blocked for more than 120 seconds.
Mar 25 10:54:57 cloud kernel: [55318.911056] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 25 10:54:57 cloud kernel: [55318.911061] impsys-snap.s D e8ac5e60 0 25538 1 0x00000004
Mar 25 10:54:57 cloud kernel: [55318.911065] e8ac5e78 00000282 ec53bb64 e8ac5e60 c01d9498 00000000 00000000 00000000
Mar 25 10:54:57 cloud kernel: [55318.911072] c06f61c0 c13088dc c06f61c0 c06f61c0 c06f61c0 c13088dc c06f61c0 c06f61c0
Mar 25 10:54:57 cloud kernel: [55318.911077] ea1d5040 00003221 c1308630 c068d280 c012abc8 0003cd53 0004ae99 e8ac5eac
Mar 25 10:54:57 cloud kernel: [55318.911083] Call Trace:
Mar 25 10:54:57 cloud kernel: [55318.911092] [] ? __link_path_walk+0xa28/0xc20
Mar 25 10:54:57 cloud kernel: [55318.911098] [] ? inc_rt_group+0xf8/0x110
Mar 25 10:54:57 cloud kernel: [55318.911103] [] ? update_curr+0x169/0x2c0
Mar 25 10:54:57 cloud kernel: [55318.911114] [] schedule_timeout+0x175/0x250
Mar 25 10:54:57 cloud kernel: [55318.911117] [] ? check_preempt_wakeup+0x152/0x370
Mar 25 10:54:57 cloud kernel: [55318.911120] [] wait_for_common+0xc6/0x180
Mar 25 10:54:57 cloud kernel: [55318.911122] [] ? default_wake_function+0x0/0x10
Mar 25 10:54:57 cloud kernel: [55318.911125] [] wait_for_completion+0x12/0x20
Mar 25 10:54:57 cloud kernel: [55318.911128] [] sched_migrate_task+0xe4/0x100
Mar 25 10:54:57 cloud kernel: [55318.911130] [] sched_exec+0x3b/0x50
Mar 25 10:54:57 cloud kernel: [55318.911134] [] do_execve+0xc4/0x360
Mar 25 10:54:57 cloud kernel: [55318.911137] [] sys_execve+0x28/0x60
Mar 25 10:54:57 cloud kernel: [55318.911139] [] syscall_call+0x7/0xb
Mar 25 10:54:57 cloud kernel: [55318.911142] INFO: task lesspipe:25544 blocked for more than 120 seconds.
Mar 25 10:54:57 cloud kernel: [55318.911145] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 25 10:54:57 cloud kernel: [55318.911148] lesspipe D e9487e60 0 25544 25543 0x00000000
Mar 25 10:54:57 cloud kernel: [55318.911151] e9487e78 00000286 e9487ed4 e9487e60 c01d9096 c01e1204 00000000 00000000
Mar 25 10:54:57 cloud kernel: [55318.911155] c06f61c0 ec3bb4ec c06f61c0 c06f61c0 c06f61c0 ec3bb4ec c06f61c0 c06f61c0
Mar 25 10:54:57 cloud kernel: [55318.911159] ec269900 00003221 ec3bb240 c068d280 6d9860b9 0003cd53 e9487ea0 c0106c51
Mar 25 10:54:57 cloud kernel: [55318.911163] Call Trace:
Mar 25 10:54:57 cloud kernel: [55318.911165] [] ? __link_path_walk+0x626/0xc20
Mar 25 10:54:57 cloud kernel: [55318.911169] [] ? dput+0x84/0x160
Mar 25 10:54:57 cloud kernel: [55318.911172] [] ? sched_clock+0x21/0x80
Mar 25 10:54:57 cloud kernel: [55318.911175] [] schedule_timeout+0x175/0x250
Mar 25 10:54:57 cloud kernel: [55318.911179] [] ? sched_clock_cpu+0x14d/0x190
Mar 25 10:54:57 cloud kernel: [55318.911181] [] ? find_idlest_group+0xa8/0x1b0
Mar 25 10:54:57 cloud kernel: [55318.911184] [] wait_for_common+0xc6/0x180
Mar 25 10:54:57 cloud kernel: [55318.911187] [] ? default_wake_function+0x0/0x10
Mar 25 10:54:57 cloud kernel: [55318.911189] [] wait_for_completion+0x12/0x20
Mar 25 10:54:57 cloud kernel: [55318.911192] [] sched_migrate_task+0xe4/0x100
Mar 25 10:54:57 cloud kernel: [55318.911194] [] sched_exec+0x3b/0x50
Mar 25 10:54:57 cloud kernel: [55318.911197] [] do_execve+0xc4/0x360
Mar 25 10:54:57 cloud kernel: [55318.911199] [] sys_execve+0x28/0x60
Mar 25 10:54:57 cloud kernel: [55318.911201] [] syscall_call+0x7/0xb
Mar 25 10:54:57 cloud kernel: [55318.911204] INFO: task 10-help-text:25560 blocked for more than 120 seconds.
Mar 25 10:54:57 cloud kernel: [55318.911206] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 25 10:54:57 cloud kernel: [55318.911209] 10-help-text D e8995e60 0 25560 25558 0x00000000
Mar 25 10:54:57 cloud kernel: [55318.911211] e8995e78 00000282 e8995ed4 e8995e60 c01d9096 00000000 00000000 00000000
Mar 25 10:54:57 cloud kernel: [55318.911215] c06f61c0 e8af122c c06f61c0 c06f61c0 c06f61c0 e8af122c c06f61c0 c06f61c0
Mar 25 10:54:57 cloud kernel: [55318.911219] ec295e40 00003229 e8af0f80 c068d280 3d76c004 0003cd5c e8995ea0 c0106c51
Mar 25 10:54:57 cloud kernel: [55318.911223] Call Trace:
Mar 25 10:54:57 cloud kernel: [55318.911225] [] ? __link_path_walk+0x626/0xc20
Mar 25 10:54:57 cloud kernel: [55318.911228] [] ? sched_clock+0x21/0x80
Mar 25 10:54:57 cloud kernel: [55318.911231] [] schedule_timeout+0x175/0x250
Mar 25 10:54:57 cloud kernel: [55318.911233] [] ? sched_clock_cpu+0x14d/0x190
Mar 25 10:54:57 cloud kernel: [55318.911236] [] ? find_idlest_group+0xa8/0x1b0
Mar 25 10:54:57 cloud kernel: [55318.911238] [] wait_for_common+0xc6/0x180
Mar 25 10:54:57 cloud kernel: [55318.911241] [] ? default_wake_function+0x0/0x10
Mar 25 10:54:57 cloud kernel: [55318.911243] [] wait_for_completion+0x12/0x20
Mar 25 10:54:57 cloud kernel: [55318.911246] [] sched_migrate_task+0xe4/0x100
Mar 25 10:54:57 cloud kernel: [55318.911248] [] sched_exec+0x3b/0x50
Mar 25 10:54:57 cloud kernel: [55318.911251] [] do_execve+0xc4/0x360
Mar 25 10:54:57 cloud kernel: [55318.911253] [] sys_execve+0x28/0x60
Mar 25 10:54:57 cloud kernel: [55318.911256] [] syscall_call+0x7/0xb
Mar 25 10:54:57 cloud kernel: [55318.911258] INFO: task xe-update-guest:25564 blocked for more than 120 seconds.
Mar 25 10:54:57 cloud kernel: [55318.911260] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 25 10:54:57 cloud kernel: [55318.911263] xe-update-gue D e9535e60 0 25564 25563 0x00000000
Mar 25 10:54:57 cloud kernel: [55318.911265] e9535e78 00000286 e9535ed4 e9535e60 c01d9096 c01e1204 00000000 00000000
Mar 25 10:54:57 cloud kernel: [55318.911269] c06f61c0 ea09685c c06f61c0 c06f61c0 c06f61c0 ea09685c c06f61c0 c06f61c0
Mar 25 10:54:57 cloud kernel: [55318.911273] ec269ac0 0000322b ea0965b0 c068d280 f58f849c 0003cd5d e9535ea0 c0106c51
Mar 25 10:54:57 cloud kernel: [55318.911277] Call Trace:
Mar 25 10:54:57 cloud kernel: [55318.911279] [] ? __link_path_walk+0x626/0xc20
Mar 25 10:54:57 cloud kernel: [55318.911282] [] ? dput+0x84/0x160
Mar 25 10:54:57 cloud kernel: [55318.911284] [] ? sched_clock+0x21/0x80
Mar 25 10:54:57 cloud kernel: [55318.911287] [] schedule_timeout+0x175/0x250
Mar 25 10:54:57 cloud kernel: [55318.911290] [] ? sched_clock_cpu+0x14d/0x190
Mar 25 10:54:57 cloud kernel: [55318.911292] [] ? find_idlest_group+0xa8/0x1b0
Mar 25 10:54:57 cloud kernel: [55318.911294] [] wait_for_common+0xc6/0x180
Mar 25 10:54:57 cloud kernel: [55318.911297] [] ? default_wake_function+0x0/0x10
Mar 25 10:54:57 cloud kernel: [55318.911299] [] wait_for_completion+0x12/0x20
Mar 25 10:54:57 cloud kernel: [55318.911302] [] sched_migrate_task+0xe4/0x100
Mar 25 10:54:57 cloud kernel: [55318.911305] [] sched_exec+0x3b/0x50
Mar 25 10:54:57 cloud kernel: [55318.911307] [] do_execve+0xc4/0x360
Mar 25 10:54:57 cloud kernel: [55318.911310] [] sys_execve+0x28/0x60
Mar 25 10:54:57 cloud kernel: [55318.911312] [] syscall_call+0x7/0xb
Mar 25 10:54:57 cloud kernel: [55318.911314] INFO: task apache2:25565 blocked for more than 120 seconds.
Mar 25 10:54:57 cloud kernel: [55318.911316] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 25 10:54:57 cloud kernel: [55318.911319] apache2 D e899be60 0 25565 23116 0x00000000
Mar 25 10:54:57 cloud kernel: [55318.911321] e899be78 00000282 e899bed4 e899be60 c01d9096 c068afa0 00000000 00000000
Mar 25 10:54:57 cloud kernel: [55318.911325] c06f61c0 e949522c c06f61c0 c06f61c0 c06f61c0 e949522c c06f61c0 c06f61c0
Mar 25 10:54:57 cloud kernel: [55318.911329] ec31f040 0000322c e9494f80 c068d280 1fcfd913 0003cd5f e899bea0 c0106c51
Mar 25 10:54:57 cloud kernel: [55318.911333] Call Trace:
Mar 25 10:54:57 cloud kernel: [55318.911336] [] ? __link_path_walk+0x626/0xc20
Mar 25 10:54:57 cloud kernel: [55318.911338] [] ? sched_clock+0x21/0x80
Mar 25 10:54:57 cloud kernel: [55318.911341] [] schedule_timeout+0x175/0x250
Mar 25 10:54:57 cloud kernel: [55318.911343] [] ? sched_clock_cpu+0x14d/0x190
Mar 25 10:54:57 cloud kernel: [55318.911346] [] ? find_idlest_group+0xa8/0x1b0
Mar 25 10:54:57 cloud kernel: [55318.911348] [] wait_for_common+0xc6/0x180
Mar 25 10:54:57 cloud kernel: [55318.911351] [] ? default_wake_function+0x0/0x10
Mar 25 10:54:57 cloud kernel: [55318.911353] [] wait_for_completion+0x12/0x20
Mar 25 10:54:57 cloud kernel: [55318.911356] [] sched_migrate_task+0xe4/0x100
Mar 25 10:54:57 cloud kernel: [55318.911359] [] sched_exec+0x3b/0x50
Mar 25 10:54:57 cloud kernel: [55318.911361] [] do_execve+0xc4/0x360
Mar 25 10:54:57 cloud kernel: [55318.911364] [] sys_execve+0x28/0x60
Mar 25 10:54:57 cloud kernel: [55318.911366] [] syscall_call+0x7/0xb
Mar 25 10:56:02 cloud kernel: [55383.610034] BUG: soft lockup - CPU#0 stuck for 61s! [swapper:0]
Mar 25 10:56:02 cloud kernel: [55383.610056] Modules linked in: nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ipv6 scsi_mod lp parport
Mar 25 10:56:02 cloud kernel: [55383.610073]
Mar 25 10:56:02 cloud kernel: [55383.610077] Pid: 0, comm: swapper Not tainted (2.6.32-308-ec2 #16-Ubuntu)
Mar 25 10:56:02 cloud kernel: [55383.610080] EIP: 0061:[] EFLAGS: 00000246 CPU: 0
Mar 25 10:56:02 cloud kernel: [55383.610084] EIP is at 0xc01013a7
Mar 25 10:56:02 cloud kernel: [55383.610086] EAX: 00000000 EBX: 00000001 ECX: 00000000 EDX: c0689f58
Mar 25 10:56:02 cloud kernel: [55383.610089] ESI: c06beb08 EDI: a8b0d3bc EBP: c0689f78 ESP: c0689f70
Mar 25 10:56:02 cloud kernel: [55383.610091] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0069
Mar 25 10:56:02 cloud kernel: [55383.610097] CR0: 8005003b CR2: 098c801c CR3: 2a6d2000 CR4: 00002660
Mar 25 10:56:02 cloud kernel: [55383.610104] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
Mar 25 10:56:02 cloud kernel: [55383.610106] DR6: ffff0ff0 DR7: 00000400
Mar 25 10:56:02 cloud kernel: [55383.610107] Call Trace:
Mar 25 10:56:02 cloud kernel: [55383.610115] [] ? xen_safe_halt+0x15/0x40
Mar 25 10:56:02 cloud kernel: [55383.610119] [] xen_idle+0x29/0x80
Mar 25 10:56:02 cloud kernel: [55383.610122] [] cpu_idle+0x8f/0xc0
Mar 25 10:56:02 cloud kernel: [55383.610127] [] rest_init+0x53/0x60
Mar 25 10:56:02 cloud kernel: [55383.610133] [] start_kernel+0x379/0x37f
Mar 25 10:56:02 cloud kernel: [55383.610136] [] ? unknown_bootoption+0x0/0x1a0
Mar 25 10:56:02 cloud kernel: [55383.610139] [] i386_start_kernel+0x67/0x6e
答案1
D
是不可中断等待,以前是D
isk 等待,但现在通常更多地与等待网络文件系统有关。这些进程计入平均负载,这可能是您的问题。
输出中让我印象深刻的ps
是倒数第二行:正在对文件系统 (NAS?) 进行快照,快照期间可能阻止了对文件系统的所有磁盘活动。处理此问题的方法包括但不限于在其上使用并cachefs
更频繁地进行快照以加快速度;使用哪种方法取决于 NAS(无论如何,有些 NAS 在快照期间很糟糕)、可用磁盘空间和您的需求。我会首先寻找 NAS 的支持社区(不一定是供应商支持),看看其他用户想出了什么技巧来最大限度地减少快照延迟。
答案2
最后,问题其实是硬件问题。我们迁移到了另一台服务器,波夫,问题就解决了。至少它让我们认真审视了我们自己的程序中可能存在的性能问题。我们发现了一些可能被忽视但应该清理的问题。