Update I (excerpt from /var/log/messages)

We have a server at work (an ESXi virtual machine) that freezes from time to time with "Kernel panic: Out of memory and no killable processes...".

The host has 12 GB of RAM.

VM configuration

  • VMware ESXi
    • VM version 7
    • 2 CPUs
    • Memory: 8192 MB
    • Memory reservation: 0; memory limit: unlimited
  • SuSE 11.3 (64-bit), kernel 2.6.34-12

  • firebird, postgresql, db2

  • php5.3, PHP-FPM, LIGHTTPD, MEMCACHED, OOo

The machine is not heavily used. It crashes once a day or once every two days, and sometimes over the weekend as well.

How can I find out what is causing the server to crash?

Excerpt from the vmware.log file

Apr 03 07:21:22.266: vcpu-0| Vix: [17514025 vmxCommands.c:7612]: VMAutomation_HandleCLIHLTEvent. Do nothing.
Apr 03 07:21:22.266: vcpu-0| Msg_Hint: msg.monitorevent.halt (sent)
Apr 03 07:21:22.266: vcpu-0| The CPU has been disabled by the guest operating system. You will need to power off or reset the virtual machine at this point.
Apr 03 07:21:22.266: vcpu-0| ---------------------------------------
Apr 03 07:21:37.167: vmx| GuestRpcSendTimedOut: message to toolbox timed out.
Apr 03 07:21:37.167: vmx| GuestRpc: app toolbox's second ping timeout; assuming app is down
Apr 03 22:30:06.017: mks| MKS: Base polling period is 10000us

Update I (excerpt from /var/log/messages)

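Excerpt from /var/log/messages, starting at the point where everything (probably) begins. I will remove /opt/eduserver/bin/php from cron, and then we will see whether the crash happens again.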

Apr  9 22:15:02 testing /usr/sbin/cron[4312]: (root) CMD (/opt/eduserver/bin/php /srv/www/htdocs/imacs/radek/trunk/lib/views/edu_scheduler/controllers/action_scheduler.php >/var/lib/edumate/imacs/radek/trunk/scheduler )
Apr  9 22:15:20 testing kernel: [115148.493482] oom_kill_process: 3 callbacks suppressed
Apr  9 22:15:20 testing kernel: [115148.493485] php invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0
Apr  9 22:15:20 testing kernel: [115148.493488] Pid: 4317, comm: php Not tainted 2.6.34-12-desktop #1
Apr  9 22:15:20 testing kernel: [115148.493490] Call Trace:
Apr  9 22:15:20 testing kernel: [115148.493511]  [<ffffffff81005ca9>] dump_trace+0x79/0x340
Apr  9 22:15:20 testing kernel: [115148.493516]  [<ffffffff8149e612>] dump_stack+0x69/0x6f
Apr  9 22:15:20 testing kernel: [115148.493522]  [<ffffffff810dbae0>] dump_header.clone.1+0x70/0x1a0
Apr  9 22:15:20 testing kernel: [115148.493525]  [<ffffffff810dbc8e>] oom_kill_process.clone.0+0x7e/0x150
Apr  9 22:15:20 testing kernel: [115148.493529]  [<ffffffff810dc0cb>] __out_of_memory+0x10b/0x180
Apr  9 22:15:20 testing kernel: [115148.493533]  [<ffffffff810dc3c8>] out_of_memory+0x88/0x190
Apr  9 22:15:20 testing kernel: [115148.493536]  [<ffffffff810e073a>] __alloc_pages_nodemask+0x69a/0x6b0
Apr  9 22:15:20 testing kernel: [115148.493541]  [<ffffffff810e35a4>] __do_page_cache_readahead+0x114/0x290
Apr  9 22:15:20 testing kernel: [115148.493545]  [<ffffffff810e389c>] ra_submit+0x1c/0x30
Apr  9 22:15:20 testing kernel: [115148.493548]  [<ffffffff810d9e9f>] filemap_fault+0x3cf/0x410
Apr  9 22:15:20 testing kernel: [115148.493553]  [<ffffffff810f4fc2>] __do_fault+0x52/0x520
Apr  9 22:15:20 testing kernel: [115148.493557]  [<ffffffff810f9933>] handle_mm_fault+0x1a3/0x450
Apr  9 22:15:20 testing kernel: [115148.493561]  [<ffffffff814a4b34>] do_page_fault+0x194/0x450
Apr  9 22:15:20 testing kernel: [115148.493565]  [<ffffffff814a1fcf>] page_fault+0x1f/0x30
Apr  9 22:15:20 testing kernel: [115148.493587]  [<00007f52b7d4cce5>] 0x7f52b7d4cce5
Apr  9 22:15:20 testing kernel: [115148.493588] Mem-Info:
Apr  9 22:15:20 testing kernel: [115148.493590] Node 0 DMA per-cpu:
Apr  9 22:15:20 testing kernel: [115148.493592] CPU    0: hi:    0, btch:   1 usd:   0
Apr  9 22:15:20 testing kernel: [115148.493593] CPU    1: hi:    0, btch:   1 usd:   0
Apr  9 22:15:20 testing kernel: [115148.493595] Node 0 DMA32 per-cpu:
Apr  9 22:15:20 testing kernel: [115148.493597] CPU    0: hi:  186, btch:  31 usd: 155
Apr  9 22:15:20 testing kernel: [115148.493598] CPU    1: hi:  186, btch:  31 usd: 161
Apr  9 22:15:20 testing kernel: [115148.493600] Node 0 Normal per-cpu:
Apr  9 22:15:20 testing kernel: [115148.493601] CPU    0: hi:  186, btch:  31 usd: 173
Apr  9 22:15:20 testing kernel: [115148.493603] CPU    1: hi:  186, btch:  31 usd:  57
Apr  9 22:15:20 testing kernel: [115148.493607] active_anon:1465647 inactive_anon:288016 isolated_anon:0
Apr  9 22:15:20 testing kernel: [115148.493607]  active_file:129 inactive_file:784 isolated_file:0
Apr  9 22:15:20 testing kernel: [115148.493608]  unevictable:0 dirty:0 writeback:0 unstable:0
Apr  9 22:15:20 testing kernel: [115148.493609]  free:11853 slab_reclaimable:4721 slab_unreclaimable:64985
Apr  9 22:15:20 testing kernel: [115148.493609]  mapped:14998 shmem:15500 pagetables:161144 bounce:0
Apr  9 22:15:20 testing kernel: [115148.493611] Node 0 DMA free:15812kB min:20kB low:24kB high:28kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15708kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Apr  9 22:15:20 testing kernel: [115148.493618] lowmem_reserve[]: 0 3000 8050 8050
Apr  9 22:15:20 testing kernel: [115148.493621] Node 0 DMA32 free:24432kB min:4272kB low:5340kB high:6408kB active_anon:2097640kB inactive_anon:524448kB active_file:52kB inactive_file:64kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3072160kB mlocked:0kB dirty:0kB writeback:0kB mapped:448kB shmem:360kB slab_reclaimable:1988kB slab_unreclaimable:97472kB kernel_stack:17712kB pagetables:239608kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:144 all_unreclaimable? no
Apr  9 22:15:20 testing kernel: [115148.493629] lowmem_reserve[]: 0 0 5050 5050
Apr  9 22:15:20 testing kernel: [115148.493631] Node 0 Normal free:7168kB min:7192kB low:8988kB high:10788kB active_anon:3764948kB inactive_anon:627616kB active_file:464kB inactive_file:3072kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:5171200kB mlocked:0kB dirty:0kB writeback:0kB mapped:59544kB shmem:61640kB slab_reclaimable:16896kB slab_unreclaimable:162468kB kernel_stack:28984kB pagetables:404968kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:1440 all_unreclaimable? yes
Apr  9 22:15:20 testing kernel: [115148.493639] lowmem_reserve[]: 0 0 0 0
Apr  9 22:15:20 testing kernel: [115148.493641] Node 0 DMA: 3*4kB 1*8kB 1*16kB 1*32kB 2*64kB 0*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15812kB
Apr  9 22:15:20 testing kernel: [115148.493648] Node 0 DMA32: 272*4kB 140*8kB 31*16kB 127*32kB 84*64kB 42*128kB 11*256kB 0*512kB 0*1024kB 0*2048kB 1*4096kB = 24432kB
Apr  9 22:15:20 testing kernel: [115148.493655] Node 0 Normal: 840*4kB 26*8kB 1*16kB 0*32kB 0*64kB 0*128kB 0*256kB 1*512kB 1*1024kB 1*2048kB 0*4096kB = 7168kB
Apr  9 22:15:20 testing kernel: [115148.493662] 19767 total pagecache pages
Apr  9 22:15:20 testing kernel: [115148.493663] 3345 pages in swap cache
Apr  9 22:15:20 testing kernel: [115148.493664] Swap cache stats: add 531666, delete 528321, find 103411/104065
Apr  9 22:15:20 testing kernel: [115148.493666] Free swap  = 0kB
Apr  9 22:15:20 testing kernel: [115148.493667] Total swap = 2103292kB
Apr  9 22:15:20 testing kernel: [115148.514162] 2097136 pages RAM
Apr  9 22:15:20 testing kernel: [115148.514164] 48271 pages reserved
Apr  9 22:15:20 testing kernel: [115148.514165] 106772 pages shared
Apr  9 22:15:20 testing kernel: [115148.514166] 2006923 pages non-shared
Apr  9 22:15:20 testing kernel: [115148.514169] Out of memory: kill process 3016 (cron) score 308233 or a child
Apr  9 22:15:20 testing kernel: [115148.514171] Killed process 15546 (cron) vsz:50064kB, anon-rss:272kB, file-rss:32kB
Apr  9 22:16:01 testing /usr/sbin/cron[4347]: (root) CMD (/usr/bin/ruby /root/radek/scripts/freemem.rb)
Apr  9 22:17:07 testing kernel: [115255.428734] vmtoolsd invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0
Apr  9 22:17:07 testing kernel: [115255.428738] Pid: 2772, comm: vmtoolsd Not tainted 2.6.34-12-desktop #1
Apr  9 22:17:08 testing kernel: [115255.428740] Call Trace:
Apr  9 22:17:08 testing kernel: [115255.428751]  [<ffffffff81005ca9>] dump_trace+0x79/0x340
Apr  9 22:17:08 testing kernel: [115255.428756]  [<ffffffff8149e612>] dump_stack+0x69/0x6f
Apr  9 22:17:08 testing kernel: [115255.428762]  [<ffffffff810dbae0>] dump_header.clone.1+0x70/0x1a0
Apr  9 22:17:08 testing kernel: [115255.428765]  [<ffffffff810dbc8e>] oom_kill_process.clone.0+0x7e/0x150
Apr  9 22:17:08 testing kernel: [115255.428769]  [<ffffffff810dc0cb>] __out_of_memory+0x10b/0x180
Apr  9 22:17:08 testing kernel: [115255.428773]  [<ffffffff810dc3c8>] out_of_memory+0x88/0x190
Apr  9 22:17:08 testing kernel: [115255.428777]  [<ffffffff810e073a>] __alloc_pages_nodemask+0x69a/0x6b0
Apr  9 22:17:08 testing kernel: [115255.428781]  [<ffffffff810e35a4>] __do_page_cache_readahead+0x114/0x290
Apr  9 22:17:08 testing kernel: [115255.428785]  [<ffffffff810e389c>] ra_submit+0x1c/0x30
Apr  9 22:17:08 testing kernel: [115255.428788]  [<ffffffff810d9e9f>] filemap_fault+0x3cf/0x410
Apr  9 22:17:08 testing kernel: [115255.428793]  [<ffffffff810f4fc2>] __do_fault+0x52/0x520
Apr  9 22:17:08 testing kernel: [115255.428802]  [<ffffffff810f9933>] handle_mm_fault+0x1a3/0x450
Apr  9 22:17:08 testing kernel: [115255.428824]  [<ffffffff814a4b34>] do_page_fault+0x194/0x450
Apr  9 22:17:08 testing kernel: [115255.428828]  [<ffffffff814a1fcf>] page_fault+0x1f/0x30
Apr  9 22:17:08 testing kernel: [115255.428841]  [<00007f09951973c0>] 0x7f09951973c0
Apr  9 22:17:08 testing kernel: [115255.428842] Mem-Info:
Apr  9 22:17:08 testing kernel: [115255.428844] Node 0 DMA per-cpu:
Apr  9 22:17:08 testing kernel: [115255.428846] CPU    0: hi:    0, btch:   1 usd:   0
Apr  9 22:17:08 testing kernel: [115255.428847] CPU    1: hi:    0, btch:   1 usd:   0
Apr  9 22:17:08 testing kernel: [115255.428848] Node 0 DMA32 per-cpu:
Apr  9 22:17:08 testing kernel: [115255.428850] CPU    0: hi:  186, btch:  31 usd:  44
Apr  9 22:17:08 testing kernel: [115255.428852] CPU    1: hi:  186, btch:  31 usd: 174
Apr  9 22:17:08 testing kernel: [115255.428853] Node 0 Normal per-cpu:
Apr  9 22:17:08 testing kernel: [115255.428855] CPU    0: hi:  186, btch:  31 usd: 146
Apr  9 22:17:08 testing kernel: [115255.428856] CPU    1: hi:  186, btch:  31 usd: 171
Apr  9 22:17:08 testing kernel: [115255.428860] active_anon:1464570 inactive_anon:287629 isolated_anon:0
Apr  9 22:17:08 testing kernel: [115255.428861]  active_file:66 inactive_file:2047 isolated_file:64
Apr  9 22:17:08 testing kernel: [115255.428862]  unevictable:0 dirty:0 writeback:0 unstable:0
Apr  9 22:17:08 testing kernel: [115255.428862]  free:11882 slab_reclaimable:4727 slab_unreclaimable:64987
Apr  9 22:17:08 testing kernel: [115255.428863]  mapped:15715 shmem:15500 pagetables:161192 bounce:0
Apr  9 22:17:08 testing kernel: [115255.428865] Node 0 DMA free:15812kB min:20kB low:24kB high:28kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15708kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Apr  9 22:17:08 testing kernel: [115255.428872] lowmem_reserve[]: 0 3000 8050 8050
Apr  9 22:17:08 testing kernel: [115255.428875] Node 0 DMA32 free:24448kB min:4272kB low:5340kB high:6408kB active_anon:2091648kB inactive_anon:522644kB active_file:176kB inactive_file:7944kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3072160kB mlocked:0kB dirty:0kB writeback:0kB mapped:3496kB shmem:360kB slab_reclaimable:2004kB slab_unreclaimable:97488kB kernel_stack:17712kB pagetables:239656kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:210 all_unreclaimable? yes
Apr  9 22:17:08 testing kernel: [115255.428882] lowmem_reserve[]: 0 0 5050 5050
Apr  9 22:17:08 testing kernel: [115255.428885] Node 0 Normal free:7268kB min:7192kB low:8988kB high:10788kB active_anon:3766632kB inactive_anon:627872kB active_file:88kB inactive_file:244kB unevictable:0kB isolated(anon):0kB isolated(file):256kB present:5171200kB mlocked:0kB dirty:0kB writeback:0kB mapped:59364kB shmem:61640kB slab_reclaimable:16904kB slab_unreclaimable:162460kB kernel_stack:29000kB pagetables:405112kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:129 all_unreclaimable? yes
Apr  9 22:17:08 testing kernel: [115255.428893] lowmem_reserve[]: 0 0 0 0
Apr  9 22:17:08 testing kernel: [115255.428895] Node 0 DMA: 3*4kB 1*8kB 1*16kB 1*32kB 2*64kB 0*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15812kB
Apr  9 22:17:08 testing kernel: [115255.428902] Node 0 DMA32: 278*4kB 127*8kB 33*16kB 119*32kB 81*64kB 44*128kB 6*256kB 1*512kB 1*1024kB 0*2048kB 1*4096kB = 24448kB
Apr  9 22:17:08 testing kernel: [115255.428909] Node 0 Normal: 881*4kB 20*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 1*512kB 1*1024kB 1*2048kB 0*4096kB = 7268kB
Apr  9 22:17:08 testing kernel: [115255.428915] 18755 total pagecache pages
Apr  9 22:17:08 testing kernel: [115255.428916] 1043 pages in swap cache
Apr  9 22:17:08 testing kernel: [115255.428918] Swap cache stats: add 531680, delete 530637, find 103628/104282
Apr  9 22:17:08 testing kernel: [115255.428919] Free swap  = 0kB
Apr  9 22:17:08 testing kernel: [115255.428920] Total swap = 2103292kB
Apr  9 22:17:08 testing kernel: [115255.447686] 2097136 pages RAM
Apr  9 22:17:08 testing kernel: [115255.447688] 48271 pages reserved
Apr  9 22:17:08 testing kernel: [115255.447689] 64969 pages shared
Apr  9 22:17:08 testing kernel: [115255.447690] 2006202 pages non-shared
Apr  9 22:17:08 testing kernel: [115255.447693] Out of memory: kill process 3016 (cron) score 308364 or a child
Apr  9 22:17:08 testing kernel: [115255.447696] Killed process 15547 (cron) vsz:50064kB, anon-rss:316kB, file-rss:4kB
Apr  9 22:17:08 testing kernel: [115255.753860] db2sysc invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0
Apr  9 22:17:08 testing kernel: [115255.753864] Pid: 3346, comm: db2sysc Not tainted 2.6.34-12-desktop #1

Answer 1

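How much memory is allocated to the SuSE instance? Given that you are running several memory-hungry services on it (3 RDBMSs plus memcached), it will need a good portion of that 8 GB of memory just to run.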

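You need to check the memory reservation and limit settings for the SuSE instance in ESXi. Bear in mind that if the limit is set too low, it can force the machine to swap or even crash.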
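As a rough cross-check of how the guest's memory is being spent, the page counters in the question's first Mem-Info dump can be converted to MiB. This is only a sketch: it assumes 4 kB pages (the usual x86-64 default), and the helper names `pages_to_mib` and `summarize` are made up for illustration.

```python
PAGE_KB = 4  # assumption: 4 kB pages, the usual default on x86-64

# Page counts copied from the first Mem-Info dump in the question's log.
COUNTERS = {
    "pages RAM": 2097136,
    "active_anon": 1465647,
    "inactive_anon": 288016,
    "pagetables": 161144,
    "slab_unreclaimable": 64985,
    "total pagecache": 19767,
    "free": 11853,
}


def pages_to_mib(pages):
    """Convert a page count to MiB under the 4 kB page assumption."""
    return pages * PAGE_KB / 1024


def summarize(counters):
    """Return each counter rounded to whole MiB."""
    return {name: round(pages_to_mib(pages)) for name, pages in counters.items()}


for name, mib in summarize(COUNTERS).items():
    print(f"{name:20s} {mib:6d} MiB")
```

Under that assumption the guest sees about 8192 MiB of RAM, of which roughly 5725 + 1125 MiB is anonymous (process) memory, about 629 MiB is page tables, and only about 77 MiB is page cache. Together with "Free swap = 0kB" in the same dump, this suggests the guest itself ran out of RAM and swap because of process memory. The log alone does not show which process is responsible.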

Answer 2

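You have to find the culprit that is using too much memory. You can do that with a simple script that logs the output of ps from time to time, and with a monitoring tool such as Munin.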
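A minimal sketch of such a logging script follows. It assumes Python 3.7 or newer and the procps `ps` command. The log path, the `TOP_N` cutoff and all function names are hypothetical choices, not part of the original setup.

```python
import subprocess
from datetime import datetime

LOG_PATH = "/var/log/ps-mem.log"  # hypothetical location, adjust as needed
TOP_N = 15                        # how many processes to record per snapshot


def parse_ps(ps_text):
    """Parse `ps -eo pid,rss,comm` output into (pid, rss_kb, command) tuples."""
    rows = []
    for line in ps_text.splitlines()[1:]:  # skip the header line
        parts = line.split(None, 2)
        if len(parts) == 3 and parts[0].isdigit() and parts[1].isdigit():
            rows.append((int(parts[0]), int(parts[1]), parts[2].strip()))
    return rows


def top_by_rss(rows, n=TOP_N):
    """Return the n rows with the largest resident set size."""
    return sorted(rows, key=lambda row: row[1], reverse=True)[:n]


def log_snapshot(path=LOG_PATH):
    """Append one timestamped snapshot of the biggest processes to the log."""
    result = subprocess.run(
        ["ps", "-eo", "pid,rss,comm"],
        capture_output=True, text=True, check=True,
    )
    stamp = datetime.now().isoformat(timespec="seconds")
    with open(path, "a") as log:
        for pid, rss_kb, comm in top_by_rss(parse_ps(result.stdout)):
            log.write(f"{stamp} pid={pid} rss_kb={rss_kb} comm={comm}\n")
```

To use it, add a final line calling `log_snapshot()` and run the file from cron, for example once a minute. After the next crash, the last snapshots in the log show which processes were growing.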

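Without looking closely at what is going on, it is hard to know what is eating your memory and swap until none is left, although my first guess would be the databases.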
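One way to look more closely at what already happened is to tally the OOM-killer events in /var/log/messages. The sketch below assumes the line format shown in the question's excerpt. The regular expressions and the function name `oom_events` are illustrative and may need adjusting for other kernels.

```python
import re
from collections import Counter

# Patterns modelled on the kernel lines in the question's log excerpt.
INVOKED = re.compile(r"kernel: \[\s*[\d.]+\] (\S+) invoked oom-killer")
KILLED = re.compile(r"kernel: \[\s*[\d.]+\] Killed process (\d+) \((.+?)\)")


def oom_events(lines):
    """Count which commands invoked the OOM killer and which were killed."""
    invoked, killed = Counter(), Counter()
    for line in lines:
        match = INVOKED.search(line)
        if match:
            invoked[match.group(1)] += 1
            continue
        match = KILLED.search(line)
        if match:
            killed[match.group(2)] += 1
    return invoked, killed


def report(path="/var/log/messages"):
    """Print a summary for a syslog file (the path is the one from the question)."""
    with open(path, errors="replace") as handle:
        invoked, killed = oom_events(handle)
    print("invoked oom-killer:", dict(invoked))
    print("killed:", dict(killed))
```

Note that the task that invokes the OOM killer is simply the one that requested memory at that moment, not necessarily the one consuming it. In the excerpt above, the killed processes are cron children with an anon-rss of only a few hundred kB, so killing them frees almost nothing.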
