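We have a server at work (an ESXi virtual machine) that freezes from time to time with "Kernel panic: Out of memory and no killable processes...".
The host has 12 GB of memory.
VM configuration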
- VMware ESXi
- VM version 7
- 2 CPUs
- Memory 8192 MB
- Memory reservation 0, memory limit = unlimited
- SuSE 11.3 (64-bit) + kernel 2.6.34-12
- Firebird, PostgreSQL, DB2
- PHP 5.3, PHP-FPM, lighttpd, memcached, OOo
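The machine is not heavily used, yet it crashes once a day or once every two days. Sometimes it crashes over the weekend as well.
How can I find out what is causing the server to crash?
Excerpt from the vmware.log file: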
Apr 03 07:21:22.266: vcpu-0| Vix: [17514025 vmxCommands.c:7612]: VMAutomation_HandleCLIHLTEvent. Do nothing.
Apr 03 07:21:22.266: vcpu-0| Msg_Hint: msg.monitorevent.halt (sent)
Apr 03 07:21:22.266: vcpu-0| The CPU has been disabled by the guest operating system. You will need to power off or reset the virtual machine at this point.
Apr 03 07:21:22.266: vcpu-0| ---------------------------------------
Apr 03 07:21:37.167: vmx| GuestRpcSendTimedOut: message to toolbox timed out.
Apr 03 07:21:37.167: vmx| GuestRpc: app toolbox's second ping timeout; assuming app is down
Apr 03 22:30:06.017: mks| MKS: Base polling period is 10000us
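Update I (part of /var/log/messages)
Excerpt from /var/log/messages at the point where everything (probably) starts. I am going to remove /opt/eduserver/bin/php from cron, and then we will see whether the crash happens again.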
Apr 9 22:15:02 testing /usr/sbin/cron[4312]: (root) CMD (/opt/eduserver/bin/php /srv/www/htdocs/imacs/radek/trunk/lib/views/edu_scheduler/controllers/action_scheduler.php >/var/lib/edumate/imacs/radek/trunk/scheduler )
Apr 9 22:15:20 testing kernel: [115148.493482] oom_kill_process: 3 callbacks suppressed
Apr 9 22:15:20 testing kernel: [115148.493485] php invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0
Apr 9 22:15:20 testing kernel: [115148.493488] Pid: 4317, comm: php Not tainted 2.6.34-12-desktop #1
Apr 9 22:15:20 testing kernel: [115148.493490] Call Trace:
Apr 9 22:15:20 testing kernel: [115148.493511] [<ffffffff81005ca9>] dump_trace+0x79/0x340
Apr 9 22:15:20 testing kernel: [115148.493516] [<ffffffff8149e612>] dump_stack+0x69/0x6f
Apr 9 22:15:20 testing kernel: [115148.493522] [<ffffffff810dbae0>] dump_header.clone.1+0x70/0x1a0
Apr 9 22:15:20 testing kernel: [115148.493525] [<ffffffff810dbc8e>] oom_kill_process.clone.0+0x7e/0x150
Apr 9 22:15:20 testing kernel: [115148.493529] [<ffffffff810dc0cb>] __out_of_memory+0x10b/0x180
Apr 9 22:15:20 testing kernel: [115148.493533] [<ffffffff810dc3c8>] out_of_memory+0x88/0x190
Apr 9 22:15:20 testing kernel: [115148.493536] [<ffffffff810e073a>] __alloc_pages_nodemask+0x69a/0x6b0
Apr 9 22:15:20 testing kernel: [115148.493541] [<ffffffff810e35a4>] __do_page_cache_readahead+0x114/0x290
Apr 9 22:15:20 testing kernel: [115148.493545] [<ffffffff810e389c>] ra_submit+0x1c/0x30
Apr 9 22:15:20 testing kernel: [115148.493548] [<ffffffff810d9e9f>] filemap_fault+0x3cf/0x410
Apr 9 22:15:20 testing kernel: [115148.493553] [<ffffffff810f4fc2>] __do_fault+0x52/0x520
Apr 9 22:15:20 testing kernel: [115148.493557] [<ffffffff810f9933>] handle_mm_fault+0x1a3/0x450
Apr 9 22:15:20 testing kernel: [115148.493561] [<ffffffff814a4b34>] do_page_fault+0x194/0x450
Apr 9 22:15:20 testing kernel: [115148.493565] [<ffffffff814a1fcf>] page_fault+0x1f/0x30
Apr 9 22:15:20 testing kernel: [115148.493587] [<00007f52b7d4cce5>] 0x7f52b7d4cce5
Apr 9 22:15:20 testing kernel: [115148.493588] Mem-Info:
Apr 9 22:15:20 testing kernel: [115148.493590] Node 0 DMA per-cpu:
Apr 9 22:15:20 testing kernel: [115148.493592] CPU 0: hi: 0, btch: 1 usd: 0
Apr 9 22:15:20 testing kernel: [115148.493593] CPU 1: hi: 0, btch: 1 usd: 0
Apr 9 22:15:20 testing kernel: [115148.493595] Node 0 DMA32 per-cpu:
Apr 9 22:15:20 testing kernel: [115148.493597] CPU 0: hi: 186, btch: 31 usd: 155
Apr 9 22:15:20 testing kernel: [115148.493598] CPU 1: hi: 186, btch: 31 usd: 161
Apr 9 22:15:20 testing kernel: [115148.493600] Node 0 Normal per-cpu:
Apr 9 22:15:20 testing kernel: [115148.493601] CPU 0: hi: 186, btch: 31 usd: 173
Apr 9 22:15:20 testing kernel: [115148.493603] CPU 1: hi: 186, btch: 31 usd: 57
Apr 9 22:15:20 testing kernel: [115148.493607] active_anon:1465647 inactive_anon:288016 isolated_anon:0
Apr 9 22:15:20 testing kernel: [115148.493607] active_file:129 inactive_file:784 isolated_file:0
Apr 9 22:15:20 testing kernel: [115148.493608] unevictable:0 dirty:0 writeback:0 unstable:0
Apr 9 22:15:20 testing kernel: [115148.493609] free:11853 slab_reclaimable:4721 slab_unreclaimable:64985
Apr 9 22:15:20 testing kernel: [115148.493609] mapped:14998 shmem:15500 pagetables:161144 bounce:0
Apr 9 22:15:20 testing kernel: [115148.493611] Node 0 DMA free:15812kB min:20kB low:24kB high:28kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15708kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Apr 9 22:15:20 testing kernel: [115148.493618] lowmem_reserve[]: 0 3000 8050 8050
Apr 9 22:15:20 testing kernel: [115148.493621] Node 0 DMA32 free:24432kB min:4272kB low:5340kB high:6408kB active_anon:2097640kB inactive_anon:524448kB active_file:52kB inactive_file:64kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3072160kB mlocked:0kB dirty:0kB writeback:0kB mapped:448kB shmem:360kB slab_reclaimable:1988kB slab_unreclaimable:97472kB kernel_stack:17712kB pagetables:239608kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:144 all_unreclaimable? no
Apr 9 22:15:20 testing kernel: [115148.493629] lowmem_reserve[]: 0 0 5050 5050
Apr 9 22:15:20 testing kernel: [115148.493631] Node 0 Normal free:7168kB min:7192kB low:8988kB high:10788kB active_anon:3764948kB inactive_anon:627616kB active_file:464kB inactive_file:3072kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:5171200kB mlocked:0kB dirty:0kB writeback:0kB mapped:59544kB shmem:61640kB slab_reclaimable:16896kB slab_unreclaimable:162468kB kernel_stack:28984kB pagetables:404968kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:1440 all_unreclaimable? yes
Apr 9 22:15:20 testing kernel: [115148.493639] lowmem_reserve[]: 0 0 0 0
Apr 9 22:15:20 testing kernel: [115148.493641] Node 0 DMA: 3*4kB 1*8kB 1*16kB 1*32kB 2*64kB 0*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15812kB
Apr 9 22:15:20 testing kernel: [115148.493648] Node 0 DMA32: 272*4kB 140*8kB 31*16kB 127*32kB 84*64kB 42*128kB 11*256kB 0*512kB 0*1024kB 0*2048kB 1*4096kB = 24432kB
Apr 9 22:15:20 testing kernel: [115148.493655] Node 0 Normal: 840*4kB 26*8kB 1*16kB 0*32kB 0*64kB 0*128kB 0*256kB 1*512kB 1*1024kB 1*2048kB 0*4096kB = 7168kB
Apr 9 22:15:20 testing kernel: [115148.493662] 19767 total pagecache pages
Apr 9 22:15:20 testing kernel: [115148.493663] 3345 pages in swap cache
Apr 9 22:15:20 testing kernel: [115148.493664] Swap cache stats: add 531666, delete 528321, find 103411/104065
Apr 9 22:15:20 testing kernel: [115148.493666] Free swap = 0kB
Apr 9 22:15:20 testing kernel: [115148.493667] Total swap = 2103292kB
Apr 9 22:15:20 testing kernel: [115148.514162] 2097136 pages RAM
Apr 9 22:15:20 testing kernel: [115148.514164] 48271 pages reserved
Apr 9 22:15:20 testing kernel: [115148.514165] 106772 pages shared
Apr 9 22:15:20 testing kernel: [115148.514166] 2006923 pages non-shared
Apr 9 22:15:20 testing kernel: [115148.514169] Out of memory: kill process 3016 (cron) score 308233 or a child
Apr 9 22:15:20 testing kernel: [115148.514171] Killed process 15546 (cron) vsz:50064kB, anon-rss:272kB, file-rss:32kB
Apr 9 22:16:01 testing /usr/sbin/cron[4347]: (root) CMD (/usr/bin/ruby /root/radek/scripts/freemem.rb)
Apr 9 22:17:07 testing kernel: [115255.428734] vmtoolsd invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0
Apr 9 22:17:07 testing kernel: [115255.428738] Pid: 2772, comm: vmtoolsd Not tainted 2.6.34-12-desktop #1
Apr 9 22:17:08 testing kernel: [115255.428740] Call Trace:
Apr 9 22:17:08 testing kernel: [115255.428751] [<ffffffff81005ca9>] dump_trace+0x79/0x340
Apr 9 22:17:08 testing kernel: [115255.428756] [<ffffffff8149e612>] dump_stack+0x69/0x6f
Apr 9 22:17:08 testing kernel: [115255.428762] [<ffffffff810dbae0>] dump_header.clone.1+0x70/0x1a0
Apr 9 22:17:08 testing kernel: [115255.428765] [<ffffffff810dbc8e>] oom_kill_process.clone.0+0x7e/0x150
Apr 9 22:17:08 testing kernel: [115255.428769] [<ffffffff810dc0cb>] __out_of_memory+0x10b/0x180
Apr 9 22:17:08 testing kernel: [115255.428773] [<ffffffff810dc3c8>] out_of_memory+0x88/0x190
Apr 9 22:17:08 testing kernel: [115255.428777] [<ffffffff810e073a>] __alloc_pages_nodemask+0x69a/0x6b0
Apr 9 22:17:08 testing kernel: [115255.428781] [<ffffffff810e35a4>] __do_page_cache_readahead+0x114/0x290
Apr 9 22:17:08 testing kernel: [115255.428785] [<ffffffff810e389c>] ra_submit+0x1c/0x30
Apr 9 22:17:08 testing kernel: [115255.428788] [<ffffffff810d9e9f>] filemap_fault+0x3cf/0x410
Apr 9 22:17:08 testing kernel: [115255.428793] [<ffffffff810f4fc2>] __do_fault+0x52/0x520
Apr 9 22:17:08 testing kernel: [115255.428802] [<ffffffff810f9933>] handle_mm_fault+0x1a3/0x450
Apr 9 22:17:08 testing kernel: [115255.428824] [<ffffffff814a4b34>] do_page_fault+0x194/0x450
Apr 9 22:17:08 testing kernel: [115255.428828] [<ffffffff814a1fcf>] page_fault+0x1f/0x30
Apr 9 22:17:08 testing kernel: [115255.428841] [<00007f09951973c0>] 0x7f09951973c0
Apr 9 22:17:08 testing kernel: [115255.428842] Mem-Info:
Apr 9 22:17:08 testing kernel: [115255.428844] Node 0 DMA per-cpu:
Apr 9 22:17:08 testing kernel: [115255.428846] CPU 0: hi: 0, btch: 1 usd: 0
Apr 9 22:17:08 testing kernel: [115255.428847] CPU 1: hi: 0, btch: 1 usd: 0
Apr 9 22:17:08 testing kernel: [115255.428848] Node 0 DMA32 per-cpu:
Apr 9 22:17:08 testing kernel: [115255.428850] CPU 0: hi: 186, btch: 31 usd: 44
Apr 9 22:17:08 testing kernel: [115255.428852] CPU 1: hi: 186, btch: 31 usd: 174
Apr 9 22:17:08 testing kernel: [115255.428853] Node 0 Normal per-cpu:
Apr 9 22:17:08 testing kernel: [115255.428855] CPU 0: hi: 186, btch: 31 usd: 146
Apr 9 22:17:08 testing kernel: [115255.428856] CPU 1: hi: 186, btch: 31 usd: 171
Apr 9 22:17:08 testing kernel: [115255.428860] active_anon:1464570 inactive_anon:287629 isolated_anon:0
Apr 9 22:17:08 testing kernel: [115255.428861] active_file:66 inactive_file:2047 isolated_file:64
Apr 9 22:17:08 testing kernel: [115255.428862] unevictable:0 dirty:0 writeback:0 unstable:0
Apr 9 22:17:08 testing kernel: [115255.428862] free:11882 slab_reclaimable:4727 slab_unreclaimable:64987
Apr 9 22:17:08 testing kernel: [115255.428863] mapped:15715 shmem:15500 pagetables:161192 bounce:0
Apr 9 22:17:08 testing kernel: [115255.428865] Node 0 DMA free:15812kB min:20kB low:24kB high:28kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15708kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Apr 9 22:17:08 testing kernel: [115255.428872] lowmem_reserve[]: 0 3000 8050 8050
Apr 9 22:17:08 testing kernel: [115255.428875] Node 0 DMA32 free:24448kB min:4272kB low:5340kB high:6408kB active_anon:2091648kB inactive_anon:522644kB active_file:176kB inactive_file:7944kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3072160kB mlocked:0kB dirty:0kB writeback:0kB mapped:3496kB shmem:360kB slab_reclaimable:2004kB slab_unreclaimable:97488kB kernel_stack:17712kB pagetables:239656kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:210 all_unreclaimable? yes
Apr 9 22:17:08 testing kernel: [115255.428882] lowmem_reserve[]: 0 0 5050 5050
Apr 9 22:17:08 testing kernel: [115255.428885] Node 0 Normal free:7268kB min:7192kB low:8988kB high:10788kB active_anon:3766632kB inactive_anon:627872kB active_file:88kB inactive_file:244kB unevictable:0kB isolated(anon):0kB isolated(file):256kB present:5171200kB mlocked:0kB dirty:0kB writeback:0kB mapped:59364kB shmem:61640kB slab_reclaimable:16904kB slab_unreclaimable:162460kB kernel_stack:29000kB pagetables:405112kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:129 all_unreclaimable? yes
Apr 9 22:17:08 testing kernel: [115255.428893] lowmem_reserve[]: 0 0 0 0
Apr 9 22:17:08 testing kernel: [115255.428895] Node 0 DMA: 3*4kB 1*8kB 1*16kB 1*32kB 2*64kB 0*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15812kB
Apr 9 22:17:08 testing kernel: [115255.428902] Node 0 DMA32: 278*4kB 127*8kB 33*16kB 119*32kB 81*64kB 44*128kB 6*256kB 1*512kB 1*1024kB 0*2048kB 1*4096kB = 24448kB
Apr 9 22:17:08 testing kernel: [115255.428909] Node 0 Normal: 881*4kB 20*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 1*512kB 1*1024kB 1*2048kB 0*4096kB = 7268kB
Apr 9 22:17:08 testing kernel: [115255.428915] 18755 total pagecache pages
Apr 9 22:17:08 testing kernel: [115255.428916] 1043 pages in swap cache
Apr 9 22:17:08 testing kernel: [115255.428918] Swap cache stats: add 531680, delete 530637, find 103628/104282
Apr 9 22:17:08 testing kernel: [115255.428919] Free swap = 0kB
Apr 9 22:17:08 testing kernel: [115255.428920] Total swap = 2103292kB
Apr 9 22:17:08 testing kernel: [115255.447686] 2097136 pages RAM
Apr 9 22:17:08 testing kernel: [115255.447688] 48271 pages reserved
Apr 9 22:17:08 testing kernel: [115255.447689] 64969 pages shared
Apr 9 22:17:08 testing kernel: [115255.447690] 2006202 pages non-shared
Apr 9 22:17:08 testing kernel: [115255.447693] Out of memory: kill process 3016 (cron) score 308364 or a child
Apr 9 22:17:08 testing kernel: [115255.447696] Killed process 15547 (cron) vsz:50064kB, anon-rss:316kB, file-rss:4kB
Apr 9 22:17:08 testing kernel: [115255.753860] db2sysc invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0
Apr 9 22:17:08 testing kernel: [115255.753864] Pid: 3346, comm: db2sysc Not tainted 2.6.34-12-desktop #1
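Answer 1
How much memory is allocated to the SuSE instance? Given the number of memory-hungry services you are running on it (three RDBMSs plus memcached), it will need a considerable share of the 8 GB just to run.
You should check the memory reservation and limit settings for the SuSE instance in ESXi. Bear in mind that a limit set too low can force the machine to swap or even crash.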
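To put numbers on how much of the 8 GB was in use when the OOM killer fired, the page counts in the first Mem-Info dump from the question can be converted to MiB. The sketch below is only arithmetic on values copied from that log; the 4 KiB page size is an assumption (it is the x86_64 default and matches the "4kB" buddy-list entries), and the helper names are made up for illustration. It shows roughly 6.7 GiB of anonymous memory and about 630 MiB of page tables, with swap already exhausted ("Free swap = 0kB").

```python
# Counters copied from the first "Mem-Info" dump in /var/log/messages.
# All values are in pages, as printed by the kernel.
PAGE_KIB = 4  # assumption: 4 KiB pages, the x86_64 default

mem_info_pages = {
    "active_anon": 1465647,
    "inactive_anon": 288016,
    "pagetables": 161144,
    "slab_unreclaimable": 64985,
    "free": 11853,
}
total_ram_pages = 2097136  # from the "2097136 pages RAM" line


def pages_to_kib(pages, page_kib=PAGE_KIB):
    """Convert a page count to KiB."""
    return pages * page_kib


def pages_to_mib(pages, page_kib=PAGE_KIB):
    """Convert a page count to MiB."""
    return pages * page_kib / 1024.0


def report(counters, total_pages):
    """Return one formatted line per counter, largest consumer first."""
    lines = []
    ordered = sorted(counters.items(), key=lambda kv: kv[1], reverse=True)
    for name, pages in ordered:
        share = 100.0 * pages / total_pages
        lines.append("%-20s %9.1f MiB  %5.1f%% of RAM"
                     % (name, pages_to_mib(pages), share))
    return lines


if __name__ == "__main__":
    for line in report(mem_info_pages, total_ram_pages):
        print(line)
```

Almost all of the RAM is anonymous (process) memory rather than page cache, so the pressure appears to come from the processes inside the guest, not from file caching.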
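Answer 2
You have to find the culprit that is using too much memory. You can do that with a simple script that logs the output of ps from time to time, together with a monitoring tool such as Munin.
Without a close look at what is going on, it is hard to know what is eating your memory and swap until nothing is left, although my first guess would be the databases.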
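A minimal sketch of such a logging script, assuming a Python 3 interpreter and the procps `ps` are available on the guest. The log path, the interval, and the function names are hypothetical choices for illustration. It records the processes with the largest resident set size once a minute, so the last entries before a crash show what was growing.

```python
import subprocess
import time
from datetime import datetime


def top_by_rss(ps_output, n=10):
    """Parse `ps -eo pid,rss,comm` output; return the n largest (rss_kib, pid, comm)."""
    rows = []
    for line in ps_output.splitlines():
        parts = line.strip().split(None, 2)
        # Skip the header and anything that does not look like "PID RSS COMMAND".
        if len(parts) != 3 or not (parts[0].isdigit() and parts[1].isdigit()):
            continue
        rows.append((int(parts[1]), int(parts[0]), parts[2]))
    rows.sort(reverse=True)
    return rows[:n]


def snapshot(n=10):
    """Run ps once and return the current top-n memory consumers."""
    out = subprocess.check_output(["ps", "-eo", "pid,rss,comm"],
                                  universal_newlines=True)
    return top_by_rss(out, n)


def log_forever(path="/var/log/mem-top.log", interval=60, n=10):
    """Append a timestamped top-n snapshot to `path` every `interval` seconds."""
    while True:
        stamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        with open(path, "a") as log:
            for rss_kib, pid, comm in snapshot(n):
                log.write("%s %8d KiB pid=%d %s\n" % (stamp, rss_kib, pid, comm))
        time.sleep(interval)


if __name__ == "__main__":
    log_forever()
```

RSS counts shared memory once per process, so database backends that attach to the same shared segment will look larger in total than they really are; treat the numbers as a trend rather than an exact accounting.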