I am trying to understand why the OOM killer starts killing processes when the server has plenty of free memory:
Output of uname -a:
Linux hostname 2.6.32.43-0.4.1.xs1.8.0.835.170778xen #1 SMP Wed May 29 18:06:30 EDT 2013 i686 i686 i386 GNU/Linux
Here is the output from /var/log/messages at the time:
Oct 19 10:59:13 hostname kernel: [86864613.667317] DMA free:2884kB min:76kB low:92kB high:112kB active_anon:0kB inactive_anon:0kB active_file:4kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:16256kB mlocked:0kB dirty:0kB writeback:0kB mapped:4kB shmem:0kB slab_reclaimable:112kB slab_unreclaimable:4968kB kernel_stack:1616kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Oct 19 10:59:13 hostname kernel: [86864613.667329] lowmem_reserve[]: 0 699 4021 4021
Oct 19 10:59:13 hostname kernel: [86864613.667337] Normal free:11300kB min:3424kB low:4280kB high:5136kB active_anon:0kB inactive_anon:0kB active_file:116kB inactive_file:104kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:715992kB mlocked:0kB dirty:0kB writeback:0kB mapped:180kB shmem:0kB slab_reclaimable:10700kB slab_unreclaimable:560156kB kernel_stack:2928kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:32 all_unreclaimable? no
Oct 19 10:59:13 hostname2 kernel: [86864613.667350] lowmem_reserve[]: 0 0 26574 26574
Oct 19 10:59:13 hostname kernel: [86864613.667357] HighMem free:2983676kB min:512kB low:4564kB high:8616kB active_anon:224476kB inactive_anon:69692kB active_file:47640kB inactive_file:55096kB unevictable:38204kB isolated(anon):0kB isolated(file):0kB present:3401572kB mlocked:38204kB dirty:32kB writeback:0kB mapped:36896kB shmem:2716kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Oct 19 10:59:13 hostname kernel: [86864613.667370] lowmem_reserve[]: 0 0 0 0
Oct 19 10:59:13 hostname kernel: [86864613.667375] DMA: 691*4kB 8*8kB 6*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 2924kB
Oct 19 10:59:13 hostname kernel: [86864613.667386] Normal: 2751*4kB 37*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 11300kB
Oct 19 10:59:13 hostname kernel: [86864613.667397] HighMem: 26993*4kB 32975*8kB 16252*16kB 5804*32kB 1288*64kB 227*128kB 108*256kB 51*512kB 38*1024kB 14*2048kB 472*4096kB = 2983676kB
Oct 19 10:59:13 hostname kernel: [86864613.667410] 27660 total pagecache pages
Oct 19 10:59:13 hostname kernel: [86864613.667412] 0 pages in swap cache
Oct 19 10:59:13 hostname kernel: [86864613.667415] Swap cache stats: add 0, delete 0, find 0/0
Oct 19 10:59:13 hostname kernel: [86864613.667417] Free swap = 524280kB
Oct 19 10:59:13 hostname kernel: [86864613.667419] Total swap = 524280kB
Oct 19 10:59:13 hostname kernel: [86864613.674877] 1050624 pages RAM
Oct 19 10:59:13 hostname kernel: [86864613.674885] 857090 pages HighMem
Oct 19 10:59:13 hostname kernel: [86864613.674887] 39051 pages reserved
Oct 19 10:59:13 hostname kernel: [86864613.674892] 74281 pages shared
Oct 19 10:59:13 hostname kernel: [86864613.674894] 235220 pages non-shared
Oct 19 10:59:13 hostname kernel: [86864613.674898] Out of memory: kill process 1729 (fe) score 52596 or a child
Oct 19 10:59:13 hostname kernel: [86864613.674902] Killed process 1730 (xapi)
Oct 19 10:59:13 hostname mpathalert: [error|hostname|1||http] Failed to parse HTTP response status line []
Oct 19 10:59:13 hostname xapi: [ info|hostname|0 thread_zero||watchdog] received signal: SIGKILL
Oct 19 10:59:13 hostname xapi: [ info|hostname|0 thread_zero||watchdog] xapi watchdog exiting.
Oct 19 10:59:13 hostname xapi: [ info|hostname|0 thread_zero||watchdog] Fatal: xapi died with signal -7: not restarting (watchdog never restarts on this signal)
Output of free -m:
             total       used       free     shared    buffers     cached
Mem:          4069       1236       2832          0          5        228
-/+ buffers/cache:       1002       3066
Swap:          511          0        511
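As you can see, there is plenty of free memory. How can I investigate why it kills processes when enough free memory is available?
Also, if memory really is running out, how can I check which process is causing it?
Answer 1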
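Looking at the free -m output after the process has been killed does not tell you much.
The output is dynamic: the fact that it shows no memory shortage right now does not mean that memory was available when the OOM condition was triggered.
I recommend reading this excellent Oracle article on configuring the OOM killer: http://www.oracle.com/technetwork/articles/servers-storage-dev/oom-killer-1911807.html
In the article they describe a way to exempt a PID from the OOM killer entirely, although doing so is not recommended.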
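As a rough illustration of what such an exemption looks like, here is a minimal sketch that assumes the standard /proc interface rather than quoting the article. The function name exempt_from_oom_killer and the proc_root parameter are hypothetical and exist only for this example. On kernels from 2.6.36 onward the tunable is oom_score_adj (where -1000 disables OOM kills for the process); on older kernels such as the 2.6.32 one in the question it is oom_adj (where -17 does the same). Writing to these files for another user's process requires root.

```python
import os


def exempt_from_oom_killer(pid, proc_root="/proc"):
    """Mark a process as exempt from the OOM killer.

    Returns the path of the tunable that was written.
    proc_root is a hypothetical parameter so the function can be
    tried against a scratch directory instead of the real /proc.
    """
    base = os.path.join(proc_root, str(pid))
    # Prefer the newer oom_score_adj (-1000 = never kill); fall back to
    # the legacy oom_adj (-17 = never kill) found on 2.6.32-era kernels.
    for name, value in (("oom_score_adj", "-1000"), ("oom_adj", "-17")):
        path = os.path.join(base, name)
        if os.path.exists(path):
            with open(path, "w") as fh:
                fh.write(value)
            return path
    raise FileNotFoundError(f"no OOM tunable found under {base}")
```

Calling exempt_from_oom_killer(1730) as root would target the xapi process from the log above. Keep in mind that an exempt process which leaks memory can then force the kernel to kill something else instead, which is one reason this approach is discouraged.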
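If this happens often in your environment, I would adjust sar to run every minute instead of the default every 10 minutes, so that you get a better picture of how memory consumption changes over time.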
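Besides sar, the OOM report in /var/log/messages is itself a snapshot taken at the moment of the kill, so it is worth reading zone by zone rather than relying on free -m afterwards. The sketch below is one possible way to pull the per-zone numbers out of such a report. The names parse_zones and summarize are made up for this example, and SAMPLE contains shortened copies of two lines from the log in the question (most fields removed for brevity).

```python
import re

# Per-zone lines look like "...] Normal free:11300kB min:3424kB ... present:715992kB ..."
ZONE_RE = re.compile(r"\]\s+(\w+) free:(\d+)kB")
# Any "name:<number>kB" pair, including names such as "isolated(anon)".
FIELD_RE = re.compile(r"(\w+(?:\(\w+\))?):(\d+)kB")

# Shortened copies of two lines from the question's log (fields trimmed).
SAMPLE = """\
Oct 19 10:59:13 hostname kernel: [86864613.667337] Normal free:11300kB min:3424kB low:4280kB high:5136kB present:715992kB slab_unreclaimable:560156kB
Oct 19 10:59:13 hostname kernel: [86864613.667357] HighMem free:2983676kB min:512kB low:4564kB high:8616kB present:3401572kB slab_unreclaimable:0kB
"""


def parse_zones(log_text):
    """Return {zone_name: {field: kB}} for every per-zone line in an OOM report."""
    zones = {}
    for line in log_text.splitlines():
        match = ZONE_RE.search(line)
        if match:
            zones[match.group(1)] = {k: int(v) for k, v in FIELD_RE.findall(line)}
    return zones


def summarize(zones):
    """Return (zone, free kB, unreclaimable slab kB, slab as % of present) tuples."""
    rows = []
    for name, fields in zones.items():
        slab = fields.get("slab_unreclaimable", 0)
        pct = 100.0 * slab / fields["present"]
        rows.append((name, fields["free"], slab, pct))
    return rows


if __name__ == "__main__":
    for name, free_kb, slab_kb, pct in summarize(parse_zones(SAMPLE)):
        print(f"{name:8s} free={free_kb}kB slab_unreclaimable={slab_kb}kB ({pct:.0f}% of zone)")
```

For the report in the question, this shows almost all of the free memory sitting in HighMem, while the Normal zone has only 11300 kB free and about 78% of it is taken by unreclaimable slab. On a 32-bit (i686) kernel like this one, kernel allocations have to come from low memory, so pressure in the Normal zone is a plausible explanation for an OOM kill despite the large overall free figure. The log alone does not prove that, though, and it does not identify which kernel component is holding the slab.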