每日 Apt 升级导致 oom-killer 杀死我的 java 服务器

每日 Apt 升级导致 oom-killer 杀死我的 java 服务器

我的服务器半夜因为这个原因死机了好几次。apt-upgrade 需要多少内存?合理吗?我正在尝试解读这里的日志。任何帮助或建议都非常感谢!

具有 1GB 内存的 Amazon EC2 服务器。Ubuntu 18.04.5 LTS (Bionic Beaver)

Mar 25 06:46:01 ip-172-31-30-204 systemd[1]: Starting Daily apt upgrade and clean activities...
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.631907] snapd invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=-900
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.631911] CPU: 1 PID: 31093 Comm: snapd Not tainted 5.3.0-1032-aws #34~18.04.2-Ubuntu
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.631912] Hardware name: Amazon EC2 t3.micro/, BIOS 1.0 10/16/2017
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.631912] Call Trace:
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.631921]  dump_stack+0x6d/0x95
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.631925]  dump_header+0x4f/0x200
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.631927]  oom_kill_process+0xe6/0x120
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.631929]  out_of_memory+0x109/0x510
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.631931]  __alloc_pages_slowpath+0xad1/0xe10
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.631934]  ? __switch_to_asm+0x34/0x70
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.631936]  ? __switch_to_asm+0x40/0x70
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.631938]  __alloc_pages_nodemask+0x2cd/0x320
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.631942]  alloc_pages_current+0x6a/0xe0
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.631944]  __page_cache_alloc+0x6a/0xa0
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.631946]  pagecache_get_page+0x9c/0x2b0
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.631947]  filemap_fault+0x66d/0xb60
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.631950]  ? page_add_file_rmap+0x5e/0x150
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.631952]  ? alloc_set_pte+0x113/0x5f0
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.631956]  ? xas_load+0xc/0x80
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.631957]  ? xas_find+0x16f/0x1b0
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.631959]  ? filemap_map_pages+0x18f/0x380
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.631961]  __do_fault+0x57/0x110
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.631963]  __handle_mm_fault+0xdd8/0x1260
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.631965]  handle_mm_fault+0xcb/0x210
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.631969]  __do_page_fault+0x2a1/0x4d0
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.631971]  do_page_fault+0x2c/0xe0
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.631974]  do_async_page_fault+0x54/0x70
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.631977]  async_page_fault+0x34/0x40
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.631979] RIP: 0033:0x55cf079179d8
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.631983] Code: Bad RIP value.
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.631984] RSP: 002b:000000c420063b60 EFLAGS: 00010283
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.631986] RAX: 000055cf08c6f5a0 RBX: 0000000000000000 RCX: 000055cf0869de80
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.631987] RDX: 0000000000000018 RSI: 000000c420606068 RDI: 000000c420606078
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.631988] RBP: 000000c420063b80 R08: 000000c420063b68 R09: 00000000000000ff
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.631989] R10: 0000000000000001 R11: 000000c4204f9440 R12: 0000000000000001
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.631990] R13: 0000000000000040 R14: 00000000000000d9 R15: 0000000000000000
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.631992] Mem-Info:
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.631995] active_anon:199193 inactive_anon:41 isolated_anon:0
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.631995]  active_file:39 inactive_file:31 isolated_file:0
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.631995]  unevictable:0 dirty:2 writeback:0 unstable:0
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.631995]  slab_reclaimable:8066 slab_unreclaimable:12259
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.631995]  mapped:13 shmem:190 pagetables:1921 bounce:0
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.631995]  free:12108 free_pcp:452 free_cma:0
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.631998] Node 0 active_anon:796772kB inactive_anon:164kB active_file:156kB inactive_file:124kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:52kB dirty:8kB writeback:0kB shmem:760kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.631999] Node 0 DMA free:4340kB min:756kB low:944kB high:1132kB active_anon:11088kB inactive_anon:8kB active_file:64kB inactive_file:8kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB kernel_stack:0kB pagetables:48kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.632003] lowmem_reserve[]: 0 909 909 909 909
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.632005] Node 0 DMA32 free:44092kB min:44296kB low:55368kB high:66440kB active_anon:785684kB inactive_anon:156kB active_file:52kB inactive_file:628kB unevictable:0kB writepending:8kB present:1003496kB managed:960336kB mlocked:0kB kernel_stack:4208kB pagetables:7636kB bounce:0kB free_pcp:1808kB local_pcp:396kB free_cma:0kB
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.632009] lowmem_reserve[]: 0 0 0 0 0
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.632011] Node 0 DMA: 33*4kB (UME) 25*8kB (UME) 13*16kB (UME) 15*32kB (UME) 13*64kB (UME) 4*128kB (UE) 4*256kB (UME) 2*512kB (UM) 0*1024kB 0*2048kB 0*4096kB = 4412kB
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.632019] Node 0 DMA32: 1212*4kB (UME) 689*8kB (UME) 764*16kB (UME) 317*32kB (UME) 133*64kB (UME) 25*128kB (UME) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 44440kB
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.632027] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.632028] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.632029] 285 total pagecache pages
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.632030] 0 pages in swap cache
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.632031] Swap cache stats: add 0, delete 0, find 0/0
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.632032] Free swap  = 0kB
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.632032] Total swap = 0kB
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.632033] 254872 pages RAM
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.632034] 0 pages HighMem/MovableOnly
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.632034] 10811 pages reserved
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.632035] 0 pages cma reserved
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.632035] 0 pages hwpoisoned
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.632036] Tasks state (memory values in pages):
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.632036] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.632042] [    389]     0   389    32672     2976   282624        0             0 systemd-journal
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.632045] [    408]     0   408    11118      584   118784        0         -1000 systemd-udevd
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.632047] [    432]     0   432    26475       60    98304        0             0 lvmetad
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.632049] [    554] 62583   554    35488      160   184320        0             0 systemd-timesyn
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.632051] [    713]   100   713    20051      187   180224        0             0 systemd-network
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.632053] [    747]   101   747    17696      183   180224        0             0 systemd-resolve
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.632054] [    890]     0   890    17845      393   180224        0             0 systemd-logind
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.632056] [    896]   102   896    66816      448   167936        0             0 rsyslogd
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.632058] [    897]     0   897    27603       82   118784        0             0 irqbalance
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.632060] [    898]   103   898    12580      234   139264        0          -900 dbus-daemon
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.632062] [    919]     0   919     1137       15    61440        0             0 acpid
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.632064] [    922]     0   922     7082       52   106496        0             0 atd
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.632065] [    923]     0   923     7961       75   102400        0             0 cron
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.632067] [    924]     0   924   206403      286   163840        0             0 lxcfs
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.632069] [    927]     0   927    42771     1978   233472        0             0 networkd-dispat
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.632071] [    931]     0   931    18074      190   180224        0         -1000 sshd
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.632073] [    937]     0   937    72220      242   212992        0             0 polkitd
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.632074] [    942]     0   942    46921     1979   253952        0             0 unattended-upgr
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.632076] [    947]     0   947     4102       37    77824        0             0 agetty
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.632078] [    950]     0   950     3721       33    65536        0             0 agetty
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.632080] [   8489]     0  8489    19170      278   188416        0             0 systemd
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.632082] [   8490]     0  8490    47880      618   249856        0             0 (sd-pam)
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.632084] [  21005]   106 21005     7148       45    98304        0             0 uuidd
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.632086] [  22103]     0 22103    71999      237   192512        0             0 accounts-daemon
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.632088] [  12048]     0 12048   217770     2487   229376        0             0 amazon-ssm-agen
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.632090] [  12165]     0 12165   221541     4204   262144        0             0 ssm-agent-worke
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.632091] [  15998]     0 15998    26998      262   245760        0             0 sshd
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.632093] [  30993]     0 30993   215444     4093   262144        0          -900 snapd
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.632095] [  32272]     0 32272     8069      198    94208        0             0 screen
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.632097] [  32274]     0 32274   621975    10325   376832        0             0 java
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.632099] [  32306]     0 32306   726790   135577  1568768        0             0 java
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.632101] [   6081]     0  6081    18075      183   184320        0             0 sshd
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.632103] [   6082]   109  6082    18075      188   176128        0             0 sshd
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.632104] [   6086]     0  6086     1156       17    53248        0             0 apt.systemd.dai
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.632106] [   6101]     0  6101     1156       35    61440        0             0 apt.systemd.dai
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.632108] [   6134]     0  6134    88545    10780   462848        0             0 unattended-upgr
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.632109] [   6153]     0  6153    88545    10796   446464        0             0 unattended-upgr
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.632111] [   6171]     0  6171    22038    15667   221184        0             0 dpkg
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.632113] [   6179]     0  6179     5346       45    81920        0             0 dpkg-deb
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.632114] [   6180]     0  6180     5346       45    77824        0             0 dpkg-deb
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.632116] [   6181]     0  6181     7395      979    94208        0             0 dpkg-deb
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.632117] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice,task=java,pid=32306,uid=0
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.632288] Out of memory: Killed process 32306 (java) total-vm:2907160kB, anon-rss:542308kB, file-rss:0kB, shmem-rss:0kB
Mar 25 06:46:32 ip-172-31-30-204 kernel: [17858349.681431] oom_reaper: reaped process 32306 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Mar 25 06:46:51 ip-172-31-30-204 systemd[1]: Started Daily apt upgrade and clean activities.

答案1

我对 Amazon EC2 不是很了解,但我在运行 Jira(用 Java 编写)的 DigitalOcean 节点上也遇到了类似的问题。

Java(Python、Ada、JavaScript、C#……)等语言的一个主要问题是它们使用所谓的垃圾收集。因此,你永远不需要释放对象,它释放自身一旦所有对它的引用切断

一个问题可能是垃圾收集器不按要求频繁运行,以保持内存量尽可能低。即,如果服务器繁忙,它将在尝试收集内存之前尝试处理请求。这是启动 Java 程序时可以设置的参数之一。

我使用的其他选项记忆-Xms等,它们告诉 Java 允许分配多少内存。当所有内存都分配完毕后,垃圾收集器将被迫运行,以尽可能多地挽救内存,并且 Java 仍可能因错误而停止(即,如果您有泄漏或太多连接同时需要超过允许的内存),但并行运行 APT 不应该最终导致您的 Java 应用程序终止。

最后,为了解决我的问题,除了验证我的参数之外,我所做的就是添加交换文件。这会占用一些磁盘空间,但我认为值得的是不要让您的服务一直消失。但是,它不应该被滥用...如果您一直在交换,最好买一台更大的计算机(将 RAM 增加一倍),否则您的系统会变得非常慢。但如果自动 apt 更新需要大约 15 分钟,那么在那时交换一点而不是崩溃是值得的。

另一个解决方案是偶尔重启您的服务。这样,尚未收集的内存将被强制收集。当然,在您重启服务时尝试连接的任何用户都会收到错误(500 或 503)。

注意:在 APT 自动更新发生时,您可能会获得更多点击。如果是这种情况,您可以考虑更改 APT 进程的运行时间。不过,这很烦人,因为它在 cron 上被定义为“每日”,并且所有每日任务都在给定时间一个接一个地运行。您必须将所有这些进程移到一天中的另一个时间,或者更改 APT 更新的设置方式。

相关内容