How can I detect whether a process was killed by a cgroup for exceeding a limit?


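I have defined a global cgroup in /etc/cgconfig.conf that limits the amount of memory. Every time a user runs a command, I launch it with cgexec so that the process and its children are added to the controlled group. Sometimes the limit kicks in and the user's process is killed.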

If the exit code is not 0, how can I tell whether the process failed because of its own internal logic or was killed by the cgroup mechanism?

This runs in user space, so I would like to avoid parsing /var/log/syslog.

Answer 1

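/var/log/kern.log will tell you. In the example below, it records the death of a process running in Docker's cgroups, which in turn sit inside LXC's cgroups. The telling lines are "Task in ... killed as a result of limit of ..." and "Memory cgroup out of memory: Kill process ...".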

Jan 21 12:32:59 server-hostname kernel: [5808332.413137] oom_reaper: reaped process 32190 (python), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Jan 21 17:28:27 server-hostname kernel: [5826415.492483] python invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
Jan 21 17:28:27 server-hostname kernel: [5826415.492484] python cpuset=0efdc3f815c9d3755525ecfa5bbd40829a5724d4aafb39aedb909f7af23e2d3a mems_allowed=0-1
Jan 21 17:28:27 server-hostname kernel: [5826415.492489] CPU: 9 PID: 16369 Comm: python Tainted: P           O      4.18.0-11-generic #12-Ubuntu
Jan 21 17:28:27 server-hostname kernel: [5826415.492490] Hardware name: SuperHardware SYS-112999LX-LC2-S18/X176D7U-CXL, BIOS 2.9d 04/12/2018
Jan 21 17:28:27 server-hostname kernel: [5826415.492491] Call Trace:
Jan 21 17:28:27 server-hostname kernel: [5826415.492500]  dump_stack+0x63/0x83
Jan 21 17:28:27 server-hostname kernel: [5826415.492504]  dump_header+0x71/0x278
Jan 21 17:28:27 server-hostname kernel: [5826415.492506]  oom_kill_process.cold.26+0xb/0x386
Jan 21 17:28:27 server-hostname kernel: [5826415.492507]  out_of_memory+0x1ba/0x4b0
Jan 21 17:28:27 server-hostname kernel: [5826415.492511]  mem_cgroup_out_of_memory+0x4b/0x80
Jan 21 17:28:27 server-hostname kernel: [5826415.492513]  mem_cgroup_oom_synchronize+0x31d/0x350
Jan 21 17:28:27 server-hostname kernel: [5826415.492514]  ? mem_cgroup_swappiness_read+0x40/0x40
Jan 21 17:28:27 server-hostname kernel: [5826415.492516]  pagefault_out_of_memory+0x36/0x7b
Jan 21 17:28:27 server-hostname kernel: [5826415.492521]  mm_fault_error+0x8c/0x150
Jan 21 17:28:27 server-hostname kernel: [5826415.492525]  ? handle_mm_fault+0xe1/0x210
Jan 21 17:28:27 server-hostname kernel: [5826415.492527]  __do_page_fault+0x4a1/0x4d0
Jan 21 17:28:27 server-hostname kernel: [5826415.492528]  do_page_fault+0x2e/0xe0
Jan 21 17:28:27 server-hostname kernel: [5826415.492531]  ? page_fault+0x8/0x30
Jan 21 17:28:27 server-hostname kernel: [5826415.492532]  page_fault+0x1e/0x30
Jan 21 17:28:27 server-hostname kernel: [5826415.492534] RIP: 0033:0x4a9180
Jan 21 17:28:27 server-hostname kernel: [5826415.492534] Code: Bad RIP value.
Jan 21 17:28:27 server-hostname kernel: [5826415.492539] RSP: 002b:00007ffffed1c2d8 EFLAGS: 00010246
Jan 21 17:28:27 server-hostname kernel: [5826415.492540] RAX: 00007f894175fc30 RBX: 00000000019e20b8 RCX: 0000000000000002
Jan 21 17:28:27 server-hostname kernel: [5826415.492541] RDX: 00007f894a35a330 RSI: 0000000000000001 RDI: 00007f894a35a350
Jan 21 17:28:27 server-hostname kernel: [5826415.492541] RBP: 0000000001ce72a0 R08: 00000000008f9920 R09: 0000000000000000
Jan 21 17:28:27 server-hostname kernel: [5826415.492542] R10: 00007f8942840d40 R11: 0000000000000000 R12: 00007f894175fc30
Jan 21 17:28:27 server-hostname kernel: [5826415.492542] R13: 00007f894a35a350 R14: 00000000008f9920 R15: 0000000001ce74a0
Jan 21 17:28:27 server-hostname kernel: [5826415.492543] Task in /lxc/lxcname/docker/0efdc3f815c9d3755525ecfa5bbd40829a5724d4aafb39aedb909f7af23e2d3a killed as a result of limit of /lxc/lxcname/docker/0efdc3f815c9d3755525ecfa5bbd40829a5724d4aafb39aedb909f7af23e2d3a
Jan 21 17:28:27 server-hostname kernel: [5826415.492548] memory: usage 6291456kB, limit 6291456kB, failcnt 96125
Jan 21 17:28:27 server-hostname kernel: [5826415.492549] memory+swap: usage 6291456kB, limit 12582912kB, failcnt 0
Jan 21 17:28:27 server-hostname kernel: [5826415.492549] kmem: usage 17728kB, limit 9007199254740988kB, failcnt 0
Jan 21 17:28:27 server-hostname kernel: [5826415.492550] Memory cgroup stats for /lxc/lxcname/docker/0efdc3f815c9d3755525ecfa5bbd40829a5724d4aafb39aedb909f7af23e2d3a: cache:0KB rss:6272240KB rss_huge:0KB shmem:132KB mapped_file:0KB dirty:0KB writeback:0KB swap:0KB inactive_anon:64KB active_anon:6273580KB inactive_file:0KB active_file:0KB unevictable:0KB
Jan 21 17:28:27 server-hostname kernel: [5826415.492558] [ pid ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
Jan 21 17:28:27 server-hostname kernel: [5826415.492687] [16369]  5000 16369  1580516  1568393 12701696        0             0 python
Jan 21 17:28:27 server-hostname kernel: [5826415.492873] Memory cgroup out of memory: Kill process 16369 (python) score 999 or sacrifice child
Jan 21 17:28:27 server-hostname kernel: [5826415.502126] Killed process 16369 (python) total-vm:6322064kB, anon-rss:6273396kB, file-rss:176kB, shmem-rss:0kB
Jan 21 17:28:27 server-hostname kernel: [5826415.767023] oom_reaper: reaped process 16369 (python), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
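
If scanning the kernel log is acceptable, the lines above can be filtered mechanically. The following is only a minimal sketch: the function name cgroup_oom_victims is hypothetical, and the regular expression assumes the message wording shown in this log ("Kill process"), with "Killed process" also accepted as an assumption about how other kernel versions phrase the same line. Reading /var/log/kern.log usually needs elevated privileges, so the sketch works on text that has already been read.

```python
import re

# Matches the cgroup-specific OOM line, e.g.
# "Memory cgroup out of memory: Kill process 16369 (python) score 999 ..."
# "Killed" is accepted as well, as an assumption about other kernel versions.
OOM_RE = re.compile(
    r"Memory cgroup out of memory: Kill(?:ed)? process (\d+) \(([^)]+)\)"
)


def cgroup_oom_victims(log_text):
    """Return (pid, command name) pairs for cgroup OOM kills found in log_text."""
    return [(int(pid), name) for pid, name in OOM_RE.findall(log_text)]


# Sample line copied from the log above.
sample = (
    "Jan 21 17:28:27 server-hostname kernel: [5826415.492873] "
    "Memory cgroup out of memory: Kill process 16369 (python) "
    "score 999 or sacrifice child\n"
)

print(cgroup_oom_victims(sample))
```

Matching on the "Memory cgroup out of memory" prefix rather than on "Killed process" alone is deliberate: the latter also appears for system-wide OOM kills that have nothing to do with the cgroup limit.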

Answer 2

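A few years ago I ran a series of experiments to answer the same question. They showed that the exit code of a process killed this way was always 137, that is, 128 + 9, where 128 is the offset that POSIX-style shells add for a command terminated by a signal and 9 is the integer code of SIGKILL (the kill signal). Unfortunately, I could not find a way to confirm that it really was a SIGKILL and not just an exit code the user's program reported itself via exit(137) / return 137.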
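
One way to close that gap, offered here as a sketch and not as part of the original experiments, is to inspect the wait status directly instead of the shell-style exit code. Python's subprocess module reports a child terminated by signal N as returncode -N, while a child that calls exit(137) itself yields returncode 137. The helper name describe_exit is hypothetical, and the example assumes a POSIX system and that the launcher waits on the command directly (not through an intermediate shell, which would fold both cases into 137).

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Classify a subprocess-style return code (POSIX semantics)."""
    if returncode < 0:
        # A negative value means the child was terminated by signal -returncode.
        return f"killed by {signal.Signals(-returncode).name}"
    return f"exited normally with status {returncode}"


# A child that reports 137 on its own.
exited = subprocess.run([sys.executable, "-c", "import sys; sys.exit(137)"])

# A child that is really terminated by SIGKILL (it kills itself for the demo).
killed = subprocess.run(
    [sys.executable, "-c",
     "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"]
)

print(describe_exit(exited.returncode))
print(describe_exit(killed.returncode))
```

A SIGKILL on its own still does not prove that the cgroup limit was the cause, since any sufficiently privileged process can send that signal. It only rules out the exit(137) case, so it is best combined with another source of evidence such as the kernel log.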
