I'm dealing with a memory issue affecting hundreds of machines in AWS. They are i3.4xl instances with 120GB of RAM. A Java database server running inside a Docker container consumes most of the RAM, but the reported "used" memory metric is far higher than what that process actually uses.
~$ free -m
              total        used        free      shared  buff/cache   available
Mem:         122878      105687       11608        1221        5583       14285
Swap:             0           0           0
Here is a snapshot from top. Of the 108GB used, the database accounts for only 77GB.
top - 18:21:25 up 310 days, 15:48, 1 user, load average: 23.78, 20.90, 24.30
Tasks: 284 total, 2 running, 282 sleeping, 0 stopped, 0 zombie
%Cpu(s): 10.4 us, 3.9 sy, 0.0 ni, 83.5 id, 0.2 wa, 0.0 hi, 0.9 si, 1.1 st
KiB Mem : 12582788+total, 7280872 free, 10839378+used, 10153232 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 14414696 avail Mem
   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 45338 root      20   0 88.304g 0.077t  24460 S 396.7 65.9   4962:44 java
  1353 consul    20   0   53784  30068      0 S   1.3  0.0  10030:05 consul
 82080 root      24   4  979740  46128   8548 S   1.3  0.0   6:46.95 aws
  2941 dd-agent  20   0  194848  23548   3068 S   1.0  0.0   1293:05 python
    83 root      20   0       0      0      0 S   0.3  0.0 290:30.49 ksoftirqd/15
   503 root      20   0  147352  98228  87492 S   0.3  0.1 994:49.08 systemd-journal
   964 root      20   0       0      0      0 S   0.3  0.0   1031:29 xfsaild/nvme0n1
  1405 root      20   0 1628420  48796  16588 S   0.3  0.0 533:50.58 dockerd
  2963 dd-agent  20   0 4184188 241520   1196 S   0.3  0.2 168:24.64 java
 28797 xray      20   0 3107132 236288   4724 S   0.3  0.2 150:04.44 xray
116185 root      20   0 1722788  13012   6348 S   0.3  0.0  53:54.38 amazon-ssm-agen
     1 root      20   0   38728   6144   3308 S   0.0  0.0   2:41.84 systemd
     2 root      20   0       0      0      0 S   0.0  0.0 399:59.14 kthreadd
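As a quick sanity check that the gap isn't just other user-space processes, the resident set sizes of everything can be summed roughly like this (rough, because RSS double-counts shared pages, so if anything it overestimates):

~# ps -eo rss= | awk '{sum += $1} END {printf "total RSS: %.1f GiB\n", sum/1024/1024}'   # ps reports rss in KiB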
And /proc/meminfo:
~# cat /proc/meminfo
MemTotal: 125827888 kB
MemFree: 5982300 kB
MemAvailable: 14354644 kB
Buffers: 2852 kB
Cached: 9269636 kB
SwapCached: 0 kB
Active: 86468892 kB
Inactive: 6778036 kB
Active(anon): 83977260 kB
Inactive(anon): 1259020 kB
Active(file): 2491632 kB
Inactive(file): 5519016 kB
Unevictable: 3660 kB
Mlocked: 3660 kB
SwapTotal: 0 kB
SwapFree: 0 kB
Dirty: 220968 kB
Writeback: 0 kB
AnonPages: 83978456 kB
Mapped: 182596 kB
Shmem: 1259060 kB
Slab: 2122036 kB
SReclaimable: 1131528 kB
SUnreclaim: 990508 kB
KernelStack: 48416 kB
PageTables: 183468 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 62913944 kB
Committed_AS: 89880700 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 0 kB
VmallocChunk: 0 kB
HardwareCorrupted: 0 kB
AnonHugePages: 28672 kB
CmaTotal: 0 kB
CmaFree: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 4868096 kB
DirectMap2M: 120961024 kB
DirectMap1G: 4194304 kB
I previously made a change to try to reclaim the slab cache more aggressively, by setting:
~# cat /proc/sys/vm/vfs_cache_pressure
1000
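(For completeness, the setting was applied with something along these lines; the sysctl.d filename here is just an example:)

~# sysctl -w vm.vfs_cache_pressure=1000
~# echo 'vm.vfs_cache_pressure = 1000' > /etc/sysctl.d/90-vfs-cache-pressure.conf   # example path, to persist across reboots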
Slab memory in /proc/meminfo used to be reported at 15GB+, but now stays around 2GB. Here is the slabtop output (added in an edit, taken some time after the drop_caches below, once memory had started filling up again):
Active / Total Objects (% used) : 7068193 / 7395845 (95.6%)
Active / Total Slabs (% used) : 158330 / 158330 (100.0%)
Active / Total Caches (% used) : 81 / 128 (63.3%)
Active / Total Size (% used) : 2121875.02K / 2188049.35K (97.0%)
Minimum / Average / Maximum Object : 0.01K / 0.29K / 8.00K
   OBJS  ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
1465206 1464982  99%    0.38K  35375       42    566000K mnt_cache
1360044 1360044 100%    0.19K  32383       42    259064K dentry
1175936 1107199  94%    0.03K   9187      128     36748K kmalloc-32
1056042 1055815  99%    0.10K  27078       39    108312K buffer_head
 732672  727789  99%    1.06K  24606       30    787392K xfs_inode
 462213  453665  98%    0.15K   8721       53     69768K xfs_ili
 333284  333250  99%    0.57K   6032       56    193024K radix_tree_node
 173056  117508  67%    0.06K   2704       64     10816K kmalloc-64
  90336   31039  34%    0.12K   1414       64     11312K kmalloc-128
  82656   23185  28%    0.19K   1972       42     15776K kmalloc-192
  58328   40629  69%    0.50K   1012       64     32384K kmalloc-512
  51476   51476 100%    0.12K    758       68      6064K kernfs_node_cache
  45440   15333  33%    0.25K    713       64     11408K kmalloc-256
  21250   21250 100%    0.05K    250       85      1000K ftrace_event_field
  20706   20298  98%    0.04K    203      102       812K ext4_extent_status
  19779   18103  91%    0.55K    347       57     11104K inode_cache
  18600   18600 100%    0.61K    363       52     11616K proc_inode_cache
  14800   13964  94%    0.20K    371       40      2968K vm_area_struct
  14176    6321  44%    1.00K    443       32     14176K kmalloc-1024
  12006   12006 100%    0.09K    261       46      1044K trace_event_file
  11776   11776 100%    0.01K     23      512        92K kmalloc-8
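If anyone wants to collect the same data, a non-interactive snapshot sorted by cache size can be taken with something like:

~# slabtop -o -s c | head -25   # -o = one-shot output, -s c = sort by cache size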
However, the slab cache still appears to be what is inflating used memory. I believe this because I can drop it (echo 2 frees reclaimable slab objects such as dentries and inodes):
~# echo 2 > /proc/sys/vm/drop_caches
and then you see this:
~# free -m
              total        used        free      shared  buff/cache   available
Mem:         122878       82880       36236        1245        3761       37815
Swap:             0           0           0
Dropping the slab caches freed more than 20GB of memory, even though /proc/meminfo was only showing about 2GB of Slab. Here is the new /proc/meminfo:
~# cat /proc/meminfo
MemTotal: 125827888 kB
MemFree: 34316592 kB
MemAvailable: 38394188 kB
Buffers: 6652 kB
Cached: 5726320 kB
SwapCached: 0 kB
Active: 85651988 kB
Inactive: 4084612 kB
Active(anon): 84007364 kB
Inactive(anon): 1283596 kB
Active(file): 1644624 kB
Inactive(file): 2801016 kB
Unevictable: 3660 kB
Mlocked: 3660 kB
SwapTotal: 0 kB
SwapFree: 0 kB
Dirty: 260096 kB
Writeback: 0 kB
AnonPages: 84008564 kB
Mapped: 194628 kB
Shmem: 1283636 kB
Slab: 601176 kB
SReclaimable: 401788 kB
SUnreclaim: 199388 kB
KernelStack: 48496 kB
PageTables: 183564 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 62913944 kB
Committed_AS: 89815920 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 0 kB
VmallocChunk: 0 kB
HardwareCorrupted: 0 kB
AnonHugePages: 28672 kB
CmaTotal: 0 kB
CmaFree: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 4868096 kB
DirectMap2M: 120961024 kB
DirectMap1G: 4194304 kB
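For what it's worth, here is a rough way to quantify the gap between "used" and what /proc/meminfo actually accounts for (this approximates how newer versions of free(1) compute "used"; the choice of fields to count as "accounted for" is my own):

~# awk '{v[$1]=$2} END {
    used  = v["MemTotal:"] - v["MemFree:"] - v["Buffers:"] - v["Cached:"] - v["SReclaimable:"]
    known = v["AnonPages:"] + v["SUnreclaim:"] + v["PageTables:"] + v["KernelStack:"]
    printf "used: %.0f MB, accounted for: %.0f MB, gap: %.0f MB\n", used/1024, known/1024, (used-known)/1024
  }' /proc/meminfo

With the numbers above, the gap works out to roughly 23GB before dropping caches and under 1GB afterwards, i.e. the memory that was freed does not show up in Slab or in any other /proc/meminfo field that I can see.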
So I guess my question is: why is this happening, and how can I keep the slab cache from contributing so much to used memory? It eventually causes the Java server to be OOM-killed, and it probably also prevents the page cache from being as effective as it could be. The Java server does touch a very large number (millions) of files on the filesystem (XFS), so that is probably related, but I don't understand why the reported metrics are inconsistent. Again, this affects hundreds of machines configured the same way.
Any help would be greatly appreciated, thanks!