Debugging a sudden spike in memory usage on Linux

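Memory usage on one of my Linux machines (Rocky Linux 8.9) suddenly spikes, and I'm looking for guidance on how to debug it. Below I show the state of the system during normal operation and while memory usage is at its peak.

The script I run in a loop to collect this data: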

#!/usr/bin/env bash
# Prints a memory snapshot every 10 seconds. Run as root so that
# slabtop can read /proc/slabinfo.
while true; do
    echo "timestamp: $(date)"
    echo "-----------"
    echo "Free:"
    free -h
    echo "-----------"
    echo "top 30 processes by RAM usage:"
    ps -eo pid,comm,%mem --sort=-%mem | head -n 30    # header line + 29 processes
    echo "-----------"
    echo "tmpfs:"
    df -h | grep tmpfs
    echo "-----------"
    echo "pressure:"
    cat /proc/pressure/memory
    echo "-----------"
    echo "processes count"
    ps -eo pid --sort=-%mem | wc -l    # also counts the header line
    echo "-----------"
    echo "Slab:"
    slabtop -o -s c | head -n15
    echo "-----------"
    echo "meminfo:"
    cat /proc/meminfo
    echo "-----------"
    sleep 10
done
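One extra check that could be added to a script like this is a figure for how much of MemTotal is not explained by the main /proc/meminfo counters (free memory, page cache, anonymous pages, slab and a few other kernel counters). The sketch below assumes that the listed fields are the relevant consumers on this kernel; the list is an illustrative choice, not a complete kernel accounting. Memory that kernel code takes directly from the page allocator without updating any of these counters would likely show up only in this remainder.

```shell
#!/usr/bin/env bash
# Sketch: how many kB of MemTotal are NOT explained by the main /proc/meminfo
# counters. ASSUMPTION: the fields summed below are the relevant consumers on
# this kernel; the list is illustrative, not a complete kernel accounting.
# Usage: unaccounted_kb [meminfo-file]   (defaults to /proc/meminfo)
unaccounted_kb() {
    local file="${1:-/proc/meminfo}"
    awk '
        # Drop the trailing colon so $1 is the bare field name.
        { sub(":", "", $1); v[$1] = $2 }
        END {
            # Free memory, page cache (Cached already includes Shmem),
            # anonymous memory and kernel-side counters; missing fields count as 0.
            acc  = v["MemFree"] + v["Buffers"] + v["Cached"]
            acc += v["AnonPages"] + v["Slab"] + v["KernelStack"]
            acc += v["PageTables"] + v["Percpu"] + v["VmallocUsed"] + v["Hugetlb"]
            printf "%.0f\n", v["MemTotal"] - acc
        }
    ' "$file"
}

if [ -r /proc/meminfo ]; then
    echo "unaccounted (kB): $(unaccounted_kb)"
fi
```

Applied to the two meminfo snapshots in this post, this arithmetic leaves 2,014,796 kB (about 1.9 GiB) unaccounted in the normal state and 94,559,712 kB (about 90 GiB) during the spike. That suggests the growth in "used" does not appear in any of these fields.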

Normal state:

-----------
timestamp: ...
-----------
Free:
              total        used        free      shared  buff/cache   available
Mem:          125Gi        10Gi        89Gi       673Mi        24Gi       112Gi
Swap:            0B          0B          0B
-----------
top 30 processes by RAM usage:
    PID COMMAND         %MEM
   5839 celery           1.2
   5711 celery           1.1
   5838 celery           1.1
2738431 tofu             0.6
2594888 terraform-provi  0.2
2731662 terraform-provi  0.2
   6037 agent            0.2
2593993 terraform        0.1
2654624 terraform        0.1
   3166 dockerd          0.1
2717064 terraform        0.1
2718714 terraform-provi  0.1
2746777 terraform-provi  0.1
2746604 terraform-provi  0.1
2746672 terraform-provi  0.1
2746840 terraform-provi  0.1
2746727 terraform-provi  0.1
2655252 terraform-provi  0.1
    874 systemd-journal  0.1
2655212 terraform-provi  0.1
2731044 terraform        0.1
2759128 opa              0.0
   6046 promtail         0.0
2757823 opa              0.0
   1144 firewalld        0.0
   6020 telegraf         0.0
2759141 terraform        0.0
   2361 otelopscol       0.0
   6038 process-agent    0.0
-----------
tmpfs:
devtmpfs         63G     0   63G   0% /dev
tmpfs            63G  1.4M   63G   1% /dev/shm
tmpfs            63G  673M   63G   2% /run
tmpfs            63G     0   63G   0% /sys/fs/cgroup
tmpfs            13G     0   13G   0% /run/user/1002
-----------
pressure:
some avg10=0.01 avg60=0.04 avg300=0.01 total=244286854
full avg10=0.01 avg60=0.04 avg300=0.01 total=145337516
-----------
processes count
490
-----------
Slab:
 Active / Total Objects (% used)    : 6416975 / 8694321 (73.8%)
 Active / Total Slabs (% used)      : 176631 / 176631 (100.0%)
 Active / Total Caches (% used)     : 168 / 237 (70.9%)
 Active / Total Size (% used)       : 1729357.80K / 2232416.67K (77.5%)
 Minimum / Average / Maximum Object : 0.01K / 0.26K / 10.00K

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
709140 643851  90%    1.06K  23638       30    756416K xfs_inode
1571346 471769  30%    0.19K  37413       42    299304K dentry
278112 257157  92%    1.00K   8691       32    278112K kmalloc-1k
314580 222788  70%    0.57K  11235       28    179760K radix_tree_node
687834 629395  91%    0.19K  16377       42    131016K xfs_ili
234432 111063  47%    0.50K   7326       32    117216K kmalloc-512
779392 701563  90%    0.06K  12178       64     48712K kmalloc-64
760064 697202  91%    0.06K  11876       64     47504K lsm_inode_cache
-----------
meminfo:
MemTotal:       131460012 kB
MemFree:        93953016 kB
MemAvailable:   118221500 kB
Buffers:             920 kB
Cached:         24721888 kB
SwapCached:            0 kB
Active:         19786296 kB
Inactive:       14707240 kB
Active(anon):     365780 kB
Inactive(anon): 10095120 kB
Active(file):   19420516 kB
Inactive(file):  4612120 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:             0 kB
SwapFree:              0 kB
Dirty:              2684 kB
Writeback:             0 kB
AnonPages:       8249688 kB
Mapped:          1123776 kB
Shmem:            690156 kB
KReclaimable:    1483052 kB
Slab:            2267680 kB
SReclaimable:    1483052 kB
SUnreclaim:       784628 kB
KernelStack:       25584 kB
PageTables:        45756 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    65730004 kB
Committed_AS:   17221104 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      149580 kB
VmallocChunk:          0 kB
Percpu:            31104 kB
HardwareCorrupted:     0 kB
AnonHugePages:   1837056 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
FileHugePages:         0 kB
FilePmdMapped:         0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:               0 kB
DirectMap4k:     5311288 kB
DirectMap2M:    125757440 kB
DirectMap1G:     5242880 kB

During the spike:

-----------
timestamp: ....
-----------
Free:
              total        used        free      shared  buff/cache   available
Mem:          125Gi        99Gi       1.0Gi       673Mi        24Gi        24Gi
Swap:            0B          0B          0B
-----------
top 30 processes by RAM usage:
    PID COMMAND         %MEM
   5839 celery           1.2
   5711 celery           1.1
   5838 celery           1.1
2738431 tofu             0.5
2594888 terraform-provi  0.5
2731662 terraform-provi  0.2
   6037 agent            0.2
2593993 terraform        0.1
2717064 terraform        0.1
2654624 terraform        0.1
   3166 dockerd          0.1
2718714 terraform-provi  0.1
2746777 terraform-provi  0.1
2746604 terraform-provi  0.1
2746672 terraform-provi  0.1
2746840 terraform-provi  0.1
2746727 terraform-provi  0.1
2655252 terraform-provi  0.1
    874 systemd-journal  0.1
2655212 terraform-provi  0.1
2731044 terraform        0.1
   6046 promtail         0.0
   1144 firewalld        0.0
   6020 telegraf         0.0
   2361 otelopscol       0.0
   6038 process-agent    0.0
   1184 tuned            0.0
2751882 opa              0.0
   6040 trace-agent      0.0
-----------
tmpfs:
devtmpfs         63G     0   63G   0% /dev
tmpfs            63G  1.4M   63G   1% /dev/shm
tmpfs            63G  673M   63G   2% /run
tmpfs            63G     0   63G   0% /sys/fs/cgroup
tmpfs            13G     0   13G   0% /run/user/1002
-----------
pressure:
some avg10=0.16 avg60=0.09 avg300=0.02 total=244227196
full avg10=0.16 avg60=0.09 avg300=0.02 total=145285775
-----------
processes count
496
-----------
Slab:
 Active / Total Objects (% used)    : 6381059 / 8697291 (73.4%)
 Active / Total Slabs (% used)      : 176611 / 176611 (100.0%)
 Active / Total Caches (% used)     : 168 / 237 (70.9%)
 Active / Total Size (% used)       : 1721743.55K / 2232651.09K (77.1%)
 Minimum / Average / Maximum Object : 0.01K / 0.26K / 10.00K

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
709380 644858  90%    1.06K  23646       30    756672K xfs_inode
1572774 462962  29%    0.19K  37447       42    299576K dentry
278336 256306  92%    1.00K   8698       32    278336K kmalloc-1k
314580 223186  70%    0.57K  11235       28    179760K radix_tree_node
687834 631037  91%    0.19K  16377       42    131016K xfs_ili
234432 112304  47%    0.50K   7326       32    117216K kmalloc-512
780544 704162  90%    0.06K  12196       64     48784K kmalloc-64
760640 692818  91%    0.06K  11885       64     47540K lsm_inode_cache
-----------
meminfo:
MemTotal:       131460012 kB
MemFree:         1098564 kB
MemAvailable:   25244524 kB
Buffers:             920 kB
Cached:         24607936 kB
SwapCached:            0 kB
Active:         19206988 kB
Inactive:       15545632 kB
Active(anon):     365772 kB
Inactive(anon): 10468576 kB
Active(file):   18841216 kB
Inactive(file):  5077056 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:             0 kB
SwapFree:              0 kB
Dirty:             13560 kB
Writeback:             0 kB
AnonPages:       8674264 kB
Mapped:          1116760 kB
Shmem:            690148 kB
KReclaimable:    1483084 kB
Slab:            2267544 kB
SReclaimable:    1483084 kB
SUnreclaim:       784460 kB
KernelStack:       24960 kB
PageTables:        45932 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    65730004 kB
Committed_AS:   18166532 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      149076 kB
VmallocChunk:          0 kB
Percpu:            31104 kB
HardwareCorrupted:     0 kB
AnonHugePages:   2279424 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
FileHugePages:         0 kB
FilePmdMapped:         0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:               0 kB
DirectMap4k:     5311288 kB
DirectMap2M:    125757440 kB
DirectMap1G:     5242880 kB
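Rather than comparing the two meminfo dumps above by eye, a sketch like the following prints the change in every field between a saved "normal" snapshot and a saved "spike" snapshot, largest absolute change first. The snapshot file names in the comments are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: per-field change between two saved copies of /proc/meminfo,
# largest absolute change first. Output lines look like "FieldName delta".
# Hypothetical usage:
#   cat /proc/meminfo > meminfo.normal   # while the system is healthy
#   cat /proc/meminfo > meminfo.spike    # while memory usage is high
#   meminfo_delta meminfo.normal meminfo.spike | head
meminfo_delta() {
    local before="$1" after="$2"
    # First awk: remember every field of the first file, then print
    # (after - before) for each field that also appears in the second file.
    # Second awk + sort + cut: order the lines by absolute delta, descending.
    awk '
        NR == FNR { sub(":", "", $1); old[$1] = $2; next }
        { sub(":", "", $1); if ($1 in old) printf "%s %.0f\n", $1, $2 - old[$1] }
    ' "$before" "$after" |
        awk '{ d = ($2 < 0) ? -$2 : $2; printf "%.0f %s\n", d, $0 }' |
        sort -rn |
        cut -d' ' -f2-
}
```

On the two snapshots above, only MemFree and MemAvailable change by more than about 1 GB (each drops by roughly 89 GiB), while AnonPages, Cached and Slab barely move.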

Question:

What tools or techniques can I use to further diagnose this sudden increase in memory usage?
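A quick cross-check is to add up the resident memory of every process and compare it with the "used" column of free. The sketch below reads VmRSS from /proc/<pid>/status, which is reported in kB; `ps -e -o rss=` gives the same per-process numbers. Because RSS counts shared and file-backed pages once per process that maps them, the total tends to overstate process memory, so if it still stays far below "used", the missing memory is probably not held by ordinary user-space processes (including those inside containers, which also appear here).

```shell
#!/usr/bin/env bash
# Sketch: total resident set size (RSS) of all processes, in kB.
# Kernel threads have no VmRSS line, so they contribute nothing.
# Processes that exit during the scan are skipped (errors are discarded).
total_rss_kb() {
    cat /proc/[0-9]*/status 2>/dev/null |
        awk '$1 == "VmRSS:" { sum += $2 } END { printf "%.0f\n", sum }'
}

echo "total RSS of all processes (kB): $(total_rss_kb)"
```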

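Additional information:

The processes listed in the "top 30 processes by RAM usage:" section, such as terraform, ..., run inside Docker containers. All containers have memory limits, and the average number of containers is the same in the normal state and during the spike, so I don't think the containers are the problem.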
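To confirm that the containers stay within their share, one option is to sum the memory their cgroups are charged with, which (unlike the %MEM column of ps) also includes page cache charged to each container. This sketch assumes cgroup v1, which the tmpfs mounted on /sys/fs/cgroup in the output above suggests, and Docker's cgroupfs driver, so that containers live under /sys/fs/cgroup/memory/docker/<id>/; with the systemd driver the directories are typically under /sys/fs/cgroup/memory/system.slice/ instead, and the root can be passed as an argument.

```shell
#!/usr/bin/env bash
# Sketch: sum memory.usage_in_bytes over all per-container cgroups (cgroup v1).
# ASSUMPTION: default root /sys/fs/cgroup/memory/docker (cgroupfs driver);
# pass a different root directory as the first argument if needed.
container_mem_bytes() {
    local root="${1:-/sys/fs/cgroup/memory/docker}"
    local f total=0
    for f in "$root"/*/memory.usage_in_bytes; do
        [ -r "$f" ] || continue    # glob matched nothing, or file unreadable
        total=$(( total + $(cat "$f") ))
    done
    echo "$total"
}

echo "memory charged to containers (bytes): $(container_mem_bytes)"
```

Where the Docker CLI is available, `docker stats --no-stream` shows a comparable per-container MEM USAGE / LIMIT column without reading the cgroup tree directly.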


Summary:

Using tools such as htop, ps and meminfo, I have not been able to find what is consuming the memory, and I would appreciate help.

Thanks!

相关内容