My Ubuntu 20.04 server has started complaining:
[138070.784987] watchdog: BUG: soft lockup - CPU#53 stuck for 22s! [kswapd4:543]
[138070.784999] Modules linked in: ufs qnx4 hfsplus hfs minix ntfs msdos jfs xfs cpuid rpcsec_gss_krb5 nfsv4 nfs fscache vboxnetadp(OE) vboxnetflt(OE) vboxdrv(OE) zstd z3fold binfmt_misc zfs(PO) zunicode(PO) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) dm_multipath scsi_dh_rdac zlua(PO) scsi_dh_emc scsi_dh_alua dcdbas ipmi_ssif joydev input_leds amd64_edac_mod edac_mce_amd kvm_amd ccp kvm serio_raw ipmi_si fam15h_power ipmi_devintf k10temp ipmi_msghandler mac_hid acpi_power_meter sch_fq_codel nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables x_tables autofs4 btrfs zstd_compress nls_iso8859_1 dm_crypt raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear mgag200 drm_vram_helper i2c_algo_bit ttm crct10dif_pclmul crc32_pclmul drm_kms_helper ghash_clmulni_intel syscopyarea hid_generic sysfillrect aesni_intel sysimgblt usbhid crypto_simd uas fb_sys_fops cryptd usb_storage psmouse hid megaraid_sas drm glue_helper
[138070.785082] i2c_piix4 bnx2
[138070.785089] CPU: 53 PID: 543 Comm: kswapd4 Tainted: P OEL 5.4.0-58-generic #64-Ubuntu
[138070.785091] Hardware name: Dell Inc. PowerEdge R815/04Y8PT, BIOS 3.4.1 05/04/2018
[138070.785100] RIP: 0010:_raw_spin_trylock+0x24/0x30
[138070.785105] Code: c3 0f 1f 44 00 00 0f 1f 44 00 00 55 48 89 e5 8b 07 85 c0 75 12 ba 01 00 00 00 f0 0f b1 17 75 07 b8 01 00 00 00 5d c3 31 c0 5d <c3> 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 8b
[138070.785107] RSP: 0018:ffffa0354d7e78b0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
[138070.785111] RAX: 0000000000000000 RBX: ffff8b805d5749c0 RCX: 0000000000000014
[138070.785113] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffffc00540766460
[138070.785114] RBP: ffffa0354d7e7928 R08: 0000000000d80000 R09: ffffc00540766460
[138070.785116] R10: 000000000000001e R11: 0000000000000001 R12: 0000000000000001
[138070.785118] R13: ffff8b805d5749c8 R14: ffffc00540766450 R15: 0000000000000000
[138070.785121] FS: 0000000000000000(0000) GS:ffff8ba05fb40000(0000) knlGS:0000000000000000
[138070.785123] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[138070.785124] CR2: 000056387e7d72e8 CR3: 000000606600a000 CR4: 00000000000406e0
[138070.785126] Call Trace:
[138070.785137] ? z3fold_alloc+0xe0/0x920 [z3fold]
[138070.785144] z3fold_zpool_malloc+0xe/0x10 [z3fold]
[138070.785149] zpool_malloc+0x1c/0x20
[138070.785155] zswap_frontswap_store+0x388/0x5ef
[138070.785159] __frontswap_store+0x73/0x100
[138070.785162] swap_writepage+0x4b/0x90
[138070.785166] shmem_writepage+0x1a9/0x300
[138070.785172] pageout.isra.0+0x11e/0x350
[138070.785175] shrink_page_list+0x95b/0xbb0
[138070.785179] shrink_inactive_list+0x201/0x3e0
[138070.785183] shrink_node_memcg+0x137/0x370
[138070.785188] ? ip_mc_init_dev+0x50/0xb0
[138070.785192] ? __switch_to_asm+0x40/0x70
[138070.785196] ? __switch_to_asm+0x40/0x70
[138070.785199] shrink_node+0xbd/0x410
[138070.785203] balance_pgdat+0x319/0x590
[138070.785207] kswapd+0x1f8/0x3c0
[138070.785211] ? wait_woken+0x80/0x80
[138070.785215] kthread+0x104/0x140
[138070.785217] ? balance_pgdat+0x590/0x590
[138070.785220] ? kthread_park+0x90/0x90
[138070.785224] ret_from_fork+0x22/0x40
Some things still run fine, but it seems that as soon as a process needs something from /proc (e.g. /proc/swaps or /proc/locks), that process locks up.
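A minimal sketch of how such a hang can be probed without freezing the shell doing the probing. The helper name probe_read is hypothetical. It starts the read in the background and only reports whether the read came back within a few seconds. It deliberately does not use timeout(1), on the assumption that a reader stuck in uninterruptible sleep would not react to the signal either.

```shell
#!/bin/sh
# probe_read FILE [SECONDS] (hypothetical helper)
# Start a read of FILE in the background and report whether it returned
# within SECONDS (default 3). The calling shell never reads FILE itself.
probe_read() {
    file="$1"
    wait_s="${2:-3}"
    marker=$(mktemp) || return 2
    # The subshell writes the marker only after cat has returned.
    # Its own stdout/stderr go to /dev/null so callers capturing our
    # output are not kept waiting by a blocked reader.
    ( cat "$file" > /dev/null 2>&1; echo "$?" > "$marker" ) > /dev/null 2>&1 &
    i=0
    while [ "$i" -lt "$wait_s" ] && [ ! -s "$marker" ]; do
        sleep 1
        i=$((i + 1))
    done
    if [ -s "$marker" ]; then
        echo "$file: read returned (cat exit status $(cat "$marker"))"
        rm -f "$marker"
        return 0
    fi
    # The background reader is left behind; it cannot be waited for safely.
    echo "$file: still blocked after ${wait_s}s"
    return 1
}

# Example: probe_read /proc/swaps; probe_read /proc/locks 5
```

A blocked probe leaves one stuck cat process and one empty marker file behind, which seems an acceptable price for keeping the interactive shell usable.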
The z3fold_alloc frame in the call trace makes me think the machine is running low on memory. To cope with tight memory, the server runs:
echo 1 > /sys/module/zswap/parameters/enabled
echo z3fold > /sys/module/zswap/parameters/zpool
echo 50 > /sys/module/zswap/parameters/max_pool_percent
echo zstd > /sys/module/zswap/parameters/compressor
echo 2 > /proc/sys/vm/overcommit_memory
echo 100 > /proc/sys/vm/overcommit_ratio
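To confirm that these writes actually took effect, the values can be read back. The following is only a sketch: show_params is a hypothetical helper that prints every readable file in a directory as name=value, and the directory passed in the example is the same sysfs path used in the commands above.

```shell
#!/bin/sh
# show_params DIR (hypothetical helper)
# Print "name=value" for every readable regular file in DIR.
show_params() {
    dir="$1"
    for f in "$dir"/*; do
        # Skip subdirectories and anything we are not allowed to read.
        [ -f "$f" ] && [ -r "$f" ] || continue
        printf '%s=%s\n' "$(basename "$f")" "$(cat "$f")"
    done
}

# Example: show_params /sys/module/zswap/parameters
```

Run against /sys/module/zswap/parameters, this should list enabled, zpool, max_pool_percent and compressor alongside whatever other parameters the running kernel exposes.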
But there is still plenty of memory available:
$ free -g
              total        used        free      shared  buff/cache   available
Mem:            503         243          67          16         192         240
Swap:           296         261          34
Is there any way to avoid a reboot? For example, by taking CPU#53 offline?
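By "taking offline" I mean the kernel's CPU hotplug control file under /sys/devices/system/cpu. A minimal sketch of that idea follows. The function name offline_cpu and the SYSFS_CPU_ROOT override are hypothetical and exist only so the logic can be tried against a scratch directory. Whether the write completes at all while the CPU is stuck in a soft lockup is exactly what I do not know, so the write itself may block.

```shell
#!/bin/sh
# Root of the per-CPU sysfs directories. Overridable only for dry runs
# against a scratch directory (hypothetical convenience, not a kernel feature).
SYSFS_CPU_ROOT="${SYSFS_CPU_ROOT:-/sys/devices/system/cpu}"

# offline_cpu N (hypothetical helper)
# Ask the kernel to take CPU N offline by writing 0 to its "online" file,
# then print the value read back. Needs root on a real system.
offline_cpu() {
    cpu="$1"
    ctl="$SYSFS_CPU_ROOT/cpu$cpu/online"
    if [ ! -w "$ctl" ]; then
        # Some CPUs (often cpu0) expose no online control at all.
        echo "cpu$cpu: no writable online control at $ctl" >&2
        return 1
    fi
    echo 0 > "$ctl" && echo "cpu$cpu: online=$(cat "$ctl")"
}

# Example (as root): offline_cpu 53
```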
$ uname -a
Linux r815 5.4.0-58-generic #64-Ubuntu SMP Wed Dec 9 08:16:25 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 48 bits physical, 48 bits virtual
CPU(s): 64
On-line CPU(s) list: 0-63
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 4
NUMA node(s): 8
Vendor ID: AuthenticAMD
CPU family: 21
Model: 2
Model name: AMD Opteron(tm) Processor 6376
Stepping: 0
CPU MHz: 1399.921
BogoMIPS: 4599.74
Virtualization: AMD-V
L1d cache: 512 KiB
L1i cache: 2 MiB
L2 cache: 64 MiB
L3 cache: 48 MiB
NUMA node0 CPU(s): 0-7
NUMA node1 CPU(s): 8-15
NUMA node2 CPU(s): 32-39
NUMA node3 CPU(s): 40-47
NUMA node4 CPU(s): 48-55
NUMA node5 CPU(s): 56-63
NUMA node6 CPU(s): 16-23
NUMA node7 CPU(s): 24-31
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer
sanitization
Vulnerability Spectre v2: Mitigation; Full AMD retpoline, IBPB conditional, STIBP
disabled, RSB filling
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx
mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good
nopl nonstop_tsc cpuid extd_apicid amd_dcm aperfmperf pni
pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 popcnt
aes xsave avx f16c lahf_lm cmp_legacy svm extapic cr8_legacy
abm sse4a misalignsse 3dnowprefetch osvw ibs xop skinit
wdt fma4 tce nodeid_msr tbm topoext perfctr_core perfctr_nb
cpb hw_pstate ssbd ibpb vmmcall bmi1 arat npt lbrv
svm_lock nrip_save tsc_scale vmcb_clean flushbyasid
decodeassists pausefilter pfthreshold