watchdog: BUG: soft lockup - CPU#53 stuck for 22s!

My Ubuntu 20.04 server has started complaining:

[138070.784987] watchdog: BUG: soft lockup - CPU#53 stuck for 22s! [kswapd4:543]
[138070.784999] Modules linked in: ufs qnx4 hfsplus hfs minix ntfs msdos jfs xfs cpuid rpcsec_gss_krb5 nfsv4 nfs fscache vboxnetadp(OE) vboxnetflt(OE) vboxdrv(OE) zstd z3fold binfmt_misc zfs(PO) zunicode(PO) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) dm_multipath scsi_dh_rdac zlua(PO) scsi_dh_emc scsi_dh_alua dcdbas ipmi_ssif joydev input_leds amd64_edac_mod edac_mce_amd kvm_amd ccp kvm serio_raw ipmi_si fam15h_power ipmi_devintf k10temp ipmi_msghandler mac_hid acpi_power_meter sch_fq_codel nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables x_tables autofs4 btrfs zstd_compress nls_iso8859_1 dm_crypt raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear mgag200 drm_vram_helper i2c_algo_bit ttm crct10dif_pclmul crc32_pclmul drm_kms_helper ghash_clmulni_intel syscopyarea hid_generic sysfillrect aesni_intel sysimgblt usbhid crypto_simd uas fb_sys_fops cryptd usb_storage psmouse hid megaraid_sas drm glue_helper
[138070.785082]  i2c_piix4 bnx2
[138070.785089] CPU: 53 PID: 543 Comm: kswapd4 Tainted: P           OEL    5.4.0-58-generic #64-Ubuntu
[138070.785091] Hardware name: Dell Inc. PowerEdge R815/04Y8PT, BIOS 3.4.1 05/04/2018
[138070.785100] RIP: 0010:_raw_spin_trylock+0x24/0x30
[138070.785105] Code: c3 0f 1f 44 00 00 0f 1f 44 00 00 55 48 89 e5 8b 07 85 c0 75 12 ba 01 00 00 00 f0 0f b1 17 75 07 b8 01 00 00 00 5d c3 31 c0 5d <c3> 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 8b
[138070.785107] RSP: 0018:ffffa0354d7e78b0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
[138070.785111] RAX: 0000000000000000 RBX: ffff8b805d5749c0 RCX: 0000000000000014
[138070.785113] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffffc00540766460
[138070.785114] RBP: ffffa0354d7e7928 R08: 0000000000d80000 R09: ffffc00540766460
[138070.785116] R10: 000000000000001e R11: 0000000000000001 R12: 0000000000000001
[138070.785118] R13: ffff8b805d5749c8 R14: ffffc00540766450 R15: 0000000000000000
[138070.785121] FS:  0000000000000000(0000) GS:ffff8ba05fb40000(0000) knlGS:0000000000000000
[138070.785123] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[138070.785124] CR2: 000056387e7d72e8 CR3: 000000606600a000 CR4: 00000000000406e0
[138070.785126] Call Trace:
[138070.785137]  ? z3fold_alloc+0xe0/0x920 [z3fold]
[138070.785144]  z3fold_zpool_malloc+0xe/0x10 [z3fold]
[138070.785149]  zpool_malloc+0x1c/0x20
[138070.785155]  zswap_frontswap_store+0x388/0x5ef
[138070.785159]  __frontswap_store+0x73/0x100
[138070.785162]  swap_writepage+0x4b/0x90
[138070.785166]  shmem_writepage+0x1a9/0x300
[138070.785172]  pageout.isra.0+0x11e/0x350
[138070.785175]  shrink_page_list+0x95b/0xbb0
[138070.785179]  shrink_inactive_list+0x201/0x3e0
[138070.785183]  shrink_node_memcg+0x137/0x370
[138070.785188]  ? ip_mc_init_dev+0x50/0xb0
[138070.785192]  ? __switch_to_asm+0x40/0x70
[138070.785196]  ? __switch_to_asm+0x40/0x70
[138070.785199]  shrink_node+0xbd/0x410
[138070.785203]  balance_pgdat+0x319/0x590
[138070.785207]  kswapd+0x1f8/0x3c0
[138070.785211]  ? wait_woken+0x80/0x80
[138070.785215]  kthread+0x104/0x140
[138070.785217]  ? balance_pgdat+0x590/0x590
[138070.785220]  ? kthread_park+0x90/0x90
[138070.785224]  ret_from_fork+0x22/0x40

Some things still run fine, but it seems that as soon as a process needs something from /proc (e.g. /proc/swaps or /proc/locks), that process locks up.

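The z3fold_alloc frame in the call trace makes me think the machine is out of memory. To cope with tight-memory situations, the server runs: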

echo 1 > /sys/module/zswap/parameters/enabled
echo z3fold > /sys/module/zswap/parameters/zpool
echo 50 > /sys/module/zswap/parameters/max_pool_percent
echo zstd > /sys/module/zswap/parameters/compressor
echo 2 > /proc/sys/vm/overcommit_memory
echo 100 > /proc/sys/vm/overcommit_ratio
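
To confirm that the zswap settings actually took effect, the current values can be read back from the same sysfs directory. This is a minimal sketch, assuming the parameters live under /sys/module/zswap/parameters as in the commands above. The function name show_zswap_params and its optional directory argument are hypothetical, added only so the helper can be pointed at another directory.

```shell
# Print each zswap module parameter as "name = value".
# show_zswap_params is a hypothetical helper; the default directory is the
# one written to by the echo commands above.
show_zswap_params() {
    dir="${1:-/sys/module/zswap/parameters}"
    for f in "$dir"/*; do
        [ -f "$f" ] || continue          # skip if the glob matched nothing
        printf '%s = %s\n' "$(basename "$f")" "$(cat "$f")"
    done
}
```

Calling show_zswap_params with no argument reads the live zswap settings; some parameters may need root to read.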

But there is plenty of memory available:

$ free -g
              total        used        free      shared  buff/cache   available
Mem:            503         243          67          16         192         240
Swap:           296         261          34

Is there any way to avoid a reboot? For example, by kicking CPU#53 offline?
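
For reference, Linux exposes CPU hotplug through /sys/devices/system/cpu/cpuN/online, and writing 0 there asks the kernel to take that CPU offline. The sketch below only prints the command by default (a dry run). The function name offline_cpu and the DRY_RUN variable are hypothetical. Whether the kernel can actually offline a CPU that is already stuck in a soft lockup is an open question here, not something this example demonstrates.

```shell
# offline_cpu N: ask the kernel to take CPU N offline via sysfs hotplug.
# Hypothetical helper; with DRY_RUN=1 (the default) it only prints the command.
offline_cpu() {
    cpu="$1"
    case "$cpu" in
        ''|*[!0-9]*) echo "usage: offline_cpu <cpu-number>" >&2; return 2 ;;
    esac
    path="/sys/devices/system/cpu/cpu${cpu}/online"
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "echo 0 > $path"
    else
        echo 0 > "$path"     # needs root; cpu0 often has no 'online' file
    fi
}
```

Running offline_cpu 53 prints the command that would be executed; DRY_RUN=0 offline_cpu 53 (as root) would attempt the write.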

$ uname -a
Linux r815 5.4.0-58-generic #64-Ubuntu SMP Wed Dec 9 08:16:25 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
$ lscpu
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   48 bits physical, 48 bits virtual
CPU(s):                          64
On-line CPU(s) list:             0-63
Thread(s) per core:              2
Core(s) per socket:              8
Socket(s):                       4
NUMA node(s):                    8
Vendor ID:                       AuthenticAMD
CPU family:                      21
Model:                           2
Model name:                      AMD Opteron(tm) Processor 6376
Stepping:                        0
CPU MHz:                         1399.921
BogoMIPS:                        4599.74
Virtualization:                  AMD-V
L1d cache:                       512 KiB
L1i cache:                       2 MiB
L2 cache:                        64 MiB
L3 cache:                        48 MiB
NUMA node0 CPU(s):               0-7
NUMA node1 CPU(s):               8-15
NUMA node2 CPU(s):               32-39
NUMA node3 CPU(s):               40-47
NUMA node4 CPU(s):               48-55
NUMA node5 CPU(s):               56-63
NUMA node6 CPU(s):               16-23
NUMA node7 CPU(s):               24-31
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl 
                                 and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer 
                                 sanitization
Vulnerability Spectre v2:        Mitigation; Full AMD retpoline, IBPB conditional, STIBP 
                                 disabled, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
                                  cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx 
                                 mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good 
                                 nopl nonstop_tsc cpuid extd_apicid amd_dcm aperfmperf pn
                                 i pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 popcnt 
                                 aes xsave avx f16c lahf_lm cmp_legacy svm extapic cr8_le
                                 gacy abm sse4a misalignsse 3dnowprefetch osvw ibs xop sk
                                 init wdt fma4 tce nodeid_msr tbm topoext perfctr_core pe
                                 rfctr_nb cpb hw_pstate ssbd ibpb vmmcall bmi1 arat npt l
                                 brv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid 
                                 decodeassists pausefilter pfthreshold
