I have tried every solution I could find on Google... but I can't find the reason my server keeps crashing...
Aug 5 17:11:08 kernel: [ 2300.084576] watchdog: BUG: soft lockup - CPU#6 stuck for 23s! [VM Thread:4054]
Aug 5 17:11:08 kernel: [ 2300.084578] Modules linked in: veth nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo br_netfilter bridge stp llc rpcsec_gss_krb5 auth_rpcgss aufs nfsv4 nfs lockd grace fscache overlay isofs xt_nat xt_MASQUERADE xt_addrtype iptable_nat nf_nat xt_tcpudp xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter bpfilter nls_iso8859_1 dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua ppdev kvm_intel kvm ipmi_si input_leds joydev ipmi_devintf ipmi_msghandler video parport_pc parport acpi_pad sch_fq_codel drm sunrpc ip_tables x_tables autofs4 raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid0 multipath linear raid1 crct10dif_pclmul crc32_pclmul ghash_clmulni_intel hid_generic aesni_intel crypto_simd cryptd glue_helper usbhid igb hid nvme dca ahci i2c_algo_bit nvme_core libahci
Aug 5 17:11:08 kernel: [ 2300.084616] CPU: 6 PID: 4054 Comm: VM Thread Not tainted 5.4.0-42-generic #46-Ubuntu
Aug 5 17:11:08 kernel: [ 2300.084616] Hardware name: Intel Corporation S1200SP/S1200SP, BIOS S1200SP.86B.03.01.0042.013020190050 01/30/2019
Aug 5 17:11:08 kernel: [ 2300.084620] RIP: 0010:_raw_spin_lock+0x10/0x30
Aug 5 17:11:08 kernel: [ 2300.084621] Code: ff 01 00 00 75 07 4c 89 e0 41 5c 5d c3 e8 f8 f9 62 ff 4c 89 e0 41 5c 5d c3 90 0f 1f 44 00 00 31 c0 ba 01 00 00 00 f0 0f b1 17 <75> 01 c3 55 89 c6 48 89 e5 e8 c2 e1 62 ff 66 90 5d c3 66 66 2e 0f
Aug 5 17:11:08 kernel: [ 2300.084621] RSP: 0000:ffffa592c1bef760 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
Aug 5 17:11:08 kernel: [ 2300.084622] RAX: 0000000000000000 RBX: 0000000000000100 RCX: ffff95314b79bc00
Aug 5 17:11:08 kernel: [ 2300.084622] RDX: 0000000000000001 RSI: 0000000000000588 RDI: ffff953145c1aeac
Aug 5 17:11:08 kernel: [ 2300.084623] RBP: ffffa592c1bef7b8 R08: ffff95314a5520f0 R09: 0000000000000000
Aug 5 17:11:08 kernel: [ 2300.084623] R10: 0000000000000000 R11: ffffffffffffffb8 R12: 0000000000000000
Aug 5 17:11:08 kernel: [ 2300.084623] R13: ffff953145c1ae00 R14: ffff95314b79bc00 R15: ffff953145c1aeac
Aug 5 17:11:08 kernel: [ 2300.084624] FS: 00007fa0e4151700(0000) GS:ffff953151580000(0000) knlGS:0000000000000000
Aug 5 17:11:08 kernel: [ 2300.084624] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 5 17:11:08 kernel: [ 2300.084625] CR2: 0000000594832008 CR3: 000000045bf00003 CR4: 00000000003606e0
Aug 5 17:11:08 kernel: [ 2300.084625] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Aug 5 17:11:08 kernel: [ 2300.084625] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Aug 5 17:11:08 kernel: [ 2300.084626] Call Trace:
Aug 5 17:11:08 kernel: [ 2300.084628] ? scan_swap_map_slots+0x3cd/0x510
Aug 5 17:11:08 kernel: [ 2300.084629] get_swap_pages+0x207/0x380
Aug 5 17:11:08 kernel: [ 2300.084630] ? rmap_walk_anon+0x16f/0x260
Aug 5 17:11:08 kernel: [ 2300.084632] get_swap_page+0xe3/0x210
Aug 5 17:11:08 kernel: [ 2300.084633] add_to_swap+0x1a/0x70
Aug 5 17:11:08 kernel: [ 2300.084634] shrink_page_list+0x4b3/0xbb0
Aug 5 17:11:08 kernel: [ 2300.084648] shrink_inactive_list+0x201/0x3e0
Aug 5 17:11:08 kernel: [ 2300.084649] shrink_node_memcg+0x137/0x370
Aug 5 17:11:08 kernel: [ 2300.084650] shrink_node+0xbd/0x400
Aug 5 17:11:08 kernel: [ 2300.084650] do_try_to_free_pages+0xd7/0x3a0
Aug 5 17:11:08 kernel: [ 2300.084651] try_to_free_mem_cgroup_pages+0xf4/0x210
Aug 5 17:11:08 kernel: [ 2300.084653] try_charge+0x2eb/0x810
Aug 5 17:11:08 kernel: [ 2300.084654] ? find_get_entry+0xaf/0x170
Aug 5 17:11:08 kernel: [ 2300.084655] mem_cgroup_try_charge+0x71/0x190
Aug 5 17:11:08 kernel: [ 2300.084656] ? pagecache_get_page+0x2d/0x300
Aug 5 17:11:08 kernel: [ 2300.084657] mem_cgroup_try_charge_delay+0x22/0x50
Aug 5 17:11:08 kernel: [ 2300.084658] do_swap_page+0x220/0x9f0
Aug 5 17:11:08 kernel: [ 2300.084659] __handle_mm_fault+0x73b/0x7a0
Aug 5 17:11:08 kernel: [ 2300.084659] handle_mm_fault+0xca/0x200
Aug 5 17:11:08 kernel: [ 2300.084661] do_user_addr_fault+0x1f9/0x450
Aug 5 17:11:08 kernel: [ 2300.084662] __do_page_fault+0x58/0x90
Aug 5 17:11:08 kernel: [ 2300.084663] do_page_fault+0x2c/0xe0
Aug 5 17:11:08 kernel: [ 2300.084664] page_fault+0x34/0x40
Aug 5 17:11:08 kernel: [ 2300.084665] RIP: 0033:0x7fa168646be3
Aug 5 17:11:08 kernel: [ 2300.084666] Code: 4c 89 6d b8 49 89 5d 00 49 c7 45 08 00 00 00 00 4c 3b 6d b0 0f 83 1d 01 00 00 4c 89 6d b0 49 89 dd 4d 39 fd 0f 83 bd 00 00 00 <49> 8b 45 00 4c 89 eb 83 e0 03 48 83 f8 03 0f 84 09 01 00 00 42 0f
Aug 5 17:11:08 kernel: [ 2300.084666] RSP: 002b:00007fa0e41501b0 EFLAGS: 00010283
Aug 5 17:11:08 kernel: [ 2300.084667] RAX: 00000005237c2908 RBX: 0000000000000004 RCX: 00007fa0e41504b0
Aug 5 17:11:08 kernel: [ 2300.084667] RDX: 0000000000000004 RSI: 0000000594831fe8 RDI: 00007fa160745850
Aug 5 17:11:08 kernel: [ 2300.084668] RBP: 00007fa0e4150230 R08: 00000005237c28e8 R09: 00007fa1607458f0
Aug 5 17:11:08 kernel: [ 2300.084668] R10: 00007fa168f52d99 R11: 000000014b7bf600 R12: 00007fa1609924d0
Aug 5 17:11:08 kernel: [ 2300.084668] R13: 0000000594832008 R14: 0000000000000240 R15: 0000000595000000
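I have replaced the hardware, and the disks as well.
When I start the Docker container running Minecraft (Pterodactyl), it sometimes freezes with the error above. I can't find any related logs...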
uname -a:
Linux X-X-X 5.4.0-42-generic #46-Ubuntu SMP Fri Jul 10 00:24:02 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
free -h:
              total        used        free      shared  buff/cache   available
Mem:           31Gi       594Mi        29Gi       4.0Mi       1.1Gi        30Gi
Swap:         1.0Gi          0B       1.0Gi
sysctl vm.swappiness:
vm.swappiness = 60
sudo lshw -C memory:
*-firmware
description: BIOS
vendor: Intel Corporation
physical id: 6
version: S1200SP.86B.03.01.0042.013020190050
date: 01/30/2019
size: 64KiB
capacity: 16MiB
capabilities: pci pnp upgrade shadowing cdboot bootselect edd int13floppy1200 int13floppy720 int13floppy2880 int5printscreen int9keyboard int14serial int17printer int10video acpi usb ls120boot zipboot biosbootspecification netboot uefi
*-cache:0
description: L1 cache
physical id: 1a
slot: L1 Cache
size: 128KiB
capacity: 128KiB
capabilities: synchronous internal write-through instruction
configuration: level=1
*-cache:1
description: L2 cache
physical id: 1b
slot: L2 Cache
size: 1MiB
capacity: 1MiB
capabilities: synchronous internal write-through unified
configuration: level=2
*-cache:2
description: L3 cache
physical id: 1c
slot: L3 Cache
size: 8MiB
capacity: 8MiB
capabilities: synchronous internal write-back unified
configuration: level=3
*-cache
description: L1 cache
physical id: 19
slot: L1 Cache
size: 128KiB
capacity: 128KiB
capabilities: synchronous internal write-through data
configuration: level=1
*-memory
description: System Memory
physical id: 1e
slot: System board or motherboard
size: 32GiB
*-bank:0
description: [empty]
vendor: Empty/NO DIMM
physical id: 0
slot: DIMM_A1
*-bank:1
description: DIMM DDR4 Synchronous 2400 MHz (0.4 ns)
product: KHX2400C15/16G
vendor: Kingston
physical id: 1
serial: A800F9241
slot: DIMM_A2
size: 16GiB
width: 64 bits
clock: 2400MHz (0.4ns)
*-bank:2
description: [empty]
vendor: Empty/NO DIMM
physical id: 2
slot: DIMM_B1
*-bank:3
description: DIMM DDR4 Synchronous 2400 MHz (0.4 ns)
product: KHX2400C15/16G
vendor: Kingston
physical id: 3
serial: BE305496
slot: DIMM_B2
size: 16GiB
width: 64 bits
clock: 2400MHz (0.4ns)
*-memory UNCLAIMED
description: Memory controller
product: 100 Series/C230 Series Chipset Family Power Management Controller
vendor: Intel Corporation
physical id: 1f.2
bus info: pci@0000:00:1f.2
version: 31
width: 32 bits
clock: 33MHz (30.3ns)
capabilities: bus_master
configuration: latency=0
resources: memory:a2f10000-a2f13fff
grep -i swap /etc/fstab:
UUID="X-X-X-X-X" swap swap defaults 0 0
UUID="X-X-X-X-X" swap swap defaults 0 0
/swapfile swap swap defaults 0 0
Any ideas?
Answer 1
There may be a swap/memory problem.
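To see how often the lockup occurs and which tasks are affected, the kernel log can be condensed to one line per report. The sketch below assumes the message format shown in the question; `summarize_lockups` is a hypothetical helper name, not an existing tool.

```shell
#!/bin/sh
# Hypothetical helper: condense "soft lockup" kernel messages read from stdin
# into "<cpu> <seconds> <task>" lines. Assumes the message format from the question.
summarize_lockups() {
    sed -n 's/.*soft lockup - CPU#\([0-9]*\) stuck for \([0-9]*\)s! \[\(.*\):[0-9]*].*/\1 \2 \3/p'
}

# Demo on two lines taken from the question's log; only the first is a lockup report.
summarize_lockups <<'EOF'
Aug 5 17:11:08 kernel: [ 2300.084576] watchdog: BUG: soft lockup - CPU#6 stuck for 23s! [VM Thread:4054]
Aug 5 17:11:08 kernel: [ 2300.084626] Call Trace:
EOF
```

On a live system the input could come from the kernel log instead, for example `journalctl -k | summarize_lockups | sort | uniq -c`.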
BIOS
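Your BIOS version is S1200SP.86B.03.01.0042.013020190050, dated January 30, 2019.
A newer BIOS, dated June 2020, is available for download.
Note: make a backup before updating the BIOS.
Memory test
Go to https://www.memtest86.com/ and download and run their free memtest to test your memory. Complete at least one full pass of all 4/4 tests to confirm the memory is good. This may take several hours.
Update #1:
As I suspected earlier... you have a swap problem.
You have three swap locations, as shown in /etc/fstab!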
UUID="X-X-X-X-X" swap swap defaults 0 0
UUID="X-X-X-X-X" swap swap defaults 0 0
/swapfile swap swap defaults 0 0
Run
sudo swapoff -a # turn off swap
then comment out all three of the lines above in /etc/fstab.
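Before editing /etc/fstab, it can help to preview the change. This is a minimal sketch, assuming the standard six-field fstab layout; `list_swap_entries` and `comment_out_swap` are illustrative helper names, and the real file is only read, never modified.

```shell
#!/bin/sh
# Illustrative helpers (not standard tools). Both default to /etc/fstab
# and only read the file.

# Print active (uncommented) entries whose filesystem type (field 3) is swap.
list_swap_entries() {
    awk '$1 !~ /^#/ && $3 == "swap"' "${1:-/etc/fstab}"
}

# Print the whole file with each active swap entry prefixed by "#".
comment_out_swap() {
    awk '$1 !~ /^#/ && $3 == "swap" { print "#" $0; next } { print }' "${1:-/etc/fstab}"
}

if [ -r /etc/fstab ]; then
    list_swap_entries /etc/fstab
    # Write a preview copy; review it before changing the real file.
    comment_out_swap /etc/fstab > /tmp/fstab.preview
fi
```

Comparing the two with `diff /etc/fstab /tmp/fstab.preview` shows exactly which lines would be commented out.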
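Disabling swap entirely is never a good idea, and too little swap is not good either. You have two problems here.
Let's create a properly sized /swapfile for your system.
Note: incorrect use of the dd command can cause data loss. Copy and paste the commands rather than typing them.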
sudo swapoff -a # turn off swap
sudo rm -i /swapfile # remove old /swapfile
sudo dd if=/dev/zero of=/swapfile bs=1M count=4096
sudo chmod 600 /swapfile # set proper file protections
sudo mkswap /swapfile # init /swapfile
sudo swapon /swapfile # turn on swap
free -h # confirm 32G RAM and 4G swap
Add this line to /etc/fstab...
/swapfile none swap sw 0 0
Then reboot the system and verify that it runs correctly.
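One way to verify the result after the reboot is to total the sizes the kernel reports in /proc/swaps. This is a minimal sketch; `swap_total_kib` is a hypothetical helper, and the column layout (one header line, size in KiB in the third column) is assumed from the usual /proc/swaps format.

```shell
#!/bin/sh
# Hypothetical helper: sum the Size column (KiB) of a /proc/swaps-style file.
# Assumes no whitespace in the swap file or device names.
swap_total_kib() {
    awk 'NR > 1 { total += $3 } END { print total + 0 }' "${1:-/proc/swaps}"
}

if [ -r /proc/swaps ]; then
    echo "active swap: $(swap_total_kib) KiB"
fi
```

With only the 4 GiB /swapfile active, the total should come out just under 4194304 KiB (4096 x 1024). `swapon --show` presents the same information in a more readable form.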
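If everything works, you can use gparted to delete the two disk partitions whose UUIDs appear in the commented-out lines of /etc/fstab. Be careful here, and make sure you delete the correct partitions. Then delete the three commented-out lines from /etc/fstab.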
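Before deleting anything in gparted, the UUIDs from the commented-out lines can be mapped back to device names. This is a sketch under assumptions: `commented_swap_uuids` is a hypothetical helper, the entries are assumed to use the `UUID=` form shown in the question, and `blkid -U` (from util-linux) does the lookup.

```shell
#!/bin/sh
# Hypothetical helper: print the UUIDs of commented-out swap entries
# in an fstab-style file (defaults to /etc/fstab).
commented_swap_uuids() {
    awk '/^#/ {
        line = $0
        sub(/^#[ \t]*/, "", line)            # drop the comment marker
        n = split(line, f, /[ \t]+/)
        if (n >= 3 && f[3] == "swap" && f[1] ~ /^UUID=/) {
            uuid = f[1]
            sub(/^UUID=/, "", uuid)
            gsub(/"/, "", uuid)              # strip optional quotes
            print uuid
        }
    }' "${1:-/etc/fstab}"
}

if [ -r /etc/fstab ] && command -v blkid >/dev/null 2>&1; then
    for uuid in $(commented_swap_uuids /etc/fstab); do
        dev=$(blkid -U "$uuid" 2>/dev/null) || dev="(no matching device)"
        printf '%s -> %s\n' "$uuid" "$dev"
    done
fi
```

The printed device names can then be checked against what gparted shows before any partition is removed.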
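Answer 2
Although the question seems to have been answered, anyone who runs into the same CPU error should (in addition to heynnema's answer) check the PCIe power cable connection to any installed graphics card.
I had the same error, and it went away after I disconnected the graphics card. I later found that the card's 6-pin connector was faulty (and scorched). After I replaced the cable, the system worked normally again.
I also recommend checking that the CPU/memory timings are not too aggressive and that the CPU cooler is attached properly (tightly).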
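Answer 3
I hit this error on a VM in a locally run VM farm whose disks were full. The hypervisor could not allocate more space for the "thin" disk partitions (their physical space is allocated on demand, and the farm was oversubscribed). Note that the hypervisor needs some overhead to run (perhaps 10%) and will reserve that space.
It turned out that one of the physical machines had a fault and was not reporting freed disk space, which gave the VM farm the illusion of full disks. After that machine was restarted, the problem went away. We are in the process of updating the OS and hypervisor; hopefully that will prevent the problem in the future.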
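For the disk-full case, a quick check on a Linux host is to list filesystems above a usage threshold. This is a sketch only: `full_filesystems` is a hypothetical helper, the 90% threshold is an arbitrary assumption, and thin-provisioned storage managed by a hypervisor may have to be inspected with that hypervisor's own tools instead.

```shell
#!/bin/sh
# Hypothetical helper: read `df -P` output on stdin and print
# "<mount point> <use%>" for filesystems at or above the given percentage.
# Assumes mount points contain no whitespace.
full_filesystems() {
    awk -v limit="${1:-90}" 'NR > 1 {
        pct = $5
        sub(/%/, "", pct)
        if (pct + 0 >= limit + 0) print $6, $5
    }'
}

df -P | full_filesystems 90
```

No output means no mounted filesystem has reached the threshold.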
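Answer 4
Add the following lines to /etc/sysctl.conf and reboot the system. The key needs the kernel. prefix, and the kernel accepts values from 0 to 60 seconds; the soft-lockup threshold is twice this value, so 60 gives a 120-second threshold. Note that this only delays the warning and does not remove whatever is holding the CPU.
# Controls the interval between the NMI perf monitoring interrupts that the kernel uses to check for soft-lockup errors.
kernel.watchdog_thresh=60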
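For reference, the kernel documentation states that the soft-lockup threshold is twice `kernel.watchdog_thresh`. The default is 10, so a lockup is reported after roughly 20 seconds, which is consistent with the 23s in the question. Below is a small sketch of that arithmetic and of reading the current value; `softlockup_seconds` is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: soft-lockup threshold in seconds for a given
# watchdog_thresh value (threshold = 2 * watchdog_thresh).
softlockup_seconds() {
    echo $(( 2 * $1 ))
}

if [ -r /proc/sys/kernel/watchdog_thresh ]; then
    thresh=$(cat /proc/sys/kernel/watchdog_thresh)
    echo "watchdog_thresh=$thresh, soft lockup reported after $(softlockup_seconds "$thresh")s"
fi
```

The value can also be changed at runtime, without a reboot, using `sudo sysctl -w kernel.watchdog_thresh=<seconds>`.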