Ubuntu Server hard LOCKUP detected

I'm running Ubuntu Server 22.04 and the machine freezes at seemingly random times, on average about once every 12 hours.

I'm running netconsole to capture logs. When the problem occurs I can still ping the server, but nothing else works: SSH is unresponsive, and netconsole stops sending anything.

My kernel version is 5.15.0-53. My hardware is:

  • Gigabyte A320M motherboard
  • AMD Ryzen 5 1600
  • GeForce GT 710, used for accessing the BIOS and the like. No desktop environment is running.
  • 4TB HDD
  • 8GB RAM

I tested the RAM and the HDD, and both came back fine. I also replaced the PSU, but that made no difference.

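The CPU is the old one from my main PC, reused after I upgraded that machine. It ran Linux flawlessly there and never gave me trouble, so if this is a hardware issue, I assume it must be the motherboard. I'm considering removing the GPU to see whether that fixes it, but the problem didn't start when I added the GPU, so I doubt that's the cause.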

Whenever I get a lockup (hard or soft), the RIP is always:

RIP:0010:smp_call_function_many_cond+0x13a/0x360

Logs from netconsole:

Nov 23 23:01:01 192.168.0.100 [26450.434430] NMI watchdog: Watchdog detected hard LOCKUP on cpu 5
Nov 23 23:01:01 192.168.0.100 [26450.434434] Modules linked in: iptable_filter bpfilter xt_nat xt_tcpudp veth xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xfrm_user xfrm_algo nft_counter xt_addrtype nft_compat nf_tables nfnetlink br_netfilter bridge stp llc overlay nls_iso8859_1 snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi intel_rapl_msr intel_rapl_common snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec edac_mce_amd snd_hda_core snd_hwdep snd_pcm nvidiafb kvm vgastate fb_ddc cdc_acm snd_timer rapl snd i2c_algo_bit soundcore ccp wmi_bmof gigabyte_wmi k10temp mac_hid nvidia_uvm(POE) sch_fq_codel netconsole hwmon_vid msr parport_pc ppdev dm_multipath lp pstore_blk ramoops parport scsi_dh_rdac scsi_dh_emc scsi_dh_alua efi_pstore pstore_zone reed_solomon ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc
Nov 23 23:01:01 192.168.0.100 32c
Nov 23 23:01:01 192.168.0.100 [26450.434525]  raid1 raid0 multipath linear nvidia_drm(POE) nvidia_modeset(POE) nvidia(POE) drm_kms_helper crct10dif_pclmul syscopyarea crc32_pclmul sysfillrect sysimgblt ghash_clmulni_intel fb_sys_fops aesni_intel cec crypto_simd r8169 cryptd rc_core ahci gpio_amdpt xhci_pci drm i2c_piix4 realtek libahci xhci_pci_renesas wmi gpio_generic
Nov 23 23:01:01 192.168.0.100 [26450.434557] CPU: 5 PID: 164 Comm: kworker/5:1 Tainted: P           OE     5.15.0-53-generic #59-Ubuntu
Nov 23 23:01:01 192.168.0.100 [26450.434563] Hardware name: Gigabyte Technology Co., Ltd. A320M-S2H/A320M-S2H-CF, BIOS F5a 07/29/2022
Nov 23 23:01:01 192.168.0.100 [26450.434566] Workqueue: events free_work
Nov 23 23:01:01 192.168.0.100 [26450.434576] RIP: 0010:smp_call_function_many_cond+0x13a/0x360
Nov 23 23:01:01 192.168.0.100 [26450.434585] Code: b0 0a 02 41 89 c4 73 2e 4d 63 ec 48 8b 0b 49 81 fd ff 1f 00 00 0f 87 e4 01 00 00 4a 03 0c ed e0 ca ae ae 8b 41 08 a8 01 74 0a <f3> 90 8b 51 08 83 e2 01 75 f6 eb bb 48 83 c4 40 5b 41 5c 41 5d 41
Nov 23 23:01:01 192.168.0.100 [26450.434588] RSP: 0018:ffffa87f007d7cb0 EFLAGS: 00000202
Nov 23 23:01:01 192.168.0.100 [26450.434592] RAX: 0000000000000011 RBX: ffff91ef76971bc0 RCX: ffff91ef76837a40
Nov 23 23:01:01 192.168.0.100 [26450.434595] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff91ee4006c840
Nov 23 23:01:01 192.168.0.100 [26450.434598] RBP: ffffa87f007d7d18 R08: 0000000000000000 R09: 0000000000000000
Nov 23 23:01:01 192.168.0.100 [26450.434600] R10: 0000000000000000 R11: ffffffffffffffff R12: 0000000000000000
Nov 23 23:01:01 192.168.0.100 [26450.434602] R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000020
Nov 23 23:01:01 192.168.0.100 [26450.434604] FS:  0000000000000000(0000) GS:ffff91ef76940000(0000) knlGS:0000000000000000
Nov 23 23:01:01 192.168.0.100 [26450.434607] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 23 23:01:01 192.168.0.100 [26450.434610] CR2: 00007f70eaca801c CR3: 0000000026210000 CR4: 00000000003506e0
Nov 23 23:01:01 192.168.0.100 [26450.434613] Call Trace:
Nov 23 23:01:01 192.168.0.100 [26450.434615]  <TASK>
Nov 23 23:01:01 192.168.0.100 [26450.434620]  ? invalidate_user_asid+0x30/0x30
Nov 23 23:01:01 192.168.0.100 [26450.434631]  on_each_cpu_cond_mask+0x1d/0x30
Nov 23 23:01:01 192.168.0.100 [26450.434635]  flush_tlb_kernel_range+0x41/0xa0
Nov 23 23:01:01 192.168.0.100 [26450.434641]  __purge_vmap_area_lazy+0xbd/0x6f0
Nov 23 23:01:01 192.168.0.100 [26450.434646]  ? __update_idle_core+0x93/0x120
Nov 23 23:01:01 192.168.0.100 [26450.434652]  ? __cond_resched+0x1a/0x50
Nov 23 23:01:01 192.168.0.100 [26450.434659]  free_vmap_area_noflush+0x2c7/0x310
Nov 23 23:01:01 192.168.0.100 [26450.434665]  remove_vm_area+0xa5/0xc0
Nov 23 23:01:01 192.168.0.100 [26450.434670]  __vunmap+0x93/0x260
Nov 23 23:01:01 192.168.0.100 [26450.434675]  free_work+0x25/0x40
Nov 23 23:01:01 192.168.0.100 [26450.434680]  process_one_work+0x22b/0x3d0
Nov 23 23:01:01 192.168.0.100 [26450.434685]  worker_thread+0x53/0x420
Nov 23 23:01:01 192.168.0.100 [26450.434688]  ? process_one_work+0x3d0/0x3d0
Nov 23 23:01:01 192.168.0.100 [26450.434692]  kthread+0x12a/0x150
Nov 23 23:01:01 192.168.0.100 [26450.434696]  ? set_kthread_struct+0x50/0x50
Nov 23 23:01:01 192.168.0.100 [26450.434701]  ret_from_fork+0x22/0x30
Nov 23 23:01:01 192.168.0.100 [26450.434710]  </TASK>
Nov 23 23:01:01 192.168.0.100 [26450.434715] perf: interrupt took too long (2634 > 2500), lowering kernel.perf_event_max_sample_rate to 75750
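
To confirm that every lockup really stops at the same RIP, the netconsole capture can be scanned automatically. The following is a minimal Python sketch, assuming the log lines look like the ones above; the function name `summarize_lockups` and the regular expressions are my own illustration, not part of any tool.

```python
import re

# Patterns modelled on the kernel messages shown in the log above.
LOCKUP_RE = re.compile(r"Watchdog detected hard LOCKUP on cpu (\d+)")
RIP_RE = re.compile(r"RIP: [0-9a-f]{4}:([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+")

def summarize_lockups(lines):
    """Return a list of (cpu, rip_symbol) pairs, one per hard-lockup report."""
    reports = []
    cpu = None  # CPU number of the report currently being read, if any
    for line in lines:
        m = LOCKUP_RE.search(line)
        if m:
            cpu = int(m.group(1))
            continue
        m = RIP_RE.search(line)
        if m and cpu is not None:
            # Pair the first RIP line after a lockup header with that CPU.
            reports.append((cpu, m.group(1)))
            cpu = None
    return reports

sample = [
    "Nov 23 23:01:01 192.168.0.100 [26450.434430] NMI watchdog: Watchdog detected hard LOCKUP on cpu 5",
    "Nov 23 23:01:01 192.168.0.100 [26450.434576] RIP: 0010:smp_call_function_many_cond+0x13a/0x360",
]
print(summarize_lockups(sample))  # [(5, 'smp_call_function_many_cond')]
```

This only handles hard-lockup reports; soft lockups use a different header line and would need their own pattern.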

If you need more logs or information, just ask. I've been dealing with this for a long time.

Answer 1

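I finally solved my problem! It seems this was happening because virtualization had been turned off in my BIOS at some point. I re-enabled virtualization in the BIOS settings, and the server has now been running smoothly for a week.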
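
One detail in the question's log may fit this: the "Modules linked in:" line lists `kvm` but no `kvm_amd`. That could be a hint that the vendor KVM module failed to load, which can happen when virtualization is disabled in firmware, but this is my own reading and not something confirmed above. A rough Python sketch of that check follows; the function names are hypothetical, and it only reads the first module line, not the continuation lines netconsole emits.

```python
import re

def loaded_modules(log_text):
    """Collect module names from 'Modules linked in:' lines of a kernel report."""
    modules = set()
    for line in log_text.splitlines():
        _, sep, rest = line.partition("Modules linked in:")
        if sep:
            # Drop taint annotations such as "(POE)" after a module name.
            modules.update(re.sub(r"\(.*?\)", "", name) for name in rest.split())
    return modules

def kvm_vendor_module_missing(modules):
    """True when generic kvm is loaded but neither vendor backend is."""
    return "kvm" in modules and not ({"kvm_amd", "kvm_intel"} & modules)

sample = (
    "[26450.434434] Modules linked in: snd_pcm nvidiafb kvm vgastate "
    "nvidia_uvm(POE) netconsole"
)
mods = loaded_modules(sample)
print(kvm_vendor_module_missing(mods))  # True
```

A `True` result is only a prompt to look at the firmware settings; it does not prove that the lockups are related.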
