Linux kernel BUG: scheduling while atomic: swapper/0/0/0x7fff0001


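I am running Arch Linux with kernel 5.17.3 (although this problem has occurred across many versions). Every few days I get a random freeze of the whole system. The kernel logs vary, but the most common one looks like this: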

...
Apr 02 05:04:20 starship kernel: BUG: scheduling while atomic: swapper/0/0/0x7fff0001
Apr 02 05:04:20 starship kernel: Modules linked in: tun uinput btrfs blake2b_generic xor raid6_pq dm_crypt cbc encrypted_keys trusted asn1_encoder tee dm_mod rfcomm snd_seq_dummy snd_hrtimer snd_seq hid_logitech_hidpp xt_CHECKSUM xt_MASQUERADE nft_chain_nat nf_nat bridge stp llc cmac algif_hash algif_skcipher af_alg bnep ip6t_REJECT nf_reject_ipv6 xt_hl mousedev hid_logitech_dj ip6_tables joydev ip6t_rt ipt_REJECT nf_reject_ipv4 xt_LOG nf_log_syslog xt_comment xt_multiport nft_limit btusb btrtl btbcm xt_limit btintel xt_addrtype btmtk xt_tcpudp snd_usb_audio bluetooth xt_conntrack nf_conntrack snd_usbmidi_lib nf_defrag_ipv6 snd_rawmidi nf_defrag_ipv4 snd_seq_device usbhid ecdh_generic nft_compat nf_tables libcrc32c nfnetlink i2c_dev i2c_smbus nvidia_uvm(POE) nvidia_drm(POE) nvidia_modeset(POE) iwlmvm nvidia(POE) mac80211 intel_rapl_msr intel_rapl_common libarc4 edac_mce_amd eeepc_wmi kvm_amd iwlwifi asus_wmi sparse_keymap kvm iwlmei platform_profile irqbypass crct10dif_pclmul crc32_pclmul video wmi_bmof
Apr 02 05:04:20 starship kernel:  mxm_wmi asus_wmi_sensors ghash_clmulni_intel cfg80211 aesni_intel crypto_simd snd_hda_codec_realtek cryptd rfkill snd_hda_codec_generic vfat sp5100_tco fat rapl ledtrig_audio pcspkr snd_hda_codec_hdmi ccp i2c_piix4 k10temp igb mei e1000e tpm_crb dca tpm_tis tpm_tis_core snd_hda_intel tpm snd_intel_dspcfg gpio_amdpt rng_core snd_intel_sdw_acpi gpio_generic pinctrl_amd snd_hda_codec snd_hda_core snd_hwdep wmi mac_hid acpi_cpufreq snd_aloop snd_pcm snd_timer snd soundcore v4l2loopback_dc(OE) videodev mc crypto_user fuse bpf_preload ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 xhci_pci crc32c_intel xhci_pci_renesas
Apr 02 05:04:20 starship kernel: CPU: 0 PID: 0 Comm: swapper/0 Tainted: P           OE     5.17.1-arch1-1 #1 0ea933cb6bfe82a8dc16ab834a4bccdd297f98b7
Apr 02 05:04:20 starship kernel: Hardware name: System manufacturer System Product Name/ROG STRIX B450-F GAMING, BIOS 4801 03/02/2022
Apr 02 05:04:20 starship kernel: Call Trace:
Apr 02 05:04:20 starship kernel:  <TASK>
Apr 02 05:04:20 starship kernel:  dump_stack_lvl+0x48/0x5e
Apr 02 05:04:20 starship kernel:  __schedule_bug.cold+0x4c/0x58
Apr 02 05:04:20 starship kernel:  __schedule+0xd55/0x10a0
Apr 02 05:04:20 starship kernel:  ? hrtimer_start_range_ns+0x272/0x350
Apr 02 05:04:20 starship kernel:  schedule_idle+0x26/0x40
Apr 02 05:04:20 starship kernel:  do_idle+0x16d/0x260
Apr 02 05:04:20 starship kernel:  cpu_startup_entry+0x19/0x20
Apr 02 05:04:20 starship kernel:  start_kernel+0x9a2/0x9c9
Apr 02 05:04:20 starship kernel:  secondary_startup_64_no_verify+0xd5/0xdb
Apr 02 05:04:20 starship kernel:  </TASK>
Apr 02 05:04:20 starship kernel: [UFW BLOCK] IN=enp10s0 OUT= MAC=04:d4:c4:55:3e:fc:98:09:cf:93:64:22:08:00 SRC=192.168.4.7 DST=192.168.4.2 LEN=1909 TOS=0x00 PREC=0x00 TTL=64 ID=44904 PROTO=UDP SPT=40665 DPT=1716 LEN=1889
...

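This is sometimes near the end of the log, but sometimes several (thousand) more lines follow it before a flood of complaints from systemd. Could this be the cause of my crashes? Should I be looking for something else? If this could be the problem, how should I go about debugging it? I suspect a poorly written program, driver, or kernel module on my machine, but I do not know where to start working out which one.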
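As a starting point for reading the first line of the report: the text after "scheduling while atomic:" is the task name, its PID, and the preempt count in hex, so `swapper/0/0/0x7fff0001` is the idle task `swapper/0`, PID 0, with count `0x7fff0001`. The Python sketch below splits such a line and breaks the count into fields. The bit layout (8 bits preempt, 8 softirq, 4 hardirq, 4 NMI) is an assumption taken from the comment in `include/linux/preempt.h` in recent kernels and may differ on other versions; the function names are made up for illustration.

```python
import re

# Assumed field layout as (shift, width), from the comment in
# include/linux/preempt.h of recent kernels. Other versions may differ.
FIELDS = {"preempt": (0, 8), "softirq": (8, 8), "hardirq": (16, 4), "nmi": (20, 4)}

BUG_RE = re.compile(
    r"BUG: scheduling while atomic: (?P<comm>.+)/(?P<pid>\d+)/0x(?P<count>[0-9a-fA-F]+)"
)

def parse_bug_line(line):
    """Return (comm, pid, preempt_count) from a 'scheduling while atomic' line."""
    m = BUG_RE.search(line)
    if m is None:
        return None
    return m.group("comm"), int(m.group("pid")), int(m.group("count"), 16)

def decode_preempt_count(count):
    """Split the count into the assumed fields; 'other' holds any leftover bits."""
    out = {}
    covered = 0
    for name, (shift, width) in FIELDS.items():
        mask = ((1 << width) - 1) << shift
        out[name] = (count & mask) >> shift
        covered |= mask
    out["other"] = count & ~covered
    return out

if __name__ == "__main__":
    line = "kernel: BUG: scheduling while atomic: swapper/0/0/0x7fff0001"
    comm, pid, count = parse_bug_line(line)
    print(comm, pid, hex(count), decode_preempt_count(count))
```

Under that assumed layout, `0x7fff0001` has bits set well above the documented fields, whereas a small value such as `0x00000002` would only indicate a preempt-disable depth of 2. That may hint at a corrupted or unbalanced count rather than an ordinary held lock, but it does not by itself identify which code is responsible.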

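If I am using the computer when it happens, applications usually freeze first, followed almost immediately by the desktop environment (Cinnamon). I can usually still move the mouse for about 30 seconds before the machine hangs completely and I have to hard reset it. If I am away from the computer, either it stops responding to pings, or it is "running" when I come back but will not wake from sleep, the screensaver, or whatever the DE does when idle, and I have to hard reset it.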

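Things I have tried (many of them based on hunches or suggestions that this could be a hardware problem):

  • Updating the BIOS
  • Disabling CPU idle states (after finding out that this can be a common problem with Ryzen CPUs/chipsets)
  • Downclocking the RAM (from the advertised 3600MHz to 3200MHz, which should be a speed the motherboard supports)
  • Stress-testing the CPU (with mprime) and the RAM (with Memtest86, because Memtest86+ would not boot), which found no errors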

Could this still be a hardware problem? If not, where should I start debugging or looking for a software problem?
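One place a software-side investigation could start is the taint field on the `CPU:` line of the report, which here reads `P           OE`, together with the per-module annotations such as `nvidia(POE)` and `v4l2loopback_dc(OE)` in the module list. The sketch below decodes that field. The table covers only a subset of the flag letters, with meanings as described in the kernel's tainted-kernels documentation; the function name and the table itself are illustrative.

```python
import re

# Subset of the single-letter taint flags (see the kernel's
# tainted-kernels documentation for the full list).
TAINT_FLAGS = {
    "P": "proprietary module was loaded",
    "O": "externally built (out-of-tree) module was loaded",
    "E": "unsigned module was loaded",
    "W": "kernel issued a warning earlier",
    "D": "kernel died recently (OOPS or BUG)",
    "L": "soft lockup occurred",
}

# The taint field is a run of capital letters and padding spaces that ends
# where the kernel version (starting with a digit) begins.
TAINT_RE = re.compile(r"Tainted: ([A-Z ]+)")

def taint_flags(line):
    """Return [(letter, meaning), ...] for the taint field of a 'CPU:' line."""
    m = TAINT_RE.search(line)
    if m is None:
        return []  # e.g. a "Not tainted" kernel
    return [
        (c, TAINT_FLAGS.get(c, "not in this table"))
        for c in m.group(1)
        if c != " "
    ]

if __name__ == "__main__":
    line = "CPU: 0 PID: 0 Comm: swapper/0 Tainted: P           OE     5.17.1-arch1-1 #1"
    for letter, meaning in taint_flags(line):
        print(letter, "-", meaning)
```

A tainted kernel does not prove that the flagged modules are at fault, but booting for a while without the out-of-tree or proprietary ones, where that is practical, is one way to narrow the search.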

I can provide more information if it would help. Also, if there is a better place to ask this question, please let me know. Thanks!

Answer 1

Any luck with this?

I think I am seeing something similar:

Jun 02 11:50:34 three kernel: BUG: scheduling while atomic: swapper/0/0/0x00000002
Jun 02 11:50:34 three kernel: Modules linked in: rfcomm xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_filter iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c br_netfilter bridge stp llc overlay uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 snd_usb_audio videobuf2_common snd_usbmidi_lib videodev snd_rawmidi snd_seq_device hid_jabra mc gs_usb can_dev cmac algif_hash algif_skcipher af_alg bnep nct6775 hwmon_vid btusb btrtl btbcm btintel btmtk bluetooth nls_iso8859_1 mousedev hid_logitech_hidpp vfat nzxt_kraken2 joydev ecdh_generic fat usbhid intel_rapl_msr intel_rapl_common iwlmvm snd_hda_codec_realtek edac_mce_amd snd_hda_codec_generic wmi_bmof wl(POE) kvm_amd ledtrig_audio mac80211 snd_hda_codec_hdmi snd_hda_intel amdgpu snd_intel_dspcfg libarc4 kvm snd_intel_sdw_acpi snd_hda_codec irqbypass iwlwifi snd_hda_core crct10dif_pclmul crc32_pclmul snd_hwdep ghash_clmulni_intel iwlmei snd_pcm gpu_sched aesni_intel snd_timer drm_ttm_helper
Jun 02 11:50:34 three kernel:  crypto_simd ttm cryptd snd rapl cfg80211 mei ccp drm_dp_helper soundcore pcspkr igb sp5100_tco k10temp rng_core i2c_piix4 dca gpio_amdpt mac_hid rfkill wmi gpio_generic pinctrl_amd acpi_cpufreq vboxnetflt(OE) vboxnetadp(OE) vboxdrv(OE) i2c_dev crypto_user fuse bpf_preload ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 nvme crc32c_intel xhci_pci nvme_core xhci_pci_renesas
Jun 02 11:50:34 three kernel: CPU: 0 PID: 0 Comm: swapper/0 Tainted: P           OE     5.18.1-arch1-1 #1 aeb6a372044721fe869dfc17901d8ed9fc452f1a
Jun 02 11:50:34 three kernel: Hardware name: Micro-Star International Co., Ltd. MS-7B85/B450 GAMING PRO CARBON AC (MS-7B85), BIOS 1.B0 11/08/2019
Jun 02 11:50:34 three kernel: Call Trace:
Jun 02 11:50:34 three kernel:  <TASK>
Jun 02 11:50:34 three kernel:  dump_stack_lvl+0x48/0x5d
Jun 02 11:50:34 three kernel:  __schedule_bug.cold+0x4b/0x57
Jun 02 11:50:34 three kernel:  __schedule+0xdee/0x11f0
Jun 02 11:50:34 three kernel:  schedule_idle+0x2a/0x40
Jun 02 11:50:34 three kernel:  cpu_startup_entry+0x1d/0x20
Jun 02 11:50:34 three kernel:  rest_init+0xc8/0xd0
Jun 02 11:50:34 three kernel:  arch_call_rest_init+0xe/0x19
Jun 02 11:50:34 three kernel:  start_kernel+0x971/0x997
Jun 02 11:50:34 three kernel:  secondary_startup_64_no_verify+0xd5/0xdb
Jun 02 11:50:34 three kernel:  </TASK>

Here is a list of the modules that appear in both lists:

bluetooth
bnep
bridge
btbcm
btintel
btmtk
btrtl
btusb
ccp
cfg80211
cmac
crc16
cryptd
dca
ext4
fat
fuse
igb
irqbypass
iwlmei
iwlmvm
iwlwifi
jbd2
joydev
k10temp
kvm
libarc4
libcrc32c
llc
mac80211
mbcache
mc
mei
mousedev
nfnetlink
OE
pcspkr
POE
rapl
rfcomm
rfkill
snd
soundcore
stp
usbhid
vfat
videodev
wmi
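This comparison can also be scripted. Note that `OE` and `POE` in the list above are taint annotations attached to module names such as `nvidia(POE)`, not modules themselves, and that names containing underscores (for example `snd_usb_audio`, which is in both reports) seem to be missing from it. The sketch below is one way to intersect two reports while stripping those annotations. It assumes journal-style lines with a `kernel: ` prefix, where the module list continues on following lines that begin with a space; the function names are hypothetical.

```python
import re

HEADER = "Modules linked in:"
# Strips a trailing taint annotation, e.g. "nvidia(POE)" -> "nvidia".
TAINT_SUFFIX = re.compile(r"\([A-Z]+\)$")

def modules_in(log_text):
    """Collect module names from the header line and its continuation lines."""
    modules = set()
    collecting = False
    for line in log_text.splitlines():
        _, sep, msg = line.partition("kernel: ")
        if sep and msg.startswith(HEADER):
            collecting = True
            msg = msg[len(HEADER):]
        elif not (sep and collecting and msg.startswith(" ")):
            # Any other line (such as "CPU: ...") ends the module list.
            collecting = False
            continue
        modules.update(TAINT_SUFFIX.sub("", name) for name in msg.split())
    return modules

def common_modules(log_a, log_b):
    """Sorted module names present in both reports."""
    return sorted(modules_in(log_a) & modules_in(log_b))
```

Calling `common_modules(first_report, second_report)` with the two pasted reports as strings would give the shared set. Many of the shared entries are likely to be generic modules (filesystems, sound, networking), so the result narrows the field rather than pointing at a single suspect.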

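For me, this usually shows up in the kernel log a few minutes before the hard lock, with nothing else in the log. I am also on Arch, and this has been going on for a few months. It happens once every 1-3 days on average.

I have also tried running memtest to rule out a bad RAM stick or something similar, but at this point I am fairly sure it is Linux.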
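To put a number on "a few minutes before the hard lock", the delay between the first BUG line and the last line logged in that boot can be measured. This is a minimal sketch, assuming the default journal line format with a leading `Mon DD HH:MM:SS` timestamp; the helper names are made up, and because the year is not logged a placeholder year is used, so a log spanning New Year would give a wrong result.

```python
from datetime import datetime

MARKER = "BUG: scheduling while atomic"

def stamp(line):
    """Parse the leading 'Mon DD HH:MM:SS' timestamp of a journal line."""
    text = " ".join(line.split()[:3])
    # The journal's short format omits the year; 2000 is only a placeholder.
    return datetime.strptime("2000 " + text, "%Y %b %d %H:%M:%S")

def gap_before_end(lines):
    """Seconds between the first BUG line and the last logged line, or None."""
    lines = [l for l in lines if l.strip()]
    first_bug = next((l for l in lines if MARKER in l), None)
    if first_bug is None:
        return None
    return (stamp(lines[-1]) - stamp(first_bug)).total_seconds()
```

The input could be, for example, the output of `journalctl -k -b -1` (kernel messages from the previous boot) split into lines. If the gap is consistently a few minutes, that would support the BUG being related to the lockup rather than coincidental.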
