Intermittent boot problems after upgrading Nvidia card from 1660 to 4060

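Symptoms: Since the GPU upgrade, some boots (about 25%) run smoothly with good graphics performance. The rest of the time, with no obvious pattern (temperature, for example), the problems begin at boot: startup takes longer, there is noticeable stuttering, and the system is generally unstable.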

Hardware: Desktop build, Ryzen 5 3600X on a Gigabyte B450M (PCIe 3.0 motherboard). Used a GTX 1660 Super for years. Recently upgraded the GPU to an RTX 4060.

Software/firmware: Everything recently upgraded: Kubuntu 23.10, kernel 6.5.0-25-generic, Nvidia driver 545.29.06. UEFI BIOS flashed to the latest version (f65).

dmesg tail:

[   15.090954] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:03.1/0000:07:00.1/sound/card0/input34
[   15.091025] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:03.1/0000:07:00.1/sound/card0/input35
[   15.091093] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:03.1/0000:07:00.1/sound/card0/input36
[   15.091149] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:03.1/0000:07:00.1/sound/card0/input37
[   15.102874] input: HD-Audio Generic Front Mic as /devices/pci0000:00/0000:00:08.1/0000:09:00.4/sound/card2/input38
[   15.102947] input: HD-Audio Generic Rear Mic as /devices/pci0000:00/0000:00:08.1/0000:09:00.4/sound/card2/input39
[   15.103030] input: HD-Audio Generic Line as /devices/pci0000:00/0000:00:08.1/0000:09:00.4/sound/card2/input40
[   15.103112] input: HD-Audio Generic Line Out as /devices/pci0000:00/0000:00:08.1/0000:09:00.4/sound/card2/input41
[   15.103172] input: HD-Audio Generic Front Headphone as /devices/pci0000:00/0000:00:08.1/0000:09:00.4/sound/card2/input42
[   15.153468] nvidia: loading out-of-tree module taints kernel.
[   15.153479] nvidia: module license 'NVIDIA' taints kernel.
[   15.153481] Disabling lock debugging due to kernel taint
[   15.153484] nvidia: module license taints kernel.
[   15.157925] usbcore: registered new interface driver snd-usb-audio
[   15.304528] MCE: In-kernel MCE decoding enabled.
[   15.307207] nvidia-nvlink: Nvlink Core is being initialized, major device number 235

[   15.308703] nvidia 0000:07:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
[   15.353133] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  545.29.06  Thu Nov 16 01:59:08 UTC 2023
[   15.386458] intel_rapl_common: Found RAPL domain package
[   15.386464] intel_rapl_common: Found RAPL domain core
[   15.386498] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  545.29.06  Thu Nov 16 01:47:29 UTC 2023
[   15.400962] [drm] [nvidia-drm] [GPU ID 0x00000700] Loading driver
[   15.644633] loop0: detected capacity change from 0 to 388320
[   15.645753] loop1: detected capacity change from 0 to 510112
[   15.648409] loop2: detected capacity change from 0 to 8
[   15.650742] loop3: detected capacity change from 0 to 92120
[   15.652826] loop4: detected capacity change from 0 to 92576
[   15.655175] loop5: detected capacity change from 0 to 631904
[   15.657612] loop6: detected capacity change from 0 to 631888
[   15.659809] loop7: detected capacity change from 0 to 216720
[   15.662515] loop8: detected capacity change from 0 to 215872
[   15.665097] loop9: detected capacity change from 0 to 113992
[   15.666924] loop10: detected capacity change from 0 to 113992
[   15.669113] loop11: detected capacity change from 0 to 130888
[   15.670029] loop12: detected capacity change from 0 to 130880
[   15.671401] loop13: detected capacity change from 0 to 151784
[   15.673122] loop14: detected capacity change from 0 to 151352
[   15.674632] loop15: detected capacity change from 0 to 200104
[   15.676321] loop16: detected capacity change from 0 to 200104
[   15.677620] loop17: detected capacity change from 0 to 537600
[   15.678938] loop18: detected capacity change from 0 to 546064
[   15.680581] loop19: detected capacity change from 0 to 337560
[   15.681953] loop20: detected capacity change from 0 to 337560
[   15.684582] loop21: detected capacity change from 0 to 716168
[   15.686198] loop22: detected capacity change from 0 to 716176
[   15.687344] loop23: detected capacity change from 0 to 1017608
[   15.689157] loop24: detected capacity change from 0 to 1017816
[   15.690392] loop25: detected capacity change from 0 to 280
[   15.692264] loop26: detected capacity change from 0 to 166424
[   15.693875] loop27: detected capacity change from 0 to 187776
[   15.695362] loop28: detected capacity change from 0 to 224144
[   15.697289] loop29: detected capacity change from 0 to 299592
[   15.698878] loop30: detected capacity change from 0 to 300792
[   15.754976] audit: type=1400 audit(1710347660.404:2): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/bin/busybox" pid=744 comm="apparmor_parser"
[   15.755029] audit: type=1400 audit(1710347660.404:3): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/bin/cam" pid=745 comm="apparmor_parser"
[   15.755101] audit: type=1400 audit(1710347660.404:4): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/opt/brave.com/brave/brave" pid=738 comm="apparmor_parser"
[   15.755153] audit: type=1400 audit(1710347660.404:5): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/bin/ch-checkns" pid=746 comm="apparmor_parser"
[   15.755204] audit: type=1400 audit(1710347660.404:6): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/opt/google/chrome/chrome" pid=739 comm="apparmor_parser"
[   15.755253] audit: type=1400 audit(1710347660.404:7): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/bin/buildah" pid=743 comm="apparmor_parser"
[   15.755313] audit: type=1400 audit(1710347660.404:8): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/bin/toybox" pid=735 comm="apparmor_parser"
[   15.755363] audit: type=1400 audit(1710347660.404:9): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/opt/microsoft/msedge/msedge" pid=740 comm="apparmor_parser"
[   15.755426] audit: type=1400 audit(1710347660.404:10): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/opt/vivaldi/vivaldi-bin" pid=741 comm="apparmor_parser"
[   15.755743] kvm_amd: SVM disabled (by BIOS) in MSR_VM_CR on CPU 2
[   15.756824] audit: type=1400 audit(1710347660.404:11): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/bin/ch-run" pid=749 comm="apparmor_parser"
[   15.879129] RPC: Registered named UNIX socket transport module.
[   15.879135] RPC: Registered udp transport module.
[   15.879137] RPC: Registered tcp transport module.
[   15.879138] RPC: Registered tcp-with-tls transport module.
[   15.879140] RPC: Registered tcp NFSv4.1 backchannel transport module.
[   16.712419] loop31: detected capacity change from 0 to 8
[   16.792625] workqueue: sync_rcu_exp_select_node_cpus hogged CPU for >10000us 4 times, consider switching to WQ_UNBOUND
[   17.988655] Generic FE-GE Realtek PHY r8169-0-500:00: attached PHY driver (mii_bus:phy_addr=r8169-0-500:00, irq=MAC)
[   18.200752] r8169 0000:05:00.0 enp5s0: Link is Down
[   18.352625] workqueue: sync_rcu_exp_select_node_cpus hogged CPU for >10000us 8 times, consider switching to WQ_UNBOUND
[   18.636312] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:07:00.0 on minor 0
[   18.773717] nvidia_uvm: module uses symbols nvUvmInterfaceDisableAccessCntr from proprietary module nvidia, inheriting taint.
[   18.986009] nvidia-uvm: Loaded the UVM driver, major device number 511.
[   21.089851] kauditd_printk_skb: 117 callbacks suppressed
[   21.089855] audit: type=1326 audit(1710347665.740:129): auid=4294967295 uid=0 gid=0 ses=4294967295 subj=snap.tvheadend.tvheadend pid=1127 comm="tvheadend" exe="/snap/tvheadend/216/usr/bin/tvheadend" sig=0 arch=c000003e syscall=92 compat=0 ip=0x70e2f8bc758b code=0x50000
[   21.095301] r8169 0000:05:00.0 enp5s0: Link is Up - 1Gbps/Full - flow control rx/tx
[   21.308689] audit: type=1400 audit(1710347665.960:130): apparmor="STATUS" operation="profile_load" profile="unconfined" name="docker-default" pid=1705 comm="apparmor_parser"
[   21.314632] FS-Cache: Loaded
[   21.539227] audit: type=1326 audit(1710347666.180:131): auid=4294967295 uid=0 gid=0 ses=4294967295 subj=snap.tvheadend.tvheadend pid=1127 comm="tvh:save" exe="/snap/tvheadend/216/usr/bin/tvheadend" sig=0 arch=c000003e syscall=141 compat=0 ip=0x70e2f8bcc40b code=0x50000
[   21.539235] audit: type=1326 audit(1710347666.180:132): auid=4294967295 uid=0 gid=0 ses=4294967295 subj=snap.tvheadend.tvheadend pid=1127 comm="tvh:tasklet" exe="/snap/tvheadend/216/usr/bin/tvheadend" sig=0 arch=c000003e syscall=141 compat=0 ip=0x70e2f8bcc40b code=0x50000
[   21.546952] audit: type=1400 audit(1710347666.196:133): apparmor="DENIED" operation="open" class="file" profile="snap.tvheadend.tvheadend" name="/usr/sbin/" pid=1127 comm="tvheadend" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
[   21.547793] audit: type=1400 audit(1710347666.196:134): apparmor="DENIED" operation="open" class="file" profile="snap.tvheadend.tvheadend" name="/usr/sbin/" pid=1127 comm="tvheadend" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
[   21.548439] audit: type=1400 audit(1710347666.196:135): apparmor="DENIED" operation="open" class="file" profile="snap.tvheadend.tvheadend" name="/usr/games/" pid=1127 comm="tvheadend" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
[   21.666692] bridge: filtering via arp/ip/ip6tables is no longer available by default. Update your scripts to load br_netfilter if you need this.
[   21.668382] Bridge firewalling registered
[   21.722441] Initializing XFRM netlink socket
[   21.779321] NFS: Registering the id_resolver key type
[   21.779335] Key type id_resolver registered
[   21.779336] Key type id_legacy registered
[   22.924595] audit: type=1400 audit(1710347667.576:136): apparmor="DENIED" operation="open" class="file" profile="snap.tvheadend.tvheadend" name="/usr/sbin/" pid=1752 comm="tv_find_grabber" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
[   22.925897] audit: type=1400 audit(1710347667.576:137): apparmor="DENIED" operation="open" class="file" profile="snap.tvheadend.tvheadend" name="/usr/sbin/" pid=1752 comm="tv_find_grabber" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
[   22.926517] audit: type=1400 audit(1710347667.576:138): apparmor="DENIED" operation="open" class="file" profile="snap.tvheadend.tvheadend" name="/usr/games/" pid=1752 comm="tv_find_grabber" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
[   27.432932] snd_hda_intel 0000:07:00.1: azx_get_response timeout, switching to polling mode: last cmd=0x001f0500
[   27.967587] systemd-journald[390]: /var/log/journal/94423ebba5d94f25946a92f16626d35f/user-1000.journal: Monotonic clock jumped backwards relative to last journal entry, rotating.
[   28.444430] snd_hda_intel 0000:07:00.1: azx_get_response timeout, switching to single_cmd mode: last cmd=0x001f0500
[   38.320209] workqueue: sync_rcu_exp_select_node_cpus hogged CPU for >10000us 16 times, consider switching to WQ_UNBOUND

**Things already tried**

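  • Confirmed nouveau is blacklisted
  • Confirmed all traces of the older nvidia driver (525) have been removed
  • Disabled the GPU manager
  • Older kernel (5.x) and older nvidia drivers installed via ubuntu-drivers (525 seems stable but gives lower fps; 535 is unstable)
  • Various nvidia settings tweaks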
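
For anyone wanting to double-check the first item, here is a minimal Python sketch that scans modprobe configuration files for an active `blacklist nouveau` line. The function name `find_nouveau_blacklist` is made up for illustration, and scanning only `/etc/modprobe.d` is an assumption: other locations (for example `/usr/lib/modprobe.d`) and the initramfs are not covered.

```python
import re
from pathlib import Path

# Matches an active (not commented-out) "blacklist nouveau" directive,
# optionally followed by a trailing comment.
BLACKLIST_RE = re.compile(r"^\s*blacklist\s+nouveau\s*(#.*)?$")


def find_nouveau_blacklist(conf_dir="/etc/modprobe.d"):
    """Return (file name, line number) pairs where nouveau is blacklisted.

    Hypothetical helper; conf_dir defaults to /etc/modprobe.d as an assumption.
    """
    directory = Path(conf_dir)
    if not directory.is_dir():
        return []
    hits = []
    for conf in sorted(directory.glob("*.conf")):
        text = conf.read_text(errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if BLACKLIST_RE.match(line):
                hits.append((conf.name, lineno))
    return hits


if __name__ == "__main__":
    found = find_nouveau_blacklist()
    if found:
        for name, lineno in found:
            print(f"{name}:{lineno}: blacklist nouveau")
    else:
        print("no active 'blacklist nouveau' line found in /etc/modprobe.d")
```

An empty result only means no such line exists in that one directory; it does not by itself show that nouveau is being loaded.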

**Not yet tried**

  • nvidia-550 via manual install, since it is ahead of what Kubuntu currently integrates

**Comments/guesses**

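  • I believe (after checking PCPartPicker and based on the successful boots) that the hardware is compatible, the PSU is adequate, etc., although I accept it is marginal, and I do see a CPU bottleneck (GPU utilization saturates at about 95% when things run well). I'm happy to live with that for now, though.

  • dmesg clearly shows the problems appearing after the nvidia driver loads (from about 16 s onward, other drivers are reported hogging the CPU)

  • sudo systemd-analyze blame also shows a NetworkManager delay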
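
To compare good and bad boots on the dmesg point, a rough Python sketch like the one below could list the warnings logged after the driver announces itself. The marker string and the two warning patterns are taken from the log above; the function name `warnings_after_driver_load` and the choice of patterns are my own assumptions, not anything the driver or kernel defines.

```python
import re
import sys

# dmesg lines look like "[   15.353133] message": capture seconds and message.
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")

# Assumed marker for the driver load, copied from the log above.
DRIVER_MARKER = "NVRM: loading NVIDIA"
# Assumed patterns of interest, also copied from the log above.
WARNING_PATTERNS = ("hogged CPU", "azx_get_response timeout")


def warnings_after_driver_load(lines):
    """Return (load_time, [(seconds_after_load, message), ...]).

    load_time is None if the driver marker never appears.
    """
    load_time = None
    found = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip blank or non-dmesg lines
        timestamp, message = float(match.group(1)), match.group(2)
        if load_time is None:
            if DRIVER_MARKER in message:
                load_time = timestamp
            continue  # ignore everything before the driver loads
        if any(pattern in message for pattern in WARNING_PATTERNS):
            found.append((round(timestamp - load_time, 3), message))
    return load_time, found


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (assumed file name): sudo dmesg > boot.log; python3 this_script.py boot.log
    with open(sys.argv[1], errors="replace") as handle:
        load_time, found = warnings_after_driver_load(handle)
    if load_time is None:
        print("driver marker not found")
    else:
        print(f"driver loaded at {load_time:.3f}s")
        for delta, message in found:
            print(f"+{delta:.3f}s  {message}")
```

Running it against a saved log from a smooth boot and from a stuttering one would show whether the warnings really appear only in the bad case. That shows correlation with the driver load, not that the driver is the cause.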
