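Symptoms: Since the GPU upgrade, some boots (about 25%) run smoothly with good graphics performance. The rest of the time (with no obvious pattern, e.g. no link to temperature), the problems start at boot: a longer boot time, noticeable stuttering and general instability.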
Hardware: Desktop build, Ryzen 5 3600X on a Gigabyte B450M (PCIe 3.0 motherboard). Used a GTX 1660 Super for years. Recently upgraded the GPU to an RTX 4060.
Software/firmware: All recently upgraded: Kubuntu 23.10, kernel 6.5.0-25-generic, Nvidia driver 545.29.06. UEFI BIOS flashed to the latest version (f65).
dmesg tail:
[ 15.090954] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:03.1/0000:07:00.1/sound/card0/input34
[ 15.091025] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:03.1/0000:07:00.1/sound/card0/input35
[ 15.091093] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:03.1/0000:07:00.1/sound/card0/input36
[ 15.091149] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:03.1/0000:07:00.1/sound/card0/input37
[ 15.102874] input: HD-Audio Generic Front Mic as /devices/pci0000:00/0000:00:08.1/0000:09:00.4/sound/card2/input38
[ 15.102947] input: HD-Audio Generic Rear Mic as /devices/pci0000:00/0000:00:08.1/0000:09:00.4/sound/card2/input39
[ 15.103030] input: HD-Audio Generic Line as /devices/pci0000:00/0000:00:08.1/0000:09:00.4/sound/card2/input40
[ 15.103112] input: HD-Audio Generic Line Out as /devices/pci0000:00/0000:00:08.1/0000:09:00.4/sound/card2/input41
[ 15.103172] input: HD-Audio Generic Front Headphone as /devices/pci0000:00/0000:00:08.1/0000:09:00.4/sound/card2/input42
[ 15.153468] nvidia: loading out-of-tree module taints kernel.
[ 15.153479] nvidia: module license 'NVIDIA' taints kernel.
[ 15.153481] Disabling lock debugging due to kernel taint
[ 15.153484] nvidia: module license taints kernel.
[ 15.157925] usbcore: registered new interface driver snd-usb-audio
[ 15.304528] MCE: In-kernel MCE decoding enabled.
[ 15.307207] nvidia-nvlink: Nvlink Core is being initialized, major device number 235
[ 15.308703] nvidia 0000:07:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
[ 15.353133] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 545.29.06 Thu Nov 16 01:59:08 UTC 2023
[ 15.386458] intel_rapl_common: Found RAPL domain package
[ 15.386464] intel_rapl_common: Found RAPL domain core
[ 15.386498] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 545.29.06 Thu Nov 16 01:47:29 UTC 2023
[ 15.400962] [drm] [nvidia-drm] [GPU ID 0x00000700] Loading driver
[ 15.644633] loop0: detected capacity change from 0 to 388320
[ 15.645753] loop1: detected capacity change from 0 to 510112
[ 15.648409] loop2: detected capacity change from 0 to 8
[ 15.650742] loop3: detected capacity change from 0 to 92120
[ 15.652826] loop4: detected capacity change from 0 to 92576
[ 15.655175] loop5: detected capacity change from 0 to 631904
[ 15.657612] loop6: detected capacity change from 0 to 631888
[ 15.659809] loop7: detected capacity change from 0 to 216720
[ 15.662515] loop8: detected capacity change from 0 to 215872
[ 15.665097] loop9: detected capacity change from 0 to 113992
[ 15.666924] loop10: detected capacity change from 0 to 113992
[ 15.669113] loop11: detected capacity change from 0 to 130888
[ 15.670029] loop12: detected capacity change from 0 to 130880
[ 15.671401] loop13: detected capacity change from 0 to 151784
[ 15.673122] loop14: detected capacity change from 0 to 151352
[ 15.674632] loop15: detected capacity change from 0 to 200104
[ 15.676321] loop16: detected capacity change from 0 to 200104
[ 15.677620] loop17: detected capacity change from 0 to 537600
[ 15.678938] loop18: detected capacity change from 0 to 546064
[ 15.680581] loop19: detected capacity change from 0 to 337560
[ 15.681953] loop20: detected capacity change from 0 to 337560
[ 15.684582] loop21: detected capacity change from 0 to 716168
[ 15.686198] loop22: detected capacity change from 0 to 716176
[ 15.687344] loop23: detected capacity change from 0 to 1017608
[ 15.689157] loop24: detected capacity change from 0 to 1017816
[ 15.690392] loop25: detected capacity change from 0 to 280
[ 15.692264] loop26: detected capacity change from 0 to 166424
[ 15.693875] loop27: detected capacity change from 0 to 187776
[ 15.695362] loop28: detected capacity change from 0 to 224144
[ 15.697289] loop29: detected capacity change from 0 to 299592
[ 15.698878] loop30: detected capacity change from 0 to 300792
[ 15.754976] audit: type=1400 audit(1710347660.404:2): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/bin/busybox" pid=744 comm="apparmor_parser"
[ 15.755029] audit: type=1400 audit(1710347660.404:3): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/bin/cam" pid=745 comm="apparmor_parser"
[ 15.755101] audit: type=1400 audit(1710347660.404:4): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/opt/brave.com/brave/brave" pid=738 comm="apparmor_parser"
[ 15.755153] audit: type=1400 audit(1710347660.404:5): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/bin/ch-checkns" pid=746 comm="apparmor_parser"
[ 15.755204] audit: type=1400 audit(1710347660.404:6): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/opt/google/chrome/chrome" pid=739 comm="apparmor_parser"
[ 15.755253] audit: type=1400 audit(1710347660.404:7): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/bin/buildah" pid=743 comm="apparmor_parser"
[ 15.755313] audit: type=1400 audit(1710347660.404:8): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/bin/toybox" pid=735 comm="apparmor_parser"
[ 15.755363] audit: type=1400 audit(1710347660.404:9): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/opt/microsoft/msedge/msedge" pid=740 comm="apparmor_parser"
[ 15.755426] audit: type=1400 audit(1710347660.404:10): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/opt/vivaldi/vivaldi-bin" pid=741 comm="apparmor_parser"
[ 15.755743] kvm_amd: SVM disabled (by BIOS) in MSR_VM_CR on CPU 2
[ 15.756824] audit: type=1400 audit(1710347660.404:11): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/bin/ch-run" pid=749 comm="apparmor_parser"
[ 15.879129] RPC: Registered named UNIX socket transport module.
[ 15.879135] RPC: Registered udp transport module.
[ 15.879137] RPC: Registered tcp transport module.
[ 15.879138] RPC: Registered tcp-with-tls transport module.
[ 15.879140] RPC: Registered tcp NFSv4.1 backchannel transport module.
[ 16.712419] loop31: detected capacity change from 0 to 8
[ 16.792625] workqueue: sync_rcu_exp_select_node_cpus hogged CPU for >10000us 4 times, consider switching to WQ_UNBOUND
[ 17.988655] Generic FE-GE Realtek PHY r8169-0-500:00: attached PHY driver (mii_bus:phy_addr=r8169-0-500:00, irq=MAC)
[ 18.200752] r8169 0000:05:00.0 enp5s0: Link is Down
[ 18.352625] workqueue: sync_rcu_exp_select_node_cpus hogged CPU for >10000us 8 times, consider switching to WQ_UNBOUND
[ 18.636312] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:07:00.0 on minor 0
[ 18.773717] nvidia_uvm: module uses symbols nvUvmInterfaceDisableAccessCntr from proprietary module nvidia, inheriting taint.
[ 18.986009] nvidia-uvm: Loaded the UVM driver, major device number 511.
[ 21.089851] kauditd_printk_skb: 117 callbacks suppressed
[ 21.089855] audit: type=1326 audit(1710347665.740:129): auid=4294967295 uid=0 gid=0 ses=4294967295 subj=snap.tvheadend.tvheadend pid=1127 comm="tvheadend" exe="/snap/tvheadend/216/usr/bin/tvheadend" sig=0 arch=c000003e syscall=92 compat=0 ip=0x70e2f8bc758b code=0x50000
[ 21.095301] r8169 0000:05:00.0 enp5s0: Link is Up - 1Gbps/Full - flow control rx/tx
[ 21.308689] audit: type=1400 audit(1710347665.960:130): apparmor="STATUS" operation="profile_load" profile="unconfined" name="docker-default" pid=1705 comm="apparmor_parser"
[ 21.314632] FS-Cache: Loaded
[ 21.539227] audit: type=1326 audit(1710347666.180:131): auid=4294967295 uid=0 gid=0 ses=4294967295 subj=snap.tvheadend.tvheadend pid=1127 comm="tvh:save" exe="/snap/tvheadend/216/usr/bin/tvheadend" sig=0 arch=c000003e syscall=141 compat=0 ip=0x70e2f8bcc40b code=0x50000
[ 21.539235] audit: type=1326 audit(1710347666.180:132): auid=4294967295 uid=0 gid=0 ses=4294967295 subj=snap.tvheadend.tvheadend pid=1127 comm="tvh:tasklet" exe="/snap/tvheadend/216/usr/bin/tvheadend" sig=0 arch=c000003e syscall=141 compat=0 ip=0x70e2f8bcc40b code=0x50000
[ 21.546952] audit: type=1400 audit(1710347666.196:133): apparmor="DENIED" operation="open" class="file" profile="snap.tvheadend.tvheadend" name="/usr/sbin/" pid=1127 comm="tvheadend" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
[ 21.547793] audit: type=1400 audit(1710347666.196:134): apparmor="DENIED" operation="open" class="file" profile="snap.tvheadend.tvheadend" name="/usr/sbin/" pid=1127 comm="tvheadend" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
[ 21.548439] audit: type=1400 audit(1710347666.196:135): apparmor="DENIED" operation="open" class="file" profile="snap.tvheadend.tvheadend" name="/usr/games/" pid=1127 comm="tvheadend" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
[ 21.666692] bridge: filtering via arp/ip/ip6tables is no longer available by default. Update your scripts to load br_netfilter if you need this.
[ 21.668382] Bridge firewalling registered
[ 21.722441] Initializing XFRM netlink socket
[ 21.779321] NFS: Registering the id_resolver key type
[ 21.779335] Key type id_resolver registered
[ 21.779336] Key type id_legacy registered
[ 22.924595] audit: type=1400 audit(1710347667.576:136): apparmor="DENIED" operation="open" class="file" profile="snap.tvheadend.tvheadend" name="/usr/sbin/" pid=1752 comm="tv_find_grabber" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
[ 22.925897] audit: type=1400 audit(1710347667.576:137): apparmor="DENIED" operation="open" class="file" profile="snap.tvheadend.tvheadend" name="/usr/sbin/" pid=1752 comm="tv_find_grabber" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
[ 22.926517] audit: type=1400 audit(1710347667.576:138): apparmor="DENIED" operation="open" class="file" profile="snap.tvheadend.tvheadend" name="/usr/games/" pid=1752 comm="tv_find_grabber" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
[ 27.432932] snd_hda_intel 0000:07:00.1: azx_get_response timeout, switching to polling mode: last cmd=0x001f0500
[ 27.967587] systemd-journald[390]: /var/log/journal/94423ebba5d94f25946a92f16626d35f/user-1000.journal: Monotonic clock jumped backwards relative to last journal entry, rotating.
[ 28.444430] snd_hda_intel 0000:07:00.1: azx_get_response timeout, switching to single_cmd mode: last cmd=0x001f0500
[ 38.320209] workqueue: sync_rcu_exp_select_node_cpus hogged CPU for >10000us 16 times, consider switching to WQ_UNBOUND
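**Already tried**
- Confirmed nouveau is blacklisted
- Confirmed all traces of the old nvidia driver (525) were removed
- Disabled the GPU manager
- Older kernels (5.x) and nvidia drivers (525 installed with ubuntu-drivers seems stable but gives lower fps; 535 is unstable)
- Various nvidia settings tweaks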
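One way to re-check the first item on every boot is to scan the modprobe configuration for a `blacklist nouveau` line and look at which modules are actually loaded. The following is a minimal Python sketch, not a complete check: the helper names are hypothetical, the directory list is an assumption based on modprobe.d(5), and it does not inspect the kernel command line (`modprobe.blacklist=`) or the copy of the configuration inside the initramfs.

```python
import glob
from pathlib import Path
from typing import Dict, Iterable, Set

# Assumed modprobe config directories (see modprobe.d(5)); missing ones glob to nothing.
MODPROBE_DIRS = ("/etc/modprobe.d", "/run/modprobe.d", "/usr/lib/modprobe.d", "/lib/modprobe.d")


def blacklist_entries(conf_text: str) -> Set[str]:
    """Module names from `blacklist <module>` lines, ignoring # comments."""
    names = set()
    for raw in conf_text.splitlines():
        parts = raw.split("#", 1)[0].split()
        if len(parts) >= 2 and parts[0] == "blacklist":
            names.add(parts[1])
    return names


def loaded_modules(proc_modules_text: str) -> Set[str]:
    """The first field of every /proc/modules line is the module name."""
    return {line.split()[0] for line in proc_modules_text.splitlines() if line.strip()}


def check_nouveau(dirs: Iterable[str] = MODPROBE_DIRS,
                  proc_modules: str = "/proc/modules") -> Dict[str, bool]:
    """Report whether nouveau is blacklisted on disk and which GPU driver is loaded."""
    blacklisted = any(
        "nouveau" in blacklist_entries(Path(p).read_text(errors="replace"))
        for d in dirs
        for p in sorted(glob.glob(f"{d}/*.conf"))
    )
    loaded = loaded_modules(Path(proc_modules).read_text())
    return {
        "nouveau_blacklisted": blacklisted,
        "nouveau_loaded": "nouveau" in loaded,
        "nvidia_loaded": "nvidia" in loaded,
    }
```

On the affected machine, `check_nouveau()` should report `nouveau_loaded: False` and `nvidia_loaded: True` on both good and bad boots if the blacklist is effective.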
**Not yet tried**
- New
- nvidia-550 via a manual install, ahead of its Kubuntu packaging
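**Comments/guesses**
I believe (after checking PCPartPicker, and given the successful boots) that the hardware is compatible, the PSU is sufficient, etc., although I admit it is borderline, and I do see a CPU bottleneck (when things run well, the GPU saturates at about 95% utilization). I'm happy to accept that for now, though.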
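dmesg shows the problems start after the nvidia driver loads: from about 16 s onward, the kernel repeatedly warns that `sync_rcu_exp_select_node_cpus` hogged the CPU (at 16.8 s, 18.4 s and 38.3 s).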
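To compare this timing across good and bad boots, the NVIDIA load banner and the workqueue CPU-hog warnings can be pulled out of saved `dmesg` output. A minimal Python sketch, assuming the default `[seconds] message` timestamp format (not `dmesg -T`) and treating the `NVRM: loading` line as the moment the driver loaded; the function names and the script name `hogs.py` are hypothetical.

```python
import re
import sys
from pathlib import Path
from typing import List, Optional, Tuple

# "[   18.352625] message" -> seconds since boot and the message text
DMESG_LINE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")
# "workqueue: <function> hogged CPU for >10000us <n> times, ..."
HOG = re.compile(r"workqueue: (\S+) hogged CPU for >(\d+)us (\d+) times")


def parse_dmesg(text: str) -> List[Tuple[float, str]]:
    """Turn default-format dmesg output into (timestamp, message) pairs."""
    entries = []
    for line in text.splitlines():
        m = DMESG_LINE.match(line.strip())
        if m:
            entries.append((float(m.group(1)), m.group(2)))
    return entries


def hogs_after_nvidia(text: str) -> Tuple[Optional[float], List[Tuple[float, str, int]]]:
    """Return the NVRM banner time and the CPU-hog warnings logged at or after it.

    If no NVRM banner is present (e.g. a nouveau boot), every hog warning is returned.
    """
    entries = parse_dmesg(text)
    # Assumption: the "NVRM: loading ..." banner marks when the NVIDIA module loaded.
    nv_time = next((t for t, msg in entries if msg.startswith("NVRM: loading")), None)
    hogs = []
    for t, msg in entries:
        m = HOG.search(msg)
        if m and (nv_time is None or t >= nv_time):
            hogs.append((t, m.group(1), int(m.group(3))))
    return nv_time, hogs


if __name__ == "__main__" and len(sys.argv) > 1:
    nv_time, hogs = hogs_after_nvidia(Path(sys.argv[1]).read_text(errors="replace"))
    print(f"NVRM banner at: {nv_time}")
    for t, func, count in hogs:
        print(f"{t:10.3f} s  {func} hogged CPU ({count} times)")
```

Saving `sudo dmesg > good.txt` on a smooth boot and `sudo dmesg > bad.txt` on a stuttering one, then running `python3 hogs.py` on each, would show whether the warnings only appear on bad boots. That would strengthen the link to the driver but would not by itself prove causation.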
`sudo systemd-analyze blame` also shows a NetworkManager delay.
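`systemd-analyze blame` prints one unit per line with durations such as `6.016s`, `834ms` or `1min 2.345s`. Turning those into numbers makes it easier to compare a good and a bad boot. A minimal Python sketch, assuming that line layout; the function name and the script name `blame.py` are hypothetical, and only the `h`, `min`, `s`, `ms` and microsecond suffixes are handled.

```python
import re
import sys
from pathlib import Path
from typing import List, Tuple

# Seconds per unit suffix; microseconds may appear as "us" (or, on some versions, "μs").
UNIT_SECONDS = {"h": 3600.0, "min": 60.0, "s": 1.0, "ms": 1e-3, "us": 1e-6, "μs": 1e-6}
TOKEN = re.compile(r"^(\d+(?:\.\d+)?)(min|ms|us|μs|h|s)$")


def parse_blame(text: str) -> List[Tuple[float, str]]:
    """Return (seconds, unit) pairs, slowest first.

    Each line is one or more duration tokens followed by the unit name;
    lines that do not fit this shape are skipped.
    """
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue
        *durations, unit = parts
        total = 0.0
        for tok in durations:
            m = TOKEN.match(tok)
            if m is None:
                break  # not a duration token: skip this line
            total += float(m.group(1)) * UNIT_SECONDS[m.group(2)]
        else:
            rows.append((total, unit))
    rows.sort(reverse=True)
    return rows


if __name__ == "__main__" and len(sys.argv) > 1:
    for seconds, unit in parse_blame(Path(sys.argv[1]).read_text()):
        if "NetworkManager" in unit:
            print(f"{seconds:8.3f} s  {unit}")
```

Note that `blame` lists how long each unit took, not whether it delayed the desktop; `NetworkManager-wait-online.service` often ranks high simply because it waits for the network. `systemd-analyze critical-chain` shows which units were actually on the boot's critical path.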