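My machine has stopped detecting my second GPU (both are RTX 3090s). This is not a new machine. The problem first appeared a few weeks ago, and I resolved it by rolling back to an older kernel (version unknown). After a recent update, however, I lost that kernel and have been stuck with the problem ever since.
Here is what I have tried so far: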
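- Swapping the GPUs between PCI slots to rule out a hardware problem
- Updating to the latest motherboard BIOS
- A fresh 22.04 installation for each of the following drivers:
  - Every NVIDIA CUDA release (>= 11.7) from the NVIDIA downloads page (deb local, deb network, and runfile)
  - Every Ubuntu nvidia-driver* package available to me that keeps CUDA at version 11.7 or later
- Rolling back to an arbitrary older kernel version (5.15) using mainline
- Rolling forward to kernel 6.4
- Booting with an HDMI monitor attached to GPU 2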
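*Note that all of the older Ubuntu nvidia-driver-5XX packages are transitional packages to 525 or 535 (see apt search nvidia-driver). The last driver with which both of my GPUs worked was 515.
The one GPU that is listed (which is also my display GPU) does run CUDA workloads, but it seems to make my system unstable and laggy a few minutes after a job (PyTorch) starts.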
❯ uname -r
5.19.0-46-generic
❯ lspci | grep VGA
09:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)
43:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)
❯ nvidia-smi
Sat Jul 1 12:11:41 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06   Driver Version: 525.125.06   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:43:00.0  On |                  N/A |
|  0%   41C    P8    24W / 350W |    562MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1879      G   /usr/lib/xorg/Xorg                140MiB |
|    0   N/A  N/A      2338    C+G   ...ome-remote-desktop-daemon      258MiB |
|    0   N/A  N/A      2375      G   /usr/bin/gnome-shell               87MiB |
|    0   N/A  N/A      3338      G   ...566776601308618822,262144       73MiB |
+-----------------------------------------------------------------------------+
dmesg: link to GitHub Gist
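The mismatch above (lspci lists two cards, nvidia-smi lists one) can be checked mechanically after each reboot. What follows is only a minimal sketch: the function names pci_gpus, driver_gpus and missing_gpus are made up for illustration, and it assumes nvidia-smi reports bus IDs with an 8-digit domain prefix, as in the output above.

```shell
#!/bin/sh
# Hypothetical helpers; names are illustrative and not from the original post.

# Read lspci output on stdin, print the PCI address of each NVIDIA GPU.
pci_gpus() {
    grep -iE 'VGA compatible controller|3D controller' | grep -i 'nvidia' | awk '{print $1}'
}

# Read nvidia-smi bus IDs on stdin (e.g. 00000000:43:00.0) and print them
# in lspci form (43:00.0): drop the domain prefix, lower-case hex digits.
driver_gpus() {
    sed -E 's/^[0-9A-Fa-f]{8}://' | tr 'A-F' 'a-f'
}

# Print every address from list $1 that does not appear in list $2.
missing_gpus() {
    for addr in $1; do
        case " $(echo $2) " in
            *" $addr "*) ;;          # the driver sees this card
            *) echo "$addr" ;;       # on the bus, but unknown to the driver
        esac
    done
}

# On a live system (assumes lspci and nvidia-smi are installed):
#   missing_gpus "$(lspci | pci_gpus)" \
#       "$(nvidia-smi --query-gpu=pci.bus_id --format=csv,noheader | driver_gpus)"
```

With the output shown in this question, the helper would print 09:00.0, the card that is on the PCI bus but missing from nvidia-smi.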
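Strangely, after a fresh installation of CUDA (not specific to a single driver version) and a reboot, the second GPU does appear in nvidia-smi. But after the next reboot it disappears again. Uninstalling and reinstalling CUDA can reproduce this, but it seems random (and it is not something I want to do on every reboot).
Is there any way to get my machine working properly again?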
Answer 1
sudo dkms autoinstall
Rebuilding the NVIDIA kernel modules may help.
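Before rebuilding, it may be worth checking whether DKMS already has an nvidia module installed for the running kernel. This is a rough sketch only: nvidia_built_for is a made-up helper name, and it assumes dkms status lines start with "nvidia" and end in ": installed" for a built module, which is the usual format but can differ between DKMS versions.

```shell
#!/bin/sh
# Hypothetical helper; the name is illustrative.

# Read `dkms status` output on stdin. Succeed (exit 0) only if some nvidia
# module is reported as installed for the kernel release given as $1.
nvidia_built_for() {
    grep -i '^nvidia' | grep -F -- "$1" | grep -q ': installed'
}

# On a live system:
#   if ! dkms status | nvidia_built_for "$(uname -r)"; then
#       sudo dkms autoinstall   # rebuild modules for the running kernel
#   fi
```

If the check fails only right after a kernel update, that would point at the module not having been rebuilt for the new kernel rather than at the hardware.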
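Answer 2
As of today, the only solution I have found is to install kernel version 5.15 using mainline. This brought my second GPU back in nvidia-smi.
I don't know why the current 22.04.2 LTS image uses 5.19, since it is stated here that 22.04 LTS should be on 5.15. It is odd that a routine update would cause this problem. I am sure the main reason people use LTS releases is to avoid exactly this kind of issue.
Edit: Based on the release notes:
Ubuntu Desktop will automatically opt into the v5.17 kernel on the latest generation of certified devices (linux-oem-22.04)
Ubuntu Server defaults to the non-rolling LTS kernel v5.15 (linux-generic)
So it looks like 5.15 may only apply to Ubuntu Server, while Ubuntu Desktop uses a rolling kernel. It is a pity that the current kernel seems to have broken something...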
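To see at a glance which kernel series a machine actually booted, the release string from uname -r can be reduced to its major.minor part. A small sketch, with kernel_series and is_ga_kernel as made-up helper names, and taking 5.15 as the non-rolling 22.04 kernel per the release notes quoted above:

```shell
#!/bin/sh
# Hypothetical helpers; names are illustrative.

# Print the major.minor series of a kernel release string,
# e.g. 5.19.0-46-generic gives 5.19.
kernel_series() {
    printf '%s\n' "$1" | cut -d. -f1,2
}

# Succeed only if the release string belongs to the 5.15 series.
is_ga_kernel() {
    [ "$(kernel_series "$1")" = "5.15" ]
}

# On a live system:
#   kernel_series "$(uname -r)"
#   is_ga_kernel "$(uname -r)" || echo "not running the 5.15 series"
```

For the kernel in the question (5.19.0-46-generic) the series is 5.19, so the check fails, which matches the observation that the desktop image had rolled forward.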
Answer 3
@Anjum Sayed, could you explain in detail how you restored it? I am on a dual-boot Windows 10/Ubuntu 20.04 desktop and ran into the same problem. I can no longer see my RTX 3090 GPU:
Loading new nvidia-465.19.01 DKMS files...
Building for 5.15.0-76-generic
Building for architecture x86_64
Building initial module for 5.15.0-76-generic
ERROR: Cannot create report: [Errno 17] File exists: '/var/crash/nvidia-dkms-465.0.crash'
Error! Bad return status for module build on kernel: 5.15.0-76-generic (x86_64)
Consult /var/lib/dkms/nvidia/465.19.01/build/make.log for more information.
dpkg: error processing package nvidia-dkms-465 (--configure):
installed nvidia-dkms-465 package post-installation script subprocess returned error exit status 10
dpkg: dependency problems prevent configuration of cuda-drivers-465:
cuda-drivers-465 depends on nvidia-dkms-465 (>= 465.19.01); however:
Package nvidia-dkms-465 is not configured yet.
dpkg: error processing package cuda-drivers-465 (--configure):
dependency problems - leaving unconfigured
No apport report written because the error message indicates its a followup error from a previous failure.
No apport report written because the error message indicates its a followup error from a previous failure.
dpkg: dependency problems prevent configuration of cuda-drivers:
cuda-drivers depends on cuda-drivers-465 (= 465.19.01-1); however:
Package cuda-drivers-465 is not configured yet.
dpkg: error processing package cuda-drivers (--configure):
dependency problems - leaving unconfigured
dpkg: dependency problems prevent configuration of nvidia-driver-465:
nvidia-driver-465 depends on nvidia-dkms-465 (= 465.19.01-0ubuntu1); however:
Package nvidia-dkms-465 is not configured yet.
dpkg: error processing package nvidia-driver-465 (--configure):
dependency problems - leaving unconfigured
dpkg: dependency problems prevent configuration of cuda-runtime-11-3:
cuda-runtime-11-3 depends on cuda-drivers (>= 465.19.01); however:
Package cuda-drivers is not configured yet.
dpkg: error processing package cuda-runtime-11-3 (--configure):
dependency problems - leaving unconfigured
dpkg: dependency problems prevent configuration of cuda-demo-suite-11-3:
cuda-demo-suite-11-3 depends on cuda-runtimeNo apport report written because MaxReports is reached already
No apport report written because MaxReports is reached already
No apport report written because MaxReports is reached already
-11-3; however:
Package cuda-runtime-11-3 is not configured yet.
dpkg: error processing package cuda-demo-suite-11-3 (--configure):
dependency problems - leaving unconfigured
dpkg: dependency problems prevent configuration of cuda-11-3:
cuda-11-3 depends on cuda-runtime-11-3 (>= 11.3.1); however:
Package cuda-runtime-11-3 is not configured yet.
cuda-11-3 depends on cuda-demo-suite-11-3 (>= 11.3.58); however:
Package cuda-demo-suite-11-3 is not configured yet.
dpkg: error processing package cuda-11-3 (--configure):
dependency problems - leaving unconfigured
dpkg: dependency problems prevent configuration of cuda:
cuda depends on cuda-11-3 (>= 11.3.1); however:
Package cuda-11-3 is not configured yet.
No apport report written because MaxReports is reached already
No apport report written because MaxReports is reached already
dpkg: error processing package cuda (--configure):
dependency problems - leaving unconfigured
Processing triggers for initramfs-tools (0.136ubuntu6.7) ...
update-initramfs: Generating /boot/initrd.img-5.15.0-76-generic
Errors were encountered while processing:
nvidia-dkms-465
cuda-drivers-465
cuda-drivers
nvidia-driver-465
cuda-runtime-11-3
cuda-demo-suite-11-3
cuda-11-3
cuda
E: Sub-process /usr/bin/dpkg returned an error code (1)
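In a log like this, most entries are follow-up dependency failures. The first package dpkg failed on (here nvidia-dkms-465, whose module build failed) is usually the one to investigate. The helper below is just a sketch for pulling that name out of a saved log; first_failed_package is a made-up name.

```shell
#!/bin/sh
# Hypothetical helper; the name is illustrative.

# Read an apt/dpkg transcript on stdin and print the first package that
# dpkg reported an error for; later errors are typically knock-on failures.
first_failed_package() {
    sed -n 's/^ *dpkg: error processing package \([^ ]*\) .*/\1/p' | head -n 1
}

# On a live system, for example:
#   sudo dpkg --configure -a 2>&1 | first_failed_package
# The log above also names the build log to read next:
#   tail -n 40 /var/lib/dkms/nvidia/465.19.01/build/make.log
```

The make.log path comes straight from the "Consult ..." line in the output above, and that file should contain the actual compiler error for the 5.15.0-76-generic build.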