Second GPU not shown in nvidia-smi on Ubuntu 22.04

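My machine has been failing to detect my second GPU (both are RTX 3090s). This is not a new machine. The problem first appeared a few weeks ago, and I worked around it by rolling back to an older kernel (I don't know which version). After a recent update I lost that kernel, and I have been stuck with the problem ever since.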

Here is what I have tried so far:

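  • Swapping the GPUs between PCI slots to rule out a hardware problem
  • Updating to the latest motherboard BIOS
  • A fresh install of 22.04 for each of the drivers below
  • Installing every NVIDIA CUDA release (>= 11.7) from the NVIDIA download page (deb local, deb network, and runfile)
  • Every Ubuntu nvidia-driver* I could use while keeping the CUDA version at 11.7 or later
  • Rolling back to an arbitrary older kernel version (5.15) using mainline
  • Rolling forward to kernel 6.4
  • Booting with an HDMI monitor connected to GPU 2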

*Note that all of the older Ubuntu nvidia-driver-5XX packages are transitional packages to 525 or 535 (apt search nvidia-driver). The last driver I used with both GPUs working was 515.

The one GPU that is listed (also my display GPU) does run CUDA workloads, but it seems to make my system unstable and laggy a few minutes after a job (PyTorch) starts.

❯ uname -r
5.19.0-46-generic
❯ lspci | grep VGA
09:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)
43:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)
❯ nvidia-smi
Sat Jul  1 12:11:41 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06   Driver Version: 525.125.06   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:43:00.0  On |                  N/A |
|  0%   41C    P8    24W / 350W |    562MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1879      G   /usr/lib/xorg/Xorg                140MiB |
|    0   N/A  N/A      2338    C+G   ...ome-remote-desktop-daemon      258MiB |
|    0   N/A  N/A      2375      G   /usr/bin/gnome-shell               87MiB |
|    0   N/A  N/A      3338      G   ...566776601308618822,262144       73MiB |
+-----------------------------------------------------------------------------+
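
The output above shows the mismatch: lspci lists two RTX 3090s (09:00.0 and 43:00.0), while nvidia-smi reports only the one at 43:00.0. A minimal sketch of automating that comparison follows. The helper names (pci_gpu_addrs, smi_gpu_addrs, missing_gpus) are hypothetical, and it assumes lspci prints short addresses without a PCI domain prefix.

```shell
#!/bin/sh
# Sketch: list NVIDIA GPUs that the PCI bus sees but the driver did not enumerate.
# All function names here are illustrative helpers, not part of any NVIDIA tool.

# Read `lspci` output on stdin; print the bus address of each NVIDIA VGA/3D controller.
pci_gpu_addrs() {
    grep -E '(VGA compatible|3D) controller: NVIDIA' | awk '{print $1}'
}

# Read `nvidia-smi --query-gpu=pci.bus_id --format=csv,noheader` output on stdin;
# strip the domain so 00000000:43:00.0 becomes lspci's short lowercase form 43:00.0.
smi_gpu_addrs() {
    awk 'NF { sub(/^[0-9A-Fa-f]+:/, ""); print tolower($0) }'
}

# $1 = lspci output, $2 = nvidia-smi bus-id output.
# Print every PCI address that has no matching entry in the driver's list.
missing_gpus() {
    smi_list=$(printf '%s\n' "$2" | smi_gpu_addrs)
    printf '%s\n' "$1" | pci_gpu_addrs | while read -r addr; do
        printf '%s\n' "$smi_list" | grep -qxF "$addr" || printf '%s\n' "$addr"
    done
}

# Only query the real system when both tools are available.
if command -v lspci >/dev/null 2>&1 && command -v nvidia-smi >/dev/null 2>&1; then
    missing_gpus "$(lspci)" \
        "$(nvidia-smi --query-gpu=pci.bus_id --format=csv,noheader)"
fi
```

Fed the two outputs shown in the question, missing_gpus would print 09:00.0, the card that never reaches the driver.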

dmesg output: link to GitHub Gist

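Strangely, after a fresh install of CUDA (not limited to one driver version) and a reboot, the second GPU does show up in nvidia-smi. But after another reboot it disappears again. Uninstalling and reinstalling CUDA can reproduce this, but it seems random (and it is not something I want to do on every reboot).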
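
Because the second GPU comes and goes between reboots, comparing the driver's kernel log lines from a good boot and a bad boot may narrow things down. The following is only a sketch: nvrm_lines is a hypothetical helper, and the search pattern is an assumption about which log lines are relevant.

```shell
#!/bin/sh
# Sketch: keep only the kernel log lines that mention the NVIDIA driver.
# nvrm_lines is an illustrative helper; it reads kernel log text on stdin.
nvrm_lines() {
    grep -iE 'nvrm|nvidia'
}

# dmesg may require root on Ubuntu; failures are ignored so the sketch stays harmless.
if command -v dmesg >/dev/null 2>&1; then
    dmesg 2>/dev/null | nvrm_lines || true
fi
```

Saving this output once when both GPUs are listed and once when only one is, then diffing the two files, is one way to spot what differs. On systemd systems, `journalctl -k -b -1` prints the kernel log of the previous boot and can be piped through the same filter.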

Is there any way to get my machine working properly again?

Link to nvidia-bug-report

Answer 1

sudo dkms autoinstall

Rebuilding the NVIDIA kernel modules might help.
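
Before or after rebuilding, it may be worth checking whether DKMS reports an nvidia module as installed for the running kernel. This is a sketch under an assumption: it expects `dkms status` lines of the form "nvidia, <version>, <kernel>, <arch>: installed" (or "nvidia/<version>, ..." on newer DKMS releases), and the format can differ between DKMS versions. nvidia_built_for is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: succeed if `dkms status` output (read on stdin) lists an installed
# nvidia module for the kernel release given as $1.
# Assumes lines such as: nvidia, 525.125.06, 5.19.0-46-generic, x86_64: installed
nvidia_built_for() {
    grep -E '^nvidia[,/]' | grep -F -- ", $1," | grep -q ': installed'
}

# Only query DKMS when it is present on this machine.
if command -v dkms >/dev/null 2>&1; then
    if dkms status | nvidia_built_for "$(uname -r)"; then
        echo "nvidia module is installed for $(uname -r)"
    else
        echo "no installed nvidia module for $(uname -r); 'sudo dkms autoinstall' may rebuild it"
    fi
fi
```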

Answer 2

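As of today, the only solution I have found is to install kernel version 5.15 using mainline. This brought my second GPU back in nvidia-smi.

I don't know why the current 22.04.2 LTS image uses 5.19, since it is stated here that 22.04 LTS should be on 5.15. It is strange that a routine update can cause this problem. I'm sure the main reason people use LTS releases is to avoid exactly this kind of issue.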

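Edit: based on the release notes

Ubuntu Desktop will automatically opt to use the v5.17 kernel on the latest generation of certified devices (linux-oem-22.04)

Ubuntu Server defaults to the non-rolling LTS kernel v5.15 (linux-generic)

So it looks like 5.15 may apply only to Ubuntu Server, while Ubuntu Desktop uses a rolling kernel. Unfortunately, the current kernel seems to have broken something...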
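
To see which side of that split a machine is on, the running kernel's series can be compared against 5.15. This is a rough sketch: kernel_series and is_ga_kernel are hypothetical helpers, and the commented-out apt commands are suggestions to verify first, with the package version in the hold command given only as an example.

```shell
#!/bin/sh
# Sketch: decide whether a kernel release string belongs to the 5.15 series
# that the release notes quoted above name as the non-rolling LTS kernel.

# Print the major.minor series of a release string such as 5.19.0-46-generic.
kernel_series() {
    printf '%s\n' "$1" | cut -d. -f1,2
}

# Succeed only for releases in the 5.15 series.
is_ga_kernel() {
    [ "$(kernel_series "$1")" = "5.15" ]
}

if is_ga_kernel "$(uname -r)"; then
    echo "running $(uname -r): 5.15 series"
else
    echo "running $(uname -r): not the 5.15 series"
fi

# Possible follow-ups (assumptions; check package names with apt first):
#   sudo apt install linux-generic                        # metapackage named in the release notes
#   sudo apt-mark hold linux-image-5.15.0-76-generic      # example version; keeps apt from replacing it
```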

Answer 3

@Anjum Sayed, could you explain in more detail how you got it back? I'm on a dual-boot Windows 10/Ubuntu 20.04 desktop and have hit the same problem: I can no longer see the RTX 3090 GPU:

Loading new nvidia-465.19.01 DKMS files...
Building for 5.15.0-76-generic
Building for architecture x86_64
Building initial module for 5.15.0-76-generic
ERROR: Cannot create report: [Errno 17] File exists: '/var/crash/nvidia-dkms-465.0.crash'
Error! Bad return status for module build on kernel: 5.15.0-76-generic (x86_64)
Consult /var/lib/dkms/nvidia/465.19.01/build/make.log for more information.
dpkg: error processing package nvidia-dkms-465 (--configure):
installed nvidia-dkms-465 package post-installation script subprocess returned error exit status 10
dpkg: dependency problems prevent configuration of cuda-drivers-465:
cuda-drivers-465 depends on nvidia-dkms-465 (>= 465.19.01); however:
Package nvidia-dkms-465 is not configured yet.

dpkg: error processing package cuda-drivers-465 (--configure):
dependency problems - leaving unconfigured
No apport report written because the error message indicates its a followup error from a previous failure.
No apport report written because the error message indicates its a followup error from a previous failure.
dpkg: dependency problems prevent configuration of cuda-drivers:
cuda-drivers depends on cuda-drivers-465 (= 465.19.01-1); however:
Package cuda-drivers-465 is not configured yet.

dpkg: error processing package cuda-drivers (--configure):
dependency problems - leaving unconfigured
dpkg: dependency problems prevent configuration of nvidia-driver-465:
nvidia-driver-465 depends on nvidia-dkms-465 (= 465.19.01-0ubuntu1); however:
Package nvidia-dkms-465 is not configured yet.

dpkg: error processing package nvidia-driver-465 (--configure):
dependency problems - leaving unconfigured
dpkg: dependency problems prevent configuration of cuda-runtime-11-3:
cuda-runtime-11-3 depends on cuda-drivers (>= 465.19.01); however:
Package cuda-drivers is not configured yet.

dpkg: error processing package cuda-runtime-11-3 (--configure):
dependency problems - leaving unconfigured
dpkg: dependency problems prevent configuration of cuda-demo-suite-11-3:
cuda-demo-suite-11-3 depends on cuda-runtimeNo apport report written because MaxReports is reached already
No apport report written because MaxReports is reached already
No apport report written because MaxReports is reached already
-11-3; however:
Package cuda-runtime-11-3 is not configured yet.

dpkg: error processing package cuda-demo-suite-11-3 (--configure):
dependency problems - leaving unconfigured
dpkg: dependency problems prevent configuration of cuda-11-3:
cuda-11-3 depends on cuda-runtime-11-3 (>= 11.3.1); however:
Package cuda-runtime-11-3 is not configured yet.
cuda-11-3 depends on cuda-demo-suite-11-3 (>= 11.3.58); however:
Package cuda-demo-suite-11-3 is not configured yet.

dpkg: error processing package cuda-11-3 (--configure):
dependency problems - leaving unconfigured
dpkg: dependency problems prevent configuration of cuda:
cuda depends on cuda-11-3 (>= 11.3.1); however:
Package cuda-11-3 is not configured yet.

No apport report written because MaxReports is reached already
No apport report written because MaxReports is reached already
dpkg: error processing package cuda (--configure):
dependency problems - leaving unconfigured
Processing triggers for initramfs-tools (0.136ubuntu6.7) ...
update-initramfs: Generating /boot/initrd.img-5.15.0-76-generic
Errors were encountered while processing:
nvidia-dkms-465
cuda-drivers-465
cuda-drivers
nvidia-driver-465
cuda-runtime-11-3
cuda-demo-suite-11-3
cuda-11-3
cuda
E: Sub-process /usr/bin/dpkg returned an error code (1)
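
In the log above, only nvidia-dkms-465 fails on its own (the module build for 5.15.0-76-generic); every other package fails because of "dependency problems". A small sketch of pulling the root failure out of such a log follows. root_failures is a hypothetical helper, and it assumes each "dpkg: error processing package" line is followed directly by its reason, apart from interleaved apport notices.

```shell
#!/bin/sh
# Sketch: read saved apt/dpkg output on stdin and print only the packages whose
# failure is NOT a follow-up "dependency problems" error.
root_failures() {
    awk '
        /^No apport report/ { next }                      # skip interleaved apport notices
        /^dpkg: error processing package / { pkg = $5; next }
        pkg != "" {
            if ($0 !~ /dependency problems/) print pkg    # reason line decides root vs follow-up
            pkg = ""
        }
    '
}

# The build log path below is the one named in the output above; show the first
# error lines only if that file exists on this machine.
log=/var/lib/dkms/nvidia/465.19.01/build/make.log
if [ -r "$log" ]; then
    grep -n -i 'error' "$log" | head -n 20 || true
fi
```

With the apt output saved to a file (for example via `2>&1 | tee apt.log`), `root_failures < apt.log` would point at nvidia-dkms-465, and the make.log lines show why that build failed.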
