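nvidia-driver-510 does not recognize both the RTX 2060 and the RTX 3060 after installation on Ubuntu 20.04 (I have 2 NVIDIA GPUs and 1 Intel GPU)
This is a fresh system, installed after I replaced the hard drive. On the previous system, nvidia-driver-510 worked fine with both the 2060 and the 3060.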
$ lspci|grep -i vga
00:02.0 VGA compatible controller: Intel Corporation Xeon E3-1200 v2/3rd Gen Core processor Graphics Controller (rev 09)
01:00.0 VGA compatible controller: NVIDIA Corporation Device 2504 (rev a1)
02:00.0 VGA compatible controller: NVIDIA Corporation Device 1f03 (rev a1)
$ uname -a
Linux CMPLTRTOK-U20 5.15.0-72-generic #79~20.04.1-Ubuntu SMP Thu Apr 20 22:12:07 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
$ apt list linux-headers-$(uname -r)
Listing... Done
linux-headers-5.15.0-72-generic/focal-updates,focal-security,now 5.15.0-72.79~20.04.1 amd64 [installed,automatic]
$ apt list nvidia-driver-*
Listing... Done
nvidia-driver-390/focal-updates,focal-security 390.157-0ubuntu0.20.04.1 amd64
nvidia-driver-390/focal-updates,focal-security 390.157-0ubuntu0.20.04.1 i386
nvidia-driver-418-server/focal-updates,focal-security 418.226.00-0ubuntu0.20.04.2 amd64
nvidia-driver-418/focal 430.50-0ubuntu3 amd64
nvidia-driver-430/focal-updates,focal-security 440.100-0ubuntu0.20.04.1 amd64
nvidia-driver-435/focal-updates 455.45.01-0ubuntu0.20.04.1 amd64
nvidia-driver-440-server/focal-updates,focal-security 450.236.01-0ubuntu0.20.04.1 amd64
nvidia-driver-440/focal-updates,focal-security 450.119.03-0ubuntu0.20.04.1 amd64
nvidia-driver-450-server/focal-updates,focal-security 450.236.01-0ubuntu0.20.04.1 amd64
nvidia-driver-450/focal-updates,focal-security 460.91.03-0ubuntu0.20.04.1 amd64
nvidia-driver-455/focal-updates,focal-security 460.91.03-0ubuntu0.20.04.1 amd64
nvidia-driver-460-server/focal-updates,focal-security 470.182.03-0ubuntu0.20.04.1 amd64
nvidia-driver-460/focal-updates,focal-security 470.182.03-0ubuntu0.20.04.1 amd64
nvidia-driver-465/focal-updates,focal-security 470.182.03-0ubuntu0.20.04.1 amd64
nvidia-driver-470-server/focal-updates,focal-security 470.182.03-0ubuntu0.20.04.1 amd64
nvidia-driver-470/focal-updates,focal-security 470.182.03-0ubuntu0.20.04.1 amd64
nvidia-driver-495/focal-updates,focal-security 510.108.03-0ubuntu0.20.04.1 amd64
nvidia-driver-510-server/focal-updates,focal-security 515.105.01-0ubuntu0.20.04.1 amd64
nvidia-driver-510/focal-updates,focal-security,now 510.108.03-0ubuntu0.20.04.1 amd64 [installed]
nvidia-driver-515-open/focal-updates,focal-security 515.105.01-0ubuntu0.20.04.1 amd64
nvidia-driver-515-server/focal-updates,focal-security 515.105.01-0ubuntu0.20.04.1 amd64
nvidia-driver-515/focal-updates,focal-security 515.105.01-0ubuntu0.20.04.1 amd64
nvidia-driver-520-open/focal-updates,focal-security 525.116.04-0ubuntu0.20.04.1 amd64
nvidia-driver-520/focal-updates,focal-security 525.116.04-0ubuntu0.20.04.1 amd64
nvidia-driver-525-open/focal-updates,focal-security 525.116.04-0ubuntu0.20.04.1 amd64
nvidia-driver-525-server/focal-updates,focal-security 525.105.17-0ubuntu0.20.04.1 amd64
nvidia-driver-525/focal-updates,focal-security 525.116.04-0ubuntu0.20.04.1 amd64
nvidia-driver-530-open/focal-updates,focal-security 530.41.03-0ubuntu0.20.04.2 amd64
nvidia-driver-530/focal-updates,focal-security 530.41.03-0ubuntu0.20.04.2 amd64
$ apt list libnvidia*-510*
Listing... Done
libnvidia-cfg1-510-server/focal-updates,focal-security 515.105.01-0ubuntu0.20.04.1 amd64
libnvidia-cfg1-510/focal-updates,focal-security,now 510.108.03-0ubuntu0.20.04.1 amd64 [installed,automatic]
libnvidia-common-510-server/focal-updates,focal-updates,focal-security,focal-security 515.105.01-0ubuntu0.20.04.1 all
libnvidia-common-510/focal-updates,focal-updates,focal-security,focal-security,now 510.108.03-0ubuntu0.20.04.1 all [installed,automatic]
libnvidia-compute-510-server/focal-updates,focal-security 515.105.01-0ubuntu0.20.04.1 amd64
libnvidia-compute-510-server/focal-updates,focal-security 515.105.01-0ubuntu0.20.04.1 i386
libnvidia-compute-510/focal-updates,focal-security,now 510.108.03-0ubuntu0.20.04.1 amd64 [installed,automatic]
libnvidia-compute-510/focal-updates,focal-security,now 510.108.03-0ubuntu0.20.04.1 i386 [installed,automatic]
libnvidia-decode-510-server/focal-updates,focal-security 515.105.01-0ubuntu0.20.04.1 amd64
libnvidia-decode-510-server/focal-updates,focal-security 515.105.01-0ubuntu0.20.04.1 i386
libnvidia-decode-510/focal-updates,focal-security,now 510.108.03-0ubuntu0.20.04.1 amd64 [installed,automatic]
libnvidia-decode-510/focal-updates,focal-security,now 510.108.03-0ubuntu0.20.04.1 i386 [installed,automatic]
libnvidia-encode-510-server/focal-updates,focal-security 515.105.01-0ubuntu0.20.04.1 amd64
libnvidia-encode-510-server/focal-updates,focal-security 515.105.01-0ubuntu0.20.04.1 i386
libnvidia-encode-510/focal-updates,focal-security,now 510.108.03-0ubuntu0.20.04.1 amd64 [installed,automatic]
libnvidia-encode-510/focal-updates,focal-security,now 510.108.03-0ubuntu0.20.04.1 i386 [installed,automatic]
libnvidia-extra-510-server/focal-updates,focal-security 515.105.01-0ubuntu0.20.04.1 amd64
libnvidia-extra-510-server/focal-updates,focal-security 515.105.01-0ubuntu0.20.04.1 i386
libnvidia-extra-510/focal-updates,focal-security,now 510.108.03-0ubuntu0.20.04.1 amd64 [installed,automatic]
libnvidia-extra-510/focal-updates,focal-security 510.108.03-0ubuntu0.20.04.1 i386
libnvidia-fbc1-510-server/focal-updates,focal-security 515.105.01-0ubuntu0.20.04.1 amd64
libnvidia-fbc1-510-server/focal-updates,focal-security 515.105.01-0ubuntu0.20.04.1 i386
libnvidia-fbc1-510/focal-updates,focal-security,now 510.108.03-0ubuntu0.20.04.1 amd64 [installed,automatic]
libnvidia-fbc1-510/focal-updates,focal-security,now 510.108.03-0ubuntu0.20.04.1 i386 [installed,automatic]
libnvidia-gl-510-server/focal-updates,focal-security 515.105.01-0ubuntu0.20.04.1 amd64
libnvidia-gl-510-server/focal-updates,focal-security 515.105.01-0ubuntu0.20.04.1 i386
libnvidia-gl-510/focal-updates,focal-security,now 510.108.03-0ubuntu0.20.04.1 amd64 [installed,automatic]
libnvidia-gl-510/focal-updates,focal-security,now 510.108.03-0ubuntu0.20.04.1 i386 [installed,automatic]
libnvidia-nscq-510/focal-updates,focal-security 515.105.01-0ubuntu0.20.04.1 amd64
$ nvidia-smi -q
==============NVSMI LOG==============
Timestamp : Thu Jun 22 17:05:49 2023
Driver Version : 510.108.03
CUDA Version : 11.6
Attached GPUs : 1
GPU 00000000:01:00.0
Product Name : NVIDIA GeForce RTX 3060
Product Brand : GeForce
Product Architecture : Ampere
Display Mode : Disabled
Display Active : Disabled
Persistence Mode : Disabled
MIG Mode
Current : N/A
Pending : N/A
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : N/A
GPU UUID : GPU-5f52334c-826a-7900-978b-7fcb937de6ea
Minor Number : 0
VBIOS Version : 94.06.2F.00.9A
MultiGPU Board : No
Board ID : 0x100
GPU Part Number : N/A
Module ID : 0
Inforom Version
Image Version : G001.0000.03.03
OEM Object : 2.0
ECC Object : N/A
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : N/A
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x01
Device : 0x00
Domain : 0x0000
Device Id : 0x250410DE
Bus Id : 00000000:01:00.0
Sub System Id : 0x397D1462
GPU Link Info
PCIe Generation
Max : 3
Current : 1
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Fan Speed : 0 %
Performance State : P8
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 12288 MiB
Reserved : 235 MiB
Used : 3 MiB
Free : 12049 MiB
BAR1 Memory Usage
Total : 256 MiB
Used : 3 MiB
Free : 253 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : N/A
Pending : N/A
ECC Errors
Volatile
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Aggregate
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows : N/A
Temperature
GPU Current Temp : 43 C
GPU Shutdown Temp : 98 C
GPU Slowdown Temp : 95 C
GPU Max Operating Temp : 93 C
GPU Target Temperature : 83 C
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
Power Readings
Power Management : Supported
Power Draw : 15.30 W
Power Limit : 170.00 W
Default Power Limit : 170.00 W
Enforced Power Limit : 170.00 W
Min Power Limit : 100.00 W
Max Power Limit : 170.00 W
Clocks
Graphics : 210 MHz
SM : 210 MHz
Memory : 405 MHz
Video : 555 MHz
Applications Clocks
Graphics : N/A
Memory : N/A
Default Applications Clocks
Graphics : N/A
Memory : N/A
Max Clocks
Graphics : 2130 MHz
SM : 2130 MHz
Memory : 7501 MHz
Video : 1950 MHz
Max Customer Boost Clocks
Graphics : N/A
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : 662.500 mV
Processes
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 1151
Type : G
Name : /usr/bin/gnome-shell
Used GPU Memory : 2 MiB
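The mismatch in the output above (lspci lists two NVIDIA controllers, nvidia-smi reports "Attached GPUs : 1") can be checked mechanically. The following is only a minimal sketch in Python, assuming the text formats shown in this post; the function names are hypothetical and not part of any NVIDIA tool.

```python
import re

# Start of an lspci line, e.g. "01:00.0 VGA compatible controller: NVIDIA ..."
_LSPCI_LINE = re.compile(r"^([0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7])\s+(.*)$")
# GPU header line in `nvidia-smi -q`, e.g. "GPU 00000000:01:00.0"
_SMI_GPU = re.compile(
    r"^\s*GPU\s+[0-9a-fA-F]{4,8}:([0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7])\s*$"
)
# Assumption: display GPUs show up under one of these lspci class names.
_GPU_CLASSES = ("vga compatible controller", "3d controller")


def nvidia_devices_on_bus(lspci_text):
    """Return the bus addresses (bus:dev.fn) of NVIDIA GPUs listed by lspci."""
    found = set()
    for line in lspci_text.splitlines():
        match = _LSPCI_LINE.match(line.strip())
        if not match:
            continue
        description = match.group(2).lower()
        if "nvidia" in description and description.startswith(_GPU_CLASSES):
            found.add(match.group(1).lower())
    return found


def gpus_attached_to_driver(smi_text):
    """Return the bus addresses of GPUs that `nvidia-smi -q` reports."""
    found = set()
    for line in smi_text.splitlines():
        match = _SMI_GPU.match(line)
        if match:
            found.add(match.group(1).lower())
    return found


def missing_from_driver(lspci_text, smi_text):
    """Return GPUs present on the PCI bus but not reported by nvidia-smi."""
    on_bus = nvidia_devices_on_bus(lspci_text)
    attached = gpus_attached_to_driver(smi_text)
    return sorted(on_bus - attached)
```

Both arguments are plain text, so they can be pasted from a terminal or captured with `subprocess.run(["lspci"], capture_output=True, text=True).stdout` and the same call for `["nvidia-smi", "-q"]`. With the output in this post, the 2060 at 02:00.0 is the device that would be reported as missing.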
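Answer 1
I don't know why, but after I posted this question, it started working once I had rebooted the system twice.
The details are below. I am writing them down in the hope that they give others a clue, and that an expert can explain the root cause. (I include some details about the nvtop installation. I did not think they mattered before, but now I suspect they do.)
Details:
(1) I have an Ubuntu 20.04 machine with an RTX 3060 (PCI-E 3.0 x16) and Intel integrated graphics. Later I attached an RTX 2060 to the system over PCI-E 2.0 x4.
(2) About a week ago, I installed nvidia-driver-510 via apt and rebooted immediately afterwards.
(3) Yesterday I installed nvtop so that I could monitor the GPUs before verifying them. To install nvtop, I fetched the source from git and built it. Following the build errors reported by nvtop's cmake, I installed these packages via apt: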
libsystemd-dev
libdrm-dev
libgtest-dev
libudev-dev
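(4) I verified the GPUs by running distributed data parallel training in PyTorch (torch-1.10.1+cu113-p38-linux). However, PyTorch could not find any GPU. At that point, nvtop found only the Intel GPU and the RTX 3060, and nvidia-smi found only the RTX 3060, while lspci found the Intel GPU and both NVIDIA GPUs. I then posted this question, shut the machine down, and went to bed.
(5) (First reboot.) After I woke up, I powered the system on. nvtop found the Intel GPU and both NVIDIA GPUs, but nvidia-smi failed with the following error: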
$ nvidia-smi -q
Unable to determine the device handle for GPU 0000:02:00.0: Unknown Error
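(6) I tested with PyTorch as before, and it reported a CUDA error. The process hung: I could not abort it, and could not even kill it with -9. It used 100% of one CPU core. Even after I ran "sudo poweroff", the system stayed on, so I powered it off physically.
(7) (Second reboot.) This was the second reboot after I installed nvtop. nvtop, nvidia-smi, and the PyTorch code all worked. I verified this by running the PyTorch code while monitoring with nvtop. It works!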
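For anyone who wants to repeat the PyTorch side of this check without starting a full distributed run, here is a minimal sketch. It is not the code I actually ran. The function names and the substring matching on model names are my own assumptions; the `torch.cuda` calls are the standard ones.

```python
def visible_cuda_devices():
    """Ask PyTorch which CUDA devices it can use and return their names.

    torch is imported lazily so that missing_gpus() below still works
    on a machine where PyTorch is not installed.
    """
    import torch

    if not torch.cuda.is_available():
        return []
    names = []
    for index in range(torch.cuda.device_count()):
        # Touch each device with a tiny tensor so that a broken GPU fails
        # here instead of in the middle of a training run.
        torch.ones(1, device=f"cuda:{index}").sum().item()
        names.append(torch.cuda.get_device_name(index))
    return names


def missing_gpus(expected, visible):
    """Return the expected model strings that match no visible device name.

    Assumption: each expected string (for example "RTX 2060") appears as a
    substring of the device name that PyTorch reports.
    """
    lowered = [name.lower() for name in visible]
    return [
        model
        for model in expected
        if not any(model.lower() in name for name in lowered)
    ]
```

Calling `missing_gpus(["RTX 3060", "RTX 2060"], visible_cuda_devices())` should return an empty list in the working state of step (7). In the state of step (4), where PyTorch found no GPU, it would return both models. Note that in a state like step (6) the tiny tensor operation might itself hang, so this is a quick sanity check, not a safeguard.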