NVIDIA driver not present on Debian Bullseye after installing CUDA

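I am trying to install/upgrade the NVIDIA GPU driver and related software on a Debian Bullseye system, but I have run into trouble. I followed the CUDA installation instructions, but when I reached step 13.2.1, "Install Persistence Daemon", it failed with this error: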

nvidia-persistenced failed to initialize. Check syslog for more details.
logfile shows:
  Failed to query NVIDIA devices. Please ensure that the NVIDIA device files (/dev/nvidia*) exist, and that user 0 has read and write permissions for those files.

There are no nvidia files in /dev.
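
The absence of device nodes can be confirmed by counting them directly. The following is a minimal sketch, not part of the original post; the function name `count_nvidia_nodes` and its optional directory argument are assumptions introduced here for illustration.

```shell
# Count NVIDIA device nodes (nvidia0, nvidiactl, nvidia-uvm, ...) in a directory.
# The directory argument is hypothetical, added so the check can be tried
# against a scratch directory; it defaults to /dev.
count_nvidia_nodes() {
    dir="${1:-/dev}"
    count=0
    for node in "$dir"/nvidia*; do
        # An unmatched glob stays literal, so confirm the path really exists.
        if [ -e "$node" ]; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}

count_nvidia_nodes /dev
```

A result of 0 is consistent with the nvidia-persistenced error above. The nodes normally appear only once the nvidia kernel module is loaded, so `lsmod | grep nvidia` may be a useful next check.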

/usr/local/ contains the following:

$ ls -dl /usr/local/cuda*
lrwxrwxrwx  1 root root   22 Sep 30 20:15 /usr/local/cuda -> /etc/alternatives/cuda
drwxr-xr-x 16 root root 4096 Jun 16 16:35 /usr/local/cuda-11.3
lrwxrwxrwx  1 root root   25 Sep 30 20:15 /usr/local/cuda-12 -> /etc/alternatives/cuda-12
drwxr-xr-x 15 root root 4096 Sep 30 20:15 /usr/local/cuda-12.2
$ ls -dl /etc/alternatives/cuda*
lrwxrwxrwx 1 root root 20 Sep 30 20:15 /etc/alternatives/cuda -> /usr/local/cuda-12.2
lrwxrwxrwx 1 root root 20 Sep 30 20:15 /etc/alternatives/cuda-12 -> /usr/local/cuda-12.2
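
The symlink chain in this listing can be followed in one step. This is a small sketch under the assumption that GNU `readlink` is available (it is part of coreutils on Debian); the helper name `resolve_cuda_dir` is hypothetical.

```shell
# Resolve a CUDA symlink chain to the real toolkit directory.
# The default path /usr/local/cuda is taken from the listing above.
resolve_cuda_dir() {
    link="${1:-/usr/local/cuda}"
    if [ -e "$link" ]; then
        readlink -f "$link"
    else
        echo "missing: $link" >&2
        return 1
    fi
}

# Do not abort when CUDA is not installed on the machine running this.
resolve_cuda_dir /usr/local/cuda || true
```

With the links shown above, this should resolve to /usr/local/cuda-12.2, so the leftover /usr/local/cuda-11.3 directory is not the one selected. If the links were registered through the Debian alternatives system, `update-alternatives --display cuda` would show the same information.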

The GPU appears to be there:

$ sudo nvidia-smi
Sat Sep 30 21:51:02 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-40GB          Off | 00000000:00:04.0 Off |                    0 |
| N/A   32C    P0              49W / 400W |      4MiB / 40960MiB |     26%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

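When this GCE system was originally built, it had a working cuda-11 installation, but I am afraid I have messed everything up and I am not sure how to proceed.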

Answer 1

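I am not sure exactly what went wrong, but I fixed it by completely removing the installed CUDA toolkit and drivers (both the standalone installation and the one in the conda environment) and then reinstalling. The cause may have been that the conda-installed packages were an older version that I had not removed at first.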

$ sudo apt-get --purge remove "*cuda*" "*cublas*" "*cufft*" "*cufile*" "*curand*" "*cusolver*" "*cusparse*" "*gds-tools*" "*npp*" "*nvjpeg*" "nsight*" "*nvvm*"
$ sudo apt-get --purge remove "*nvidia*" "libxnvctrl*"
$ sudo /opt/conda/condabin/conda remove cuda
$ wget https://developer.download.nvidia.com/compute/cuda/repos/debian11/x86_64/cuda-keyring_1.1-1_all.deb
$ sudo dpkg -i cuda-keyring_1.1-1_all.deb
$ sudo apt-get update
$ sudo apt-get install cuda
# apt printed a message about mismatched drivers and suggested a reboot.
# Exit the GCE shell, stop the VM, start it again, and open a new shell.
$ export PATH=/usr/local/cuda-12.2/bin${PATH:+:${PATH}}
$ git clone https://github.com/nvidia/cuda-samples
# Continue the installation verification by building and running the samples.
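
Before building the samples, it may help to check after the reboot that the driver and the toolkit are both visible. The following is a hedged sketch rather than part of the original answer; the helper name `parse_nvcc_release` is hypothetical, and the script only prints what it finds without changing anything.

```shell
# Extract the toolkit release (for example "12.2") from `nvcc --version`
# output supplied on stdin.
parse_nvcc_release() {
    sed -n 's/.*release \([0-9][0-9.]*\),.*/\1/p'
}

# Report the driver version when nvidia-smi is on PATH.
if command -v nvidia-smi >/dev/null 2>&1; then
    echo "driver:  $(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
else
    echo "driver:  nvidia-smi not found"
fi

# Report the toolkit release when nvcc is on PATH.
if command -v nvcc >/dev/null 2>&1; then
    echo "toolkit: $(nvcc --version | parse_nvcc_release)"
else
    echo "toolkit: nvcc not found (is /usr/local/cuda-12.2/bin on PATH?)"
fi
```

If the driver line shows an error instead of a version, the kernel module is probably still not loaded and the reinstall did not take effect.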
