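I am trying to install/upgrade the NVIDIA GPU driver and related software on a Debian Bullseye system, but I have run into trouble. I followed the CUDA installation instructions, but at step 13.2.1, "Install Persistence Daemon", it fails with the error: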
nvidia-persistenced failed to initialize. Check syslog for more details.
The log file shows:
Failed to query NVIDIA devices. Please ensure that the NVIDIA device files (/dev/nvidia*) exist, and that user 0 has read and write permissions for those files.
There are no nvidia files in /dev.
/usr/local/ contains the following:
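A quick way to confirm the missing device files is to count the entries under /dev whose names start with "nvidia". This is only a minimal sketch: the function name `count_nvidia_nodes` is made up for illustration, and the directory argument exists so the check can be pointed somewhere other than /dev.

```shell
#!/bin/sh
# count_nvidia_nodes DIR: print how many entries in DIR have a name starting with "nvidia".
# Hypothetical helper for illustration; DIR defaults to /dev.
count_nvidia_nodes() {
    dir="${1:-/dev}"
    count=0
    for node in "$dir"/nvidia*; do
        # An unmatched glob stays literal, so check that the entry really exists.
        if [ -e "$node" ]; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}

count_nvidia_nodes /dev
```

A result of 0 matches the situation described by the persistence daemon's log message.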
$ ls -dl /usr/local/cuda*
lrwxrwxrwx 1 root root 22 Sep 30 20:15 /usr/local/cuda -> /etc/alternatives/cuda
drwxr-xr-x 16 root root 4096 Jun 16 16:35 /usr/local/cuda-11.3
lrwxrwxrwx 1 root root 25 Sep 30 20:15 /usr/local/cuda-12 -> /etc/alternatives/cuda-12
drwxr-xr-x 15 root root 4096 Sep 30 20:15 /usr/local/cuda-12.2
$ ls -dl /etc/alternatives/cuda*
lrwxrwxrwx 1 root root 20 Sep 30 20:15 /etc/alternatives/cuda -> /usr/local/cuda-12.2
lrwxrwxrwx 1 root root 20 Sep 30 20:15 /etc/alternatives/cuda-12 -> /usr/local/cuda-12.2
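The two listings above form a symlink chain (/usr/local/cuda -> /etc/alternatives/cuda -> /usr/local/cuda-12.2). As a rough sketch, the chain can be resolved in one step with `readlink -f`; the helper name `resolve_link` is made up for illustration.

```shell
#!/bin/sh
# resolve_link PATH: print the final target of a symlink chain.
# Hypothetical helper; it only wraps readlink -f.
resolve_link() {
    readlink -f "$1"
}

for link in /usr/local/cuda /usr/local/cuda-12; do
    if [ -e "$link" ]; then
        echo "$link -> $(resolve_link "$link")"
    else
        echo "$link does not exist"
    fi
done
```

This shows which toolkit directory the generic /usr/local/cuda path ends up at, without following each hop by hand.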
The GPU seems to be there:
$ sudo nvidia-smi
Sat Sep 30 21:51:02 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05 Driver Version: 535.104.05 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100-SXM4-40GB Off | 00000000:00:04.0 Off | 0 |
| N/A 32C P0 49W / 400W | 4MiB / 40960MiB | 26% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
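When this GCE system was first built, it had a working CUDA 11 installation, but I am afraid I have messed everything up and am not sure how to proceed.
Answer 1
I am not sure exactly what went wrong, but I fixed it by completely removing the installed CUDA packages and drivers (both the standalone install and the one in the conda environment) and then reinstalling. The cause may have been that conda had installed an older version that was never removed.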
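Before purging anything, it may help to see what is installed in both places. The following is only a sketch, not part of the original fix: `filter_gpu_packages` is a made-up helper, and the conda path is the one used in the commands in this answer.

```shell
#!/bin/sh
# filter_gpu_packages: read lines on stdin and keep those that mention cuda or nvidia.
# Hypothetical helper; "|| true" keeps a no-match result from counting as a failure.
filter_gpu_packages() {
    grep -i -E 'cuda|nvidia' || true
}

# Packages known to dpkg (skipped if dpkg-query is not available).
if command -v dpkg-query >/dev/null 2>&1; then
    dpkg-query -W -f='${Package}\n' | filter_gpu_packages
fi

# Packages in the conda environment (skipped if conda is not at this path).
conda_bin=/opt/conda/condabin/conda
if [ -x "$conda_bin" ]; then
    "$conda_bin" list | filter_gpu_packages
fi
```

Running it again after the removal steps gives a rough indication of whether anything was left behind.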
$ sudo apt-get --purge remove "*cuda*" "*cublas*" "*cufft*" "*cufile*" "*curand*" "*cusolver*" "*cusparse*" "*gds-tools*" "*npp*" "*nvjpeg*" "nsight*" "*nvvm*"
$ sudo apt-get --purge remove "*nvidia*" "libxnvctrl*"
$ sudo /opt/conda/condabin/conda remove cuda
$ wget https://developer.download.nvidia.com/compute/cuda/repos/debian11/x86_64/cuda-keyring_1.1-1_all.deb
$ sudo dpkg -i cuda-keyring_1.1-1_all.deb
$ sudo apt-get update
$ sudo apt-get install cuda
(This printed a message about mismatched drivers and suggested a reboot.)
Exit the GCE shell, stop the VM, restart the VM, and open a new shell.
$ export PATH=/usr/local/cuda-12.2/bin${PATH:+:${PATH}}
$ git clone https://github.com/nvidia/cuda-samples
Continue with installation verification by building and running the samples.
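Before building the samples, it can be worth confirming that the PATH change took effect in the new shell. A minimal sketch, assuming the cuda-12.2 path from the export line in this answer; `path_has_dir` is a made-up helper name.

```shell
#!/bin/sh
# path_has_dir DIR: succeed if DIR is one of the colon-separated entries in $PATH.
# Hypothetical helper for illustration.
path_has_dir() {
    case ":$PATH:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}

cuda_bin=/usr/local/cuda-12.2/bin
if path_has_dir "$cuda_bin"; then
    echo "$cuda_bin is on PATH"
else
    echo "$cuda_bin is not on PATH"
fi

# command -v prints the path of the nvcc the shell would run, if there is one.
command -v nvcc || echo "nvcc not found on PATH"
```

If nvcc resolves to a different directory, an older toolkit earlier on PATH is probably taking precedence.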