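I am trying to install CUDA 11.1, both the runtime API and the driver for my GPUs.
I am running Ubuntu 18.04 on x86_64. I tried to upgrade my CUDA runtime to 11.1 but did not succeed. The driver was updated, but the runtime API was not.
nvidia-smi
shows that I have upgraded to 11.0, but
nvcc -V
shows that the installed runtime API is version 10.0.130.
I am following the instructions at https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html
I will go through the commands in the order they appear in the guide.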
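The two numbers above measure different things: the CUDA version in the nvidia-smi header is the highest CUDA version the installed driver supports, while nvcc -V reports the version of the installed toolkit. A minimal sketch of comparing the two, assuming GNU sort (for -V) is available; the function names are hypothetical, and the sample lines are taken from the output quoted in this post:

```shell
#!/bin/sh
# Hypothetical helpers; each reads command output on stdin.

# Print X.Y from the "CUDA Version: X.Y" field of the nvidia-smi header.
driver_cuda_version() {
  sed -n 's/.*CUDA Version: *\([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p' | head -n 1
}

# Print X.Y from the "release X.Y" field of nvcc -V.
toolkit_cuda_version() {
  sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p' | head -n 1
}

# Succeed if version $1 <= version $2 (assumes GNU sort -V).
version_le() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Sample lines copied from this post instead of live command output.
drv=$(echo '| NVIDIA-SMI 450.66 Driver Version: 450.66 CUDA Version: 11.0 |' | driver_cuda_version)
tk=$(echo 'Cuda compilation tools, release 10.0, V10.0.130' | toolkit_cuda_version)

if version_le "$tk" "$drv"; then
  echo "toolkit $tk is not newer than the CUDA version the driver reports ($drv)"
else
  echo "toolkit $tk is newer than the CUDA version the driver reports ($drv)"
fi
```

On a real system, pipe nvidia-smi and nvcc -V into the two functions in place of the sample lines.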
Section 2: Pre-installation Actions
lspci | grep -i nvidia
gives
19:00.0 VGA compatible controller: NVIDIA Corporation Device 1e04 (rev a1)
19:00.1 Audio device: NVIDIA Corporation Device 10f7 (rev a1)
19:00.2 USB controller: NVIDIA Corporation Device 1ad6 (rev a1)
19:00.3 Serial bus controller [0c80]: NVIDIA Corporation Device 1ad7 (rev a1)
1a:00.0 VGA compatible controller: NVIDIA Corporation Device 1e04 (rev a1)
1a:00.1 Audio device: NVIDIA Corporation Device 10f7 (rev a1)
1a:00.2 USB controller: NVIDIA Corporation Device 1ad6 (rev a1)
1a:00.3 Serial bus controller [0c80]: NVIDIA Corporation Device 1ad7 (rev a1)
67:00.0 VGA compatible controller: NVIDIA Corporation Device 1e04 (rev a1)
67:00.1 Audio device: NVIDIA Corporation Device 10f7 (rev a1)
67:00.2 USB controller: NVIDIA Corporation Device 1ad6 (rev a1)
67:00.3 Serial bus controller [0c80]: NVIDIA Corporation Device 1ad7 (rev a1)
68:00.0 VGA compatible controller: NVIDIA Corporation Device 1e04 (rev a1)
68:00.1 Audio device: NVIDIA Corporation Device 10f7 (rev a1)
68:00.2 USB controller: NVIDIA Corporation Device 1ad6 (rev a1)
68:00.3 Serial bus controller [0c80]: NVIDIA Corporation Device 1ad7 (rev a1)
uname -m && cat /etc/*release
gives
x86_64
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=18.04
DISTRIB_CODENAME=bionic
DISTRIB_DESCRIPTION="Ubuntu 18.04.3 LTS"
NAME="Ubuntu"
VERSION="18.04.3 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.3 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic
gcc --version
gives
gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
uname -r
gives
5.4.0-51-generic
sudo apt-get install linux-headers-$(uname -r)
gives
Reading package lists... Done
Building dependency tree
Reading state information... Done
linux-headers-5.4.0-51-generic is already the newest version (5.4.0-51.56~18.04.1).
linux-headers-5.4.0-51-generic set to manually installed.
The following packages were automatically installed and are no longer required:
dkms libaccinj64-10.0 libatomic1:i386 libboost-python1.65.1 libbsd0:i386 libc-ares2 libcublas10.0 libcudnn7 libcufft10.0 libcufftw10.0 libcuinj64-10.0 libcupti-dev libcupti-doc libcupti10.0 libcurand10.0
libcusolver10.0 libcusparse10.0 libdrm-amdgpu1:i386 libdrm-intel1:i386 libdrm-nouveau2:i386 libdrm-radeon1:i386 libdrm2:i386 libedit2:i386 libelf1:i386 libexpat1:i386 libffi6:i386 libgflags2.2 libgl1:i386
libgl1-mesa-dri:i386 libglapi-mesa:i386 libglvnd0:i386 libglx-mesa0:i386 libglx0:i386 libgoogle-glog0v5 libgrpc7 libjs-sphinxdoc libleveldb1v5 libllvm10:i386 liblmdb0 libnppc10.0 libnppial10.0 libnppicc10.0
libnppicom10.0 libnppidei10.0 libnppif10.0 libnppig10.0 libnppim10.0 libnppist10.0 libnppisu10.0 libnppitc10.0 libnpps10.0 libnvblas10.0 libnvgraph10.0 libnvidia-cfg1-450 libnvidia-common-450
libnvidia-compute-450:i386 libnvidia-decode-450 libnvidia-decode-450:i386 libnvidia-encode-450 libnvidia-encode-450:i386 libnvidia-extra-450 libnvidia-extra-450:i386 libnvidia-fbc1-450 libnvidia-fbc1-450:i386
libnvidia-gl-450 libnvidia-gl-450:i386 libnvidia-ifr1-450 libnvidia-ifr1-450:i386 libnvrtc10.0 libnvtoolsext1 libnvvm3 libpciaccess0:i386 libprotobuf18 libprotoc18 libsensors4:i386 libsleef3 libstdc++6:i386
libthrust-dev libvdpau-dev libx11-6:i386 libx11-xcb1:i386 libxau6:i386 libxcb-dri2-0:i386 libxcb-dri3-0:i386 libxcb-glx0:i386 libxcb-present0:i386 libxcb-sync1:i386 libxcb1:i386 libxdamage1:i386 libxdmcp6:i386
libxext6:i386 libxfixes3:i386 libxnvctrl0 libxshmfence1:i386 libxxf86vm1:i386 pkg-config protobuf-compiler python-absl python-astor python-cffi python-configparser python-future python-gast python-grpcio
python-leveldb python-networkx python-pasta python-ply python-protobuf python-pycparser python-pywt python-skimage python-skimage-lib python-termcolor python-typing python-wrapt python3-absl python3-astor
python3-cffi python3-future python3-gast python3-grpcio python3-leveldb python3-markdown python3-networkx python3-pasta python3-ply python3-pycparser python3-pyinotify python3-pywt python3-skimage python3-skimage-lib
python3-tensorflow-serving python3-termcolor python3-werkzeug python3-wrapt screen-resolution-extra xserver-xorg-video-nvidia-450
Use 'sudo apt autoremove' to remove them.
0 upgraded, 0 newly installed, 0 to remove and 179 not upgraded.
2.7. Handle Conflicting Installation Methods
I ran the following commands
sudo /usr/bin/nvidia-uninstall
sudo apt-get --purge remove cuda*
sudo apt-get --purge remove nvidia*
sudo apt-get --purge remove libcuda*
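I looked for
sudo /usr/local/cuda-X.Y/bin/uninstall_cuda_X.Y.pl
but there was no file with that name in bin, so I assume the previous CUDA was not installed from a runfile.
I checked both nvidia-smi
and nvcc -V
and both times the command was not found. Yet whenever I run the installer, I keep getting a warning that a previous installation exists:
Existing package manager installation of the driver found. It is strongly recommended that you remove this before continuing.
So I tried some other ways to remove the CUDA installation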
sudo apt-get --purge remove cuda-11.0
sudo apt-get --purge remove cuda-11.1
sudo apt-get --purge remove cuda-10.0
sudo apt-get purge nvidia*
sudo apt-get remove --purge cuda-* libcuda* nvidia*
sudo rm /etc/apt/sources.list.d/cuda*
sudo apt remove --autoremove nvidia-cuda-toolkit
sudo dpkg -l | grep nvidia
sudo apt purge cuda
sudo apt purge -y nvidia
sudo apt remove -y nvidia-*
sudo rm /etc/apt/sources.list.d/cuda*
sudo apt autoremove -y && apt autoclean -y
sudo rm -rf /usr/local/cuda*
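After this many purge attempts it can help to check what is actually still registered with dpkg. A rough sketch, assuming the usual dpkg -l column layout (status, name, version, architecture, description); the function name is hypothetical, and the sample rows use placeholder versions for illustration only:

```shell
#!/bin/sh
# Hypothetical helper: read `dpkg -l` output on stdin and print the status and
# name of every package whose name mentions cuda or nvidia and that is still
# installed (ii), held (hi), or removed with config files left behind (rc).
leftover_gpu_packages() {
  awk '$1 ~ /^(ii|hi|rc)$/ && $2 ~ /(cuda|nvidia)/ { print $1, $2 }'
}

# Illustrative rows in dpkg -l layout; the versions are placeholders.
leftover_gpu_packages <<'EOF'
ii  libnvidia-common-450  0.0-placeholder  all    placeholder description
rc  cuda-toolkit-10-0     0.0-placeholder  amd64  placeholder description
ii  gcc                   0.0-placeholder  amd64  placeholder description
EOF
```

On a real system the input would be dpkg -l itself, for example dpkg -l | leftover_gpu_packages. Separately, unquoted patterns such as cuda* are subject to shell filename expansion when a matching file (for example a downloaded runfile) sits in the current directory; quoting the pattern avoids that.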
Section 6: Runfile Installation
6.3. Disabling Nouveau
I ran the following command
touch /etc/modprobe.d/blacklist-nouveau.conf
and added
blacklist nouveau
options nouveau modeset=0
to that file. Then I ran
sudo update-initramfs -u
which resulted in
update-initramfs: Generating /boot/initrd.img-5.4.0-52-generic
Then I checked whether
lsmod | grep nouveau
prints anything, and it does not.
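The blacklist step can be sketched as follows. This assumes the two lines from the guide are the entire file; the function names are hypothetical, and the target path is a parameter so the sketch can be tried on a scratch file first (writing to /etc/modprobe.d requires root):

```shell
#!/bin/sh
# Hypothetical helper: write the two blacklist lines to the file given as $1.
write_nouveau_blacklist() {
  printf '%s\n' 'blacklist nouveau' 'options nouveau modeset=0' > "$1"
}

# Hypothetical helper: read lsmod output on stdin and succeed only if a
# module named exactly "nouveau" is listed.
nouveau_loaded() {
  grep -q '^nouveau[[:space:]]'
}

# Try it on a scratch file rather than the real configuration directory.
scratch=$(mktemp)
write_nouveau_blacklist "$scratch"
cat "$scratch"
rm -f "$scratch"
```

On the real system the target would be /etc/modprobe.d/blacklist-nouveau.conf, followed by regenerating the initramfs and rebooting; lsmod | nouveau_loaded should then fail, meaning the module is not loaded.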
Then I tried the runfile installation.
The download page gives these commands
wget https://developer.download.nvidia.com/compute/cuda/11.1.0/local_installers/cuda_11.1.0_455.23.05_linux.run
sudo sh cuda_11.1.0_455.23.05_linux.run
I downloaded the installer and ran sudo sh cuda_11.1.0_455.23.05_linux.run
which produced this message
Installation failed. See log at /var/log/cuda-installer.log for details.
I opened that file, and its contents were as follows
[INFO]: Driver not installed.
[INFO]: Checking compiler version...
[INFO]: gcc location: /usr/bin/gcc
[INFO]: gcc version: gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)
[INFO]: Initializing menu
[INFO]: Setup complete
[INFO]: Components to install:
[INFO]: Driver
[INFO]: 455.23.05
[INFO]: Executing NVIDIA-Linux-x86_64-455.23.05.run --ui=none --no-questions --accept-license --disable-nouveau --no-cc-version-check --install-libglvnd 2>&1
[INFO]: Finished with code: 256
[ERROR]: Install of driver component failed.
[ERROR]: Install of 455.23.05 failed, quitting
So it looks like the driver installation failed. I am not sure what caused this error, since 11.0 was previously installed for these GPUs.
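The CUDA installer log only records that the bundled driver installer failed. The driver installer normally writes its own log to /var/log/nvidia-installer.log, which is usually where the actual reason appears. A small sketch for pulling the relevant lines out of cuda-installer.log; the function name is hypothetical, and reading "code: 256" as a raw wait status (that is, exit status 1) is an assumption:

```shell
#!/bin/sh
# Hypothetical helper: read a cuda-installer.log on stdin and print only the
# ERROR lines and the line reporting the sub-installer's exit code.
installer_failures() {
  grep -E '^\[ERROR\]|Finished with code'
}

# Lines copied from the log quoted in this post.
installer_failures <<'EOF'
[INFO]: Driver not installed.
[INFO]: Components to install:
[INFO]: Finished with code: 256
[ERROR]: Install of driver component failed.
[ERROR]: Install of 455.23.05 failed, quitting
EOF

# Assumption: 256 is a wait status, so the exit status is the high byte.
echo "exit status if 256 is a wait status: $((256 >> 8))"
```

On the real machine the input would be /var/log/cuda-installer.log, and the next file to read would be /var/log/nvidia-installer.log.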
Then I tried the deb installation.
The download page gives these commands
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin
sudo mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/11.1.0/local_installers/cuda-repo-ubuntu1804-11-1-local_11.1.0-455.23.05-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu1804-11-1-local_11.1.0-455.23.05-1_amd64.deb
sudo apt-key add /var/cuda-repo-ubuntu1804-11-1-local/7fa2af80.pub
sudo apt-get update
sudo apt-get -y install cuda
The last command seems to be the one that fails; the others appear to run without problems. This is the output of the last command, sudo apt-get -y install cuda
Reading package lists... Done
Building dependency tree
Reading state information... Done
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:
The following packages have unmet dependencies:
cuda : Depends: cuda-11-1 (>= 11.1.0) but it is not going to be installed
E: Unable to correct problems, you have held broken packages.
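apt reports only the first level of the problem here. A common way to dig further is to ask apt to install the named dependency directly (here cuda-11-1), repeat with whatever it reports next, and check for held packages with apt-mark showhold. A sketch that extracts the package names to try next from apt's message; the function name is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: read apt's "unmet dependencies" message on stdin and
# print the name of each dependency that apt refused to install.
unmet_dependencies() {
  sed -n 's/.*Depends: \([^ ]*\).*but it is not going to be installed.*/\1/p'
}

# Message copied from the output quoted in this post.
unmet_dependencies <<'EOF'
The following packages have unmet dependencies:
 cuda : Depends: cuda-11-1 (>= 11.1.0) but it is not going to be installed
E: Unable to correct problems, you have held broken packages.
EOF
```

Each printed name can then be passed to sudo apt-get install to see which deeper dependency is actually blocking the installation.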
While troubleshooting the driver installation, I found that sudo apt install nvidia-450-dev
might work, so I tried it, and it did.
nvidia-smi
shows the following
Mon Oct 26 18:27:49 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.66 Driver Version: 450.66 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce RTX 208... Off | 00000000:19:00.0 Off | N/A |
| 22% 31C P8 1W / 250W | 6MiB / 11019MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 GeForce RTX 208... Off | 00000000:1A:00.0 Off | N/A |
| 22% 35C P8 4W / 250W | 6MiB / 11019MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 GeForce RTX 208... Off | 00000000:67:00.0 Off | N/A |
| 22% 37C P8 6W / 250W | 6MiB / 11019MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 GeForce RTX 208... Off | 00000000:68:00.0 Off | N/A |
| 22% 39C P8 1W / 250W | 26MiB / 11016MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1314 G /usr/lib/xorg/Xorg 4MiB |
| 1 N/A N/A 1314 G /usr/lib/xorg/Xorg 4MiB |
| 2 N/A N/A 1314 G /usr/lib/xorg/Xorg 4MiB |
| 3 N/A N/A 1314 G /usr/lib/xorg/Xorg 9MiB |
| 3 N/A N/A 1653 G /usr/bin/gnome-shell 14MiB |
+-----------------------------------------------------------------------------+
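However, this driver is for 11.0, not 11.1.
So I tried installing an older version of CUDA, 11.0 instead of 11.1.
This had only worked for the driver, not for the runtime API.
Running nvcc -V
gave "bash: /usr/bin/nvcc: No such file or directory"
I then tried installing 11.0, since the runtime API version should be lower than or equal to the version the driver supports.
From
https://developer.nvidia.com/cuda-11.0-download-archive
which gives the following commands,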
wget http://developer.download.nvidia.com/compute/cuda/11.0.2/local_installers/cuda_11.0.2_450.51.05_linux.run
sudo sh cuda_11.0.2_450.51.05_linux.run
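After downloading the installer, I ran sudo sh cuda_11.0.2_450.51.05_linux.run
First, I was warned again that an existing installation was present, probably left by the driver installation. I chose to continue, since I was installing only the toolkit and not the driver. I then selected everything except the driver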
CUDA Installer │
│ - [ ] Driver │
│ [ ] 450.51.05 │
│ + [X] CUDA Toolkit 11.0 │
│ [X] CUDA Samples 11.0 │
│ [X] CUDA Demo Suite 11.0 │
│ [X] CUDA Documentation 11.0 │
│ Options │
│ Install │
│ │
│ │
│
After the installation I got this message
===========
= Summary =
===========
Driver: Not Selected
Toolkit: Installed in /usr/local/cuda-11.0/
Samples: Installed in /home/santosh/, but missing recommended libraries
Please make sure that
- PATH includes /usr/local/cuda-11.0/bin
- LD_LIBRARY_PATH includes /usr/local/cuda-11.0/lib64, or, add /usr/local/cuda-11.0/lib64 to /etc/ld.so.conf and run ldconfig as root
To uninstall the CUDA Toolkit, run cuda-uninstaller in /usr/local/cuda-11.0/bin
Please see CUDA_Installation_Guide_Linux.pdf in /usr/local/cuda-11.0/doc/pdf for detailed information on setting up CUDA.
***WARNING: Incomplete installation! This installation did not install the CUDA Driver. A driver of version at least .00 is required for CUDA 11.0 functionality to work.
To install the driver using this installer, run the following command, replacing <CudaInstaller> with the name of this run file:
sudo <CudaInstaller>.run --silent --driver
Logfile is /var/log/cuda-installer.log
I added /usr/local/cuda-11.0/bin to PATH and set LD_LIBRARY_PATH to /usr/local/cuda-11.0/lib64
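Setting these variables in one terminal affects only that shell session, so a later "nvcc not found" can mean the change was not exported or not persisted (for example in ~/.bashrc). A minimal sketch of an idempotent way to set both variables; the function name is hypothetical, and the two directories are the ones named in the installer summary:

```shell
#!/bin/sh
# Hypothetical helper: prepend_dir DIR CURRENT prints CURRENT with DIR
# prepended, unless DIR is already one of its colon-separated entries.
prepend_dir() {
  case ":$2:" in
    *":$1:"*) printf '%s\n' "$2" ;;
    *) printf '%s\n' "$1${2:+:$2}" ;;
  esac
}

PATH=$(prepend_dir /usr/local/cuda-11.0/bin "$PATH")
LD_LIBRARY_PATH=$(prepend_dir /usr/local/cuda-11.0/lib64 "${LD_LIBRARY_PATH:-}")
export PATH LD_LIBRARY_PATH

# Show which nvcc, if any, the shell now resolves.
command -v nvcc || echo "nvcc is not on PATH"
```

Because the helper skips directories that are already present, the same lines can be placed in a shell startup file without growing PATH on every login.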
Then I tried the post-installation instructions at https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#power9-setup
systemctl status nvidia-persistenced
results in "Unit nvidia-persistenced.service could not be found."
sudo systemctl enable nvidia-persistenced
results in
The unit files have no installation config (WantedBy, RequiredBy, Also, Alias
settings in the [Install] section, and DefaultInstance for template units).
This means they are not meant to be enabled using systemctl.
Possible reasons for having this kind of units are:
1) A unit may be statically enabled by being symlinked from another unit's
.wants/ or .requires/ directory.
2) A unit's purpose may be to act as a helper for some other unit which has
a requirement dependency on it.
3) A unit may be started when needed via activation (socket, path, timer,
D-Bus, udev, scripted systemctl call, ...).
4) In case of template units, the unit is meant to be enabled with some
instance name specified.
I was able to follow the udev rules instructions without problems; I ran the following commands
sudo cp /lib/udev/rules.d/40-vm-hotadd.rules /etc/udev/rules.d
sudo sed -i '/SUBSYSTEM=="memory", ACTION=="add"/d' /etc/udev/rules.d/40-vm-hotadd.rules
I ran nvcc -V
just to check whether the installation had otherwise succeeded. This time I got this message
Command 'nvcc' not found, but can be installed with:
sudo apt install nvidia-cuda-toolkit
So I tried that command, and it seemed to install without problems. When I ran nvcc -V
again, I got this message
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130
This is the CUDA version I had originally.
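This result is consistent with two toolkits coexisting: the Ubuntu nvidia-cuda-toolkit package puts nvcc in /usr/bin, while the runfile installed 11.0 under /usr/local/cuda-11.0. A sketch for listing every nvcc in those two conventional locations; the function name is hypothetical, and the root prefix parameter exists only so the sketch can be tried on a scratch directory:

```shell
#!/bin/sh
# Hypothetical helper: list_nvcc ROOT prints every executable nvcc found in
# ROOT/usr/bin and in ROOT/usr/local/cuda*/bin.
list_nvcc() {
  for candidate in "$1"/usr/bin/nvcc "$1"/usr/local/cuda*/bin/nvcc; do
    if [ -x "$candidate" ]; then
      printf '%s\n' "$candidate"
    fi
  done
}

# Demonstrate on a scratch tree that mimics the layout described in this post.
root=$(mktemp -d)
mkdir -p "$root/usr/bin" "$root/usr/local/cuda-11.0/bin"
touch "$root/usr/bin/nvcc" "$root/usr/local/cuda-11.0/bin/nvcc"
chmod +x "$root/usr/bin/nvcc" "$root/usr/local/cuda-11.0/bin/nvcc"
list_nvcc "$root"
rm -rf "$root"
```

On the real system, list_nvcc "" scans /usr/bin and /usr/local/cuda*/bin; running each result with -V shows which toolkit version the shell picks up first and which ones are shadowed.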
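I saw this post
https://forums.developer.nvidia.com/t/cuda-10-installation-problems-on-ubuntu-18-04/68615
Follow the instructions in the Linux installation guide: https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html
Get the installer from http://www.nvidia.com/getcuda
Since you have already installed the wrong driver, please read the Linux installation guide carefully. Not following it carefully will lead to more trouble.
It seems that installing the driver and the toolkit by other means (with sudo apt install nvidia-450-dev
and sudo apt install nvidia-cuda-toolkit
) is not recommended, and that the guide should be followed strictly.
However, I followed the instructions and could not get the driver installed. Installing the driver does not seem to be impossible, since the alternative command somehow worked, but the error log gives me no idea how to install it the official way.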
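Answer 1
I solved the problem. The hardware came with CUDA installation files that I was not aware of. Once those files were blocked, the installation worked perfectly.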