Help running Docker with a GPU on Ubuntu 18.04.4 LTS

I am trying to follow https://www.tensorflow.org/install/gpu#ubuntu_1804_cuda_101 to get Docker working with a GPU on Ubuntu 18.04.4 LTS.

I'll copy the instructions here for reference:

# Add NVIDIA package repositories
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-repo-ubuntu1804_10.1.243-1_amd64.deb
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub
sudo dpkg -i cuda-repo-ubuntu1804_10.1.243-1_amd64.deb
sudo apt-get update
wget http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb
sudo apt install ./nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb
sudo apt-get update

# Install NVIDIA driver
sudo apt-get install --no-install-recommends nvidia-driver-430
# Reboot. Check that GPUs are visible using the command: nvidia-smi

# Install development and runtime libraries (~4GB)
sudo apt-get install --no-install-recommends \
    cuda-10-1 \
    libcudnn7=7.6.4.38-1+cuda10.1  \
    libcudnn7-dev=7.6.4.38-1+cuda10.1


# Install TensorRT. Requires that libcudnn7 is installed above.
sudo apt-get install -y --no-install-recommends libnvinfer6=6.0.1-1+cuda10.1 \
    libnvinfer-dev=6.0.1-1+cuda10.1 \
    libnvinfer-plugin6=6.0.1-1+cuda10.1

Partway through these steps I got an error:

$ sudo apt-get install --no-install-recommends nvidia-driver-430
Reading package lists... Done
Building dependency tree       
Reading state information... Done
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:

The following packages have unmet dependencies:
 nvidia-driver-430 : Depends: libnvidia-gl-430 (= 430.50-0ubuntu0.18.04.2) but it is not going to be installed
                     Depends: nvidia-dkms-430 (= 430.50-0ubuntu0.18.04.2)
                     Depends: nvidia-kernel-source-430 (= 430.50-0ubuntu0.18.04.2) but it is not going to be installed
                     Depends: libnvidia-decode-430 (= 430.50-0ubuntu0.18.04.2) but it is not going to be installed
                     Depends: libnvidia-encode-430 (= 430.50-0ubuntu0.18.04.2) but it is not going to be installed
                     Depends: nvidia-utils-430 (= 430.50-0ubuntu0.18.04.2) but it is not going to be installed
                     Depends: xserver-xorg-video-nvidia-430 (= 430.50-0ubuntu0.18.04.2) but it is not going to be installed
                     Depends: libnvidia-cfg1-430 (= 430.50-0ubuntu0.18.04.2) but it is not going to be installed
                     Depends: libnvidia-ifr1-430 (= 430.50-0ubuntu0.18.04.2) but it is not going to be installed
E: Unable to correct problems, you have held broken packages.

I noticed that I already have an NVIDIA driver installed, but it is not version 430. My apt list --installed output includes:

nvidia-384/bionic-updates,now 390.116-0ubuntu0.18.04.3 amd64 [installed,upgradable to: 418.87.01-0ubuntu1]
nvidia-384-dev/bionic-updates,now 390.116-0ubuntu0.18.04.3 amd64 [installed,upgradable to: 418.87.01-0ubuntu1]
nvidia-common/now 1:0.5.3~ppa3 amd64 [installed,local]
nvidia-compute-utils-390/bionic-updates,now 390.116-0ubuntu0.18.04.3 amd64 [installed]
nvidia-container-toolkit/bionic,now 1.0.5-1 amd64 [installed]
nvidia-dkms-390/bionic-updates,now 390.116-0ubuntu0.18.04.3 amd64 [installed]
nvidia-driver-390/bionic-updates,now 390.116-0ubuntu0.18.04.3 amd64 [installed]
nvidia-headless-390/bionic-updates,now 390.116-0ubuntu0.18.04.3 amd64 [installed]
nvidia-headless-no-dkms-390/bionic-updates,now 390.116-0ubuntu0.18.04.3 amd64 [installed]
nvidia-kernel-common-390/bionic-updates,now 390.116-0ubuntu0.18.04.3 amd64 [installed]
nvidia-kernel-source-390/bionic-updates,now 390.116-0ubuntu0.18.04.3 amd64 [installed]
nvidia-libopencl1-384/bionic-updates,now 390.116-0ubuntu0.18.04.3 amd64 [installed,upgradable to: 418.87.01-0ubuntu1]
nvidia-machine-learning-repo-ubuntu1804/unknown,now 1.0.0-1 amd64 [installed]
nvidia-opencl-icd-384/bionic-updates,now 390.116-0ubuntu0.18.04.3 amd64 [installed,upgradable to: 418.87.01-0ubuntu1]
nvidia-prime/now 0.8.9~ppa3 all [installed,local]
nvidia-settings/unknown,now 440.64.00-0ubuntu1 amd64 [installed]
nvidia-utils-390/bionic-updates,now 390.116-0ubuntu0.18.04.3 amd64 [installed]
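
The list above mixes packages from the 384 and 390 driver branches, which is worth knowing before installing a different branch. A minimal sketch for picking such packages out of apt list --installed output follows. The function name other_branches is made up for illustration, and it assumes the branch number is the trailing -NNN part of the package name:

```shell
#!/bin/sh
# other_branches WANT: read `apt list --installed` lines on stdin and print
# NVIDIA packages whose trailing branch number differs from WANT.
# (Hypothetical helper; assumes names like nvidia-driver-390 or nvidia-384.)
other_branches() {
    awk -F/ -v want="$1" '
        $1 ~ /nvidia/ && $1 ~ /-[0-9]+$/ {
            branch = $1
            sub(/.*-/, "", branch)   # keep only the text after the last hyphen
            if (branch != want) print $1
        }'
}

# Demo on three lines copied from the listing above.
printf '%s\n' \
    'nvidia-384/bionic-updates,now 390.116-0ubuntu0.18.04.3 amd64 [installed,upgradable to: 418.87.01-0ubuntu1]' \
    'nvidia-driver-390/bionic-updates,now 390.116-0ubuntu0.18.04.3 amd64 [installed]' \
    'nvidia-settings/unknown,now 440.64.00-0ubuntu1 amd64 [installed]' |
    other_branches 430
```

On a live system the input would come from apt list --installed 2>/dev/null | other_branches 430.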

Here is what currently works:

  • I can run nvidia-smi. It reports Driver Version: 390.116
  • I have Docker version 19.03.8, build afacb8b7f0
  • apt list --installed includes nvidia-container-toolkit/bionic,now 1.0.5-1 amd64 [installed], which I installed by following the instructions at https://github.com/NVIDIA/nvidia-docker
  • apt list --installed also includes cuda-repo-ubuntu1804/unknown,now 10.2.89-1 amd64 [installed]

Here is what does not work:

$ sudo docker run --gpus all nvidia/cuda:10.0-base nvidia-smi
docker: Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"process_linux.go:432: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: requirement error: unsatisfied condition: cuda>=10.0, please update your driver to a newer version, or use an earlier cuda container\\\\n\\\"\"": unknown.
ERRO[0000] error waiting for container: context canceled 

The error says I need cuda>=10.0, which is why I was trying to follow https://www.tensorflow.org/install/gpu#ubuntu_1804_cuda_101.

What should I do to get sudo docker run --gpus all nvidia/cuda:10.0-base nvidia-smi working?
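
The cuda>=10.0 condition is effectively a requirement on the host driver version. A minimal sketch of the same check run directly on the host follows. It assumes the minimum Linux driver listed in NVIDIA's CUDA release notes for CUDA 10.0 (410.48), and the helper name version_ge is made up for illustration:

```shell
#!/bin/sh
# version_ge A B: succeed when dotted version A is >= B (relies on GNU sort -V).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Assumed minimum Linux driver for CUDA 10.0, taken from NVIDIA's release notes.
min_driver=410.48

# Only query the GPU when nvidia-smi is actually installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    driver=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
    if version_ge "$driver" "$min_driver"; then
        echo "driver $driver is new enough for CUDA 10.0 images"
    else
        echo "driver $driver is older than $min_driver: upgrade it or use an older CUDA image"
    fi
fi
```

With the 390.116 driver reported above, the check would take the second branch, which matches the message from nvidia-container-cli.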


Edit: I noticed that https://github.com/NVIDIA/nvidia-docker/wiki/Frequently-Asked-Questions#how-do-i-install-the-nvidia-driver says to install the cuda-drivers package. I get this error when I try to install it:

$ sudo apt install cuda-drivers
Reading package lists... Done
Building dependency tree       
Reading state information... Done
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:

The following packages have unmet dependencies:
 cuda-drivers : Depends: libnvidia-encode-440 (>= 440.64.00) but it is not going to be installed
                Depends: libnvidia-fbc1-440 (>= 440.64.00) but it is not going to be installed
                Depends: libnvidia-ifr1-440 (>= 440.64.00) but it is not going to be installed
                Depends: nvidia-compute-utils-440 (>= 440.64.00) but it is not going to be installed
                Depends: nvidia-dkms-440 (>= 440.64.00)
                Depends: nvidia-driver-440 (>= 440.64.00) but it is not going to be installed
                Depends: xserver-xorg-video-nvidia-440 (>= 440.64.00) but it is not going to be installed
E: Unable to correct problems, you have held broken packages.
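
One way to dig into a "held broken packages" error is to list the unmet dependencies and then simulate installing each one, since the per-package simulation may name the actual conflict. A minimal sketch follows, assuming the apt output format shown above. unmet_deps is a hypothetical helper name:

```shell
#!/bin/sh
# unmet_deps: read apt-get error text on stdin and print the package name
# that follows each "Depends:" marker. (Hypothetical helper for illustration.)
unmet_deps() {
    awk '{ for (i = 1; i < NF; i++) if ($i == "Depends:") print $(i + 1) }'
}

# Demo on two lines copied from the error above.
printf '%s\n' \
    ' cuda-drivers : Depends: libnvidia-encode-440 (>= 440.64.00) but it is not going to be installed' \
    '                Depends: nvidia-dkms-440 (>= 440.64.00)' |
    unmet_deps
```

On the affected machine the same function could feed a simulation loop such as apt-get -s install cuda-drivers 2>&1 | unmet_deps | while read -r pkg; do apt-get -s install "$pkg"; done. The -s flag only simulates and changes nothing.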

Is my apt install error (E: Unable to correct problems, you have held broken packages.) related to the sources in my /etc/apt configuration?

$ rg "cuda" /etc/apt/sources.list.d
/etc/apt/sources.list.d/cuda.list.save
1:deb http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64 /

/etc/apt/sources.list.d/cuda.list
1:deb http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64 /

$ rg "nvidia" /etc/apt/sources.list.d
/etc/apt/sources.list.d/cuda.list.save
1:deb http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64 /

/etc/apt/sources.list.d/nvidia-machine-learning.list
1:deb http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64 /

/etc/apt/sources.list.d/cuda.list
1:deb http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64 /

/etc/apt/sources.list.d/nvidia-docker.list.save
1:deb https://nvidia.github.io/libnvidia-container/ubuntu18.04/$(ARCH) /
2:deb https://nvidia.github.io/nvidia-container-runtime/ubuntu18.04/$(ARCH) /
3:deb https://nvidia.github.io/nvidia-docker/ubuntu18.04/$(ARCH) /

/etc/apt/sources.list.d/nvidia-docker.list
1:deb https://nvidia.github.io/libnvidia-container/ubuntu18.04/$(ARCH) /
2:deb https://nvidia.github.io/nvidia-container-runtime/ubuntu18.04/$(ARCH) /
3:deb https://nvidia.github.io/nvidia-docker/ubuntu18.04/$(ARCH) /

Answer 1

I was able to get sudo docker run --gpus all nvidia/cuda:10.0-base nvidia-smi working.

I first had to run:

$ sudo apt-get install libnvidia-compute-430
Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following packages will be REMOVED:
  libnvidia-compute-390 libnvidia-decode-390 libnvidia-encode-390 nvidia-384 nvidia-384-dev nvidia-compute-utils-390 nvidia-driver-390 nvidia-headless-390
  nvidia-headless-no-dkms-390 nvidia-libopencl1-384 nvidia-opencl-icd-384 nvidia-utils-390
The following NEW packages will be installed:
  libnvidia-compute-430
0 upgraded, 1 newly installed, 12 to remove and 15 not upgraded.
Need to get 20.2 MB of archives.
After this operation, 13.0 MB of additional disk space will be used.

After that, I was able to run sudo apt-get install nvidia-driver-430 (the step from https://www.tensorflow.org/install/gpu#ubuntu_1804_cuda_101 where I was originally blocked).

nvidia-smi now reports NVIDIA-SMI 430.50 Driver Version: 430.50 CUDA Version: 10.1
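
The sequence described in this answer can be sketched as a script. This is only a recap under assumptions: the run wrapper and the DRY_RUN variable are made up for illustration, and the script defaults to printing the commands rather than executing them. A reboot may be needed between the two functions, as the TensorFlow instructions quoted in the question call for one after the driver install:

```shell
#!/bin/sh
# run CMD...: print the command when DRY_RUN=1 (the default), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Installing the 430 compute library first is what removed the 384/390 packages;
# the driver metapackage could then be installed.
install_430() {
    run sudo apt-get install libnvidia-compute-430 &&
        run sudo apt-get install nvidia-driver-430
}

# Checks on the host and inside a container, using the commands from the question.
verify_gpu() {
    run nvidia-smi &&
        run sudo docker run --gpus all nvidia/cuda:10.0-base nvidia-smi
}

install_430
verify_gpu
```

Running it as-is only prints the four commands. Setting DRY_RUN=0 executes them, in which case it is safer to call install_430 and verify_gpu separately.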
