Docker only works with the NVIDIA driver after being reinstalled

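Ubuntu version: 20.04 LTS

The NVIDIA driver and related packages (such as CUDA) are installed correctly. nvidia-smi and CUDA code both run fine.

The NVIDIA packages for Docker (NVIDIA Container Toolkit) are installed as well. The original problem was that when I tried to verify NVIDIA support inside Docker, I got the following error message: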

$ sudo docker run --gpus all nvidia/cuda:10.0-base nvidia-smi
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].

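After reading some online discussions, I tried reinstalling Docker by following the instructions here: https://docs.docker.com/engine/install/ubuntu/ That worked for me, and NVIDIA now works under Docker.

However, it stops working after a reboot. I then have to run the following: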

$ sudo apt-get reinstall docker-ce docker-ce-cli containerd.io

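to get NVIDIA working under Docker again. I can confirm that this happens after every reboot.

How can I make it work so that I don't have to reinstall after every reboot?

Answer 1

In my case, I had installed Docker twice: once via snap and once via the apt package manager.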

After a reboot I had:

$ docker images
REPOSITORY              TAG                  IMAGE ID            CREATED             SIZE
ubuntu                  latest               4e2eef94cd6b        3 weeks ago         73.9MB
tensorflow/tensorflow   latest-gpu-jupyter   f0b0261fec71        6 weeks ago         3.3GB
nvidia/cuda             10.0-base            841d44dd4b3c        9 months ago        110MB

If I restarted the docker service:

$ sudo service docker restart

I got a different set of images:

$ docker images
REPOSITORY              TAG                  IMAGE ID            CREATED             SIZE
jupyter/r-notebook      latest               14611e3d9838        2 weeks ago         2.59GB
ubuntu                  latest               4e2eef94cd6b        3 weeks ago         73.9MB
tensorflow/tensorflow   latest-gpu-jupyter   f0b0261fec71        6 weeks ago         3.3GB

$ dpkg -l | grep docker
ii  docker-ce                                  5:19.03.12~3-0~ubuntu-focal           amd64        Docker: the open-source application container engine
ii  docker-ce-cli                              5:19.03.12~3-0~ubuntu-focal           amd64        Docker CLI: the open-source application container engine

$ snap list | grep docker
docker     19.03.11     471    latest/stable  canonical*          -    
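
The two listings above can be combined into a single check. The following is a minimal sketch and not part of the original answer: the function name `has_duplicate_docker` is hypothetical, and the grep patterns assume the default column layout of `snap list` and `dpkg -l` shown above.

```shell
#!/bin/sh
# has_duplicate_docker SNAP_LISTING DPKG_LISTING
# Returns 0 (true) only when Docker shows up in BOTH listings,
# i.e. a snap named "docker" and an installed docker-ce package.
has_duplicate_docker() {
    snap_listing=$1
    dpkg_listing=$2
    # snap list: the package name is the first column
    printf '%s\n' "$snap_listing" | grep -q '^docker ' || return 1
    # dpkg -l: "ii" marks an installed package
    printf '%s\n' "$dpkg_listing" | grep -q '^ii  *docker-ce ' || return 1
    return 0
}

# Each command is guarded so the script still runs when a tool is missing.
snap_out=$(snap list 2>/dev/null || true)
dpkg_out=$(dpkg -l 2>/dev/null || true)

if has_duplicate_docker "$snap_out" "$dpkg_out"; then
    echo "Docker is installed from both snap and apt"
else
    echo "No snap/apt duplicate detected"
fi
```

If the first message is printed, the machine is likely in the same situation as described in this answer, and one of the two installations should probably be removed.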

I rebooted the operating system:

$ sudo init 6

I deleted all the images that had been created through the snap Docker:

$ docker rmi $(docker images -q)

After that I removed the snap Docker:

$ sudo snap remove docker
$ sudo init 6
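
After the snap has been removed, it may be worth confirming which docker binary the shell resolves to. This is a rough sketch, assuming snap exposes its commands under /snap (the default on Ubuntu); `docker_origin` is a hypothetical helper name, not something from the original answer.

```shell
#!/bin/sh
# docker_origin PATH_TO_BINARY
# Prints a guess at which installation a docker binary belongs to.
# The mapping is an assumption based on default install locations.
docker_origin() {
    case "$1" in
        /snap/*) echo "snap" ;;   # snap-provided command
        "")      echo "none" ;;   # nothing found on PATH
        *)       echo "other" ;;  # e.g. /usr/bin/docker from the apt package
    esac
}

docker_path=$(command -v docker || true)
echo "docker resolves to: ${docker_path:-<not found>} ($(docker_origin "$docker_path"))"

# If the origin is not "snap", the GPU check from the question can be rerun:
#   sudo docker run --gpus all nvidia/cuda:10.0-base nvidia-smi
```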

Now I have a properly working docker service:

$ docker run --gpus all -p 8888:8888 -v /tf:/tf -w /tf --name tfgpu --rm tensorflow/tensorflow:latest-gpu-jupyter
[I 07:52:52.707 NotebookApp] Writing notebook server cookie secret to /root/.local/share/jupyter/runtime/notebook_cookie_secret
[I 07:52:52.967 NotebookApp] Serving notebooks from local directory: /tf
[I 07:52:52.967 NotebookApp] The Jupyter Notebook is running at:
[I 07:52:52.967 NotebookApp] http://a1d1932a7004:8888/?token=74b0b061e2a1818b865c1f344be904758f9f0dba73b742d3
[I 07:52:52.967 NotebookApp]  or http://127.0.0.1:8888/?token=74b0b061e2a1818b865c1f344be904758f9f0dba73b742d3
[I 07:52:52.967 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 07:52:52.972 NotebookApp] 

    To access the notebook, open this file in a browser:
        file:///root/.local/share/jupyter/runtime/nbserver-1-open.html
    Or copy and paste one of these URLs:
        http://a1d1932a7004:8888/?token=74b0b061e2a1818b865c1f344be904758f9f0dba73b742d3
     or http://127.0.0.1:8888/?token=74b0b061e2a1818b865c1f344be904758f9f0dba73b742d3
