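Ubuntu version: 20.04 LTS
The NVIDIA driver and related packages (such as cuda) are all installed correctly. nvidia-smi and cuda code run fine.
The NVIDIA packages for Docker (NVIDIA Container Toolkit) are installed as well. The original problem was that when I tried to verify NVIDIA support inside docker, I got the following error message: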
$ sudo docker run --gpus all nvidia/cuda:10.0-base nvidia-smi
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
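After finding some online discussions, I tried reinstalling docker by following the instructions here: https://docs.docker.com/engine/install/ubuntu/ That worked for me, and NVIDIA now works under docker.
However, it stops working after a reboot. I then have to run the following: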
$ sudo apt-get reinstall docker-ce docker-ce-cli containerd.io
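to get NVIDIA working under docker again. I can confirm that this happens after every reboot.
How can I make this work so that I don't have to reinstall after every reboot?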
Answer 1
In my case, I had installed docker twice, once via snap and once via the apt package manager.
After a reboot I had:
$ docker images
REPOSITORY              TAG                  IMAGE ID       CREATED        SIZE
ubuntu                  latest               4e2eef94cd6b   3 weeks ago    73.9MB
tensorflow/tensorflow   latest-gpu-jupyter   f0b0261fec71   6 weeks ago    3.3GB
nvidia/cuda             10.0-base            841d44dd4b3c   9 months ago   110MB
If I restarted the docker service:
$ sudo service docker restart
I got a different set of images, which means a different docker daemon was now answering:
$ docker images
REPOSITORY              TAG                  IMAGE ID       CREATED       SIZE
jupyter/r-notebook      latest               14611e3d9838   2 weeks ago   2.59GB
ubuntu                  latest               4e2eef94cd6b   3 weeks ago   73.9MB
tensorflow/tensorflow   latest-gpu-jupyter   f0b0261fec71   6 weeks ago   3.3GB
$ dpkg -l | grep docker
ii docker-ce 5:19.03.12~3-0~ubuntu-focal amd64 Docker: the open-source application container engine
ii docker-ce-cli 5:19.03.12~3-0~ubuntu-focal amd64 Docker CLI: the open-source application container engine
$ snap list | grep docker
docker 19.03.11 471 latest/stable canonical* -
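The two listings above can be folded into one check. The following is a minimal sketch, not part of the original answer: the function names `has_apt_docker`, `has_snap_docker`, and `report_docker_sources` are made up for illustration, and it assumes the apt package is named `docker-ce` and the snap is named `docker`, as in the output shown here.

```shell
#!/usr/bin/env bash
# Sketch: report whether Docker is installed through apt, snap, or both.
# Function names are hypothetical; package names are taken from the listings above.

# Reads `dpkg -l` output on stdin; succeeds if docker-ce is installed ("ii").
has_apt_docker() {
  awk '$1 == "ii" && $2 == "docker-ce" { found = 1 } END { exit !found }'
}

# Reads `snap list` output on stdin; succeeds if a snap named "docker" is listed.
has_snap_docker() {
  awk '$1 == "docker" { found = 1 } END { exit !found }'
}

report_docker_sources() {
  local apt_listing snap_listing count=0
  # Errors are suppressed so the check also works where dpkg or snap is absent.
  apt_listing=$(dpkg -l 2>/dev/null || true)
  snap_listing=$(snap list 2>/dev/null || true)

  if printf '%s\n' "$apt_listing" | has_apt_docker; then
    echo "docker is installed via apt (docker-ce)"
    count=$((count + 1))
  fi
  if printf '%s\n' "$snap_listing" | has_snap_docker; then
    echo "docker is installed via snap"
    count=$((count + 1))
  fi
  if [ "$count" -gt 1 ]; then
    echo "WARNING: two docker installations found; they may take turns after a reboot"
  fi
}

report_docker_sources
```

If the warning is printed, the machine is in the same state as described in this answer, and one of the two installations should probably be removed.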
I rebooted the operating system:
$ sudo init 6
I removed all images that had been created through the snap docker:
$ docker rmi $(docker images -q)
After that, I removed the snap docker and rebooted again:
$ sudo snap remove docker
$ sudo init 6
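To confirm which daemon the CLI is talking to after a reboot, the daemon's data root can be inspected. This is a rough sketch under stated assumptions: `classify_docker_root` is a hypothetical helper, and the paths it matches (`/var/lib/docker` for the apt package, a directory under `/var/snap/docker/` for the snap) are the usual defaults, not something reported in this thread. `docker info --format '{{.DockerRootDir}}'` is the standard way to print the data root.

```shell
#!/usr/bin/env bash
# Sketch: guess which installation a Docker data root belongs to.
# The path patterns are assumed defaults and may differ on a customized system.
classify_docker_root() {
  case "$1" in
    /var/snap/docker/*)                echo "snap" ;;
    /var/lib/docker|/var/lib/docker/*) echo "apt" ;;
    *)                                 echo "unknown" ;;
  esac
}

# Ask the daemon currently reachable by the CLI for its data root.
if command -v docker >/dev/null 2>&1; then
  root=$(docker info --format '{{.DockerRootDir}}' 2>/dev/null)
  echo "data root: ${root:-<daemon not reachable>} ($(classify_docker_root "$root"))"
fi
```

After the snap has been removed, the result should stay the same across reboots and service restarts.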
Now I have a docker service that works normally, including GPU support:
$ docker run --gpus all -p 8888:8888 -v /tf:/tf -w /tf --name tfgpu --rm tensorflow/tensorflow:latest-gpu-jupyter
[I 07:52:52.707 NotebookApp] Writing notebook server cookie secret to /root/.local/share/jupyter/runtime/notebook_cookie_secret
[I 07:52:52.967 NotebookApp] Serving notebooks from local directory: /tf
[I 07:52:52.967 NotebookApp] The Jupyter Notebook is running at:
[I 07:52:52.967 NotebookApp] http://a1d1932a7004:8888/?token=74b0b061e2a1818b865c1f344be904758f9f0dba73b742d3
[I 07:52:52.967 NotebookApp] or http://127.0.0.1:8888/?token=74b0b061e2a1818b865c1f344be904758f9f0dba73b742d3
[I 07:52:52.967 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 07:52:52.972 NotebookApp]
To access the notebook, open this file in a browser:
file:///root/.local/share/jupyter/runtime/nbserver-1-open.html
Or copy and paste one of these URLs:
http://a1d1932a7004:8888/?token=74b0b061e2a1818b865c1f344be904758f9f0dba73b742d3
or http://127.0.0.1:8888/?token=74b0b061e2a1818b865c1f344be904758f9f0dba73b742d3