NVIDIA Docker: Ubuntu images cannot detect the GPU

2024-6-5

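With the Ubuntu-based NVIDIA Docker images, the container cannot detect the GPU, but the Red Hat-based container can. Why? I followed the official installation guide and used the official Docker images. Should I ask NVIDIA about this?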

Environment

Ubuntu Desktop 22.04 LTS
Docker 20.10.21
GPU: RTX 2080
Driver: nvidia-driver-510
CUDA is not installed on the host OS

Commands

# Ubuntu cuda11.8
$ docker run --gpus all -it --rm nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04 /bin/bash
$ nvidia-smi
$ nvcc -V
bash: nvcc: command not found

$ apt-get update
$ apt-get install -y python3 python3-pip
$ pip3 install torch torchvision
$ python3
Python 3.10.6 (main, Aug 10 2022, 11:40:04) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> print(torch.cuda.is_available())
/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:88: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:109.)
  return torch._C._cuda_getDeviceCount() > 0
False

# Ubuntu cuda11.6
$ docker run --gpus all -it --rm nvidia/cuda:11.6.1-cudnn8-runtime-ubuntu20.04 /bin/bash
$ nvidia-smi
$ nvcc -V
bash: nvcc: command not found


# Redhat cuda11.6
$ docker run --gpus all -it --rm nvidia/cuda:11.6.1-cudnn8-devel-ubi8 /bin/bash
$ nvidia-smi
$ nvcc -V
$ yum install python38
$ curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
$ python3 get-pip.py
$ pip install torch torchvision
$ python3

Python 3.8.12 (default, Sep 16 2021, 10:46:05) 
[GCC 8.5.0 20210514 (Red Hat 8.5.0-3)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> print(torch.cuda.is_available())
True
>>>

References

Answer 1

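You are using different image variants for Ubuntu and Red Hat. On Red Hat you are running the devel image (see the image name), which includes development tools such as nvcc. Those tools are not included in the runtime image, which is why you get the "command not found" error there. I also think the NVIDIA Container Toolkit needs to be installed on the host.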
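
To make the devel/runtime distinction concrete, here is a minimal Python sketch that reads the variant out of an image name. It assumes the tag layout seen in the commands above (for example 11.8.0-cudnn8-runtime-ubuntu22.04), and the helper names image_flavor and expects_nvcc are hypothetical, made up for this illustration. It only inspects the name; it does not look inside the image.

```python
def image_flavor(image: str) -> str:
    """Guess the variant of an nvidia/cuda image from its tag.

    Assumption: tags are dash-separated and contain the word
    'devel' or 'runtime', as in the commands in the question.
    Anything else is reported as 'base'.
    """
    # Take the part after the last ':' (the tag) and split it on dashes.
    tag = image.rsplit(":", 1)[-1]
    parts = tag.split("-")
    for flavor in ("devel", "runtime"):
        if flavor in parts:
            return flavor
    return "base"


def expects_nvcc(image: str) -> bool:
    """Per the answer above, only the devel variant ships nvcc."""
    return image_flavor(image) == "devel"


if __name__ == "__main__":
    for name in (
        "nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04",
        "nvidia/cuda:11.6.1-cudnn8-runtime-ubuntu20.04",
        "nvidia/cuda:11.6.1-cudnn8-devel-ubi8",
    ):
        print(name, image_flavor(name), expects_nvcc(name))
```

For the three images in the question, this reports the two Ubuntu images as runtime and the ubi8 image as devel, which matches where nvcc was and was not found.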
