CUDA in a Docker image breaks and requires sudo apt upgrade

2024-6-11

Question

Why is sudo apt upgrade on the host OS required to make CUDA work inside the Docker container? The problem does not occur without Docker; it only occurs when the Docker image is recreated.

Environment

Ubuntu 22.04 LTS
Docker version 26.0.1, build d260a54

Dockerfile

#--------------------------------------------------------------------------------
# Dockerfile to build the base image with requirements and models downloaded.
#
# CUDA 11.7 and PyTorch 1.13.1 are used due to the deepdoctection requirements.
# https://github.com/deepdoctection/deepdoctection#requirements
# Pytorch that satisfies 1.12 <= PyTorch < 2.0 is 1.13.1.
# https://pytorch.org/get-started/previous-versions/#v1130
#--------------------------------------------------------------------------------
FROM nvidia/cuda:11.7.1-devel-ubuntu22.04

# Create working directory
WORKDIR /home/eml

# Copy under code/python
COPY . .

# Note: every RUN command creates an image layer, increasing the image size.
#--------------------------------------------------------------------------------
# Ubuntu libs and Timezone (https://serverfault.com/q/949991).
#
# [deepdoctection dependency]
# - poppler
# https://pdf2image.readthedocs.io/en/latest/installation.html#installing-poppler
# - tesseract-ocr
# - qpdf for encrypted pdf. See AIML-130.
#--------------------------------------------------------------------------------
ARG DEBIAN_FRONTEND=noninteractive
ENV TZ=Australia/Sydney
RUN apt -y update &&  \
    apt install -y tzdata \
    software-properties-common git cmake wget pkg-config tree ffmpeg libsm6 libxext6  \
    tesseract-ocr libtesseract-dev tesseract-ocr-eng poppler-utils qpdf jq gpustat \
|| exit 1

#--------------------------------------------------------------------------------
# Py3.10 libs
# https://launchpad.net/~deadsnakes/+archive/ubuntu/ppa
# https://askubuntu.com/a/1398569
# https://www.youtube.com/watch?v=Xe40amojaXE
#--------------------------------------------------------------------------------
RUN add-apt-repository --yes ppa:deadsnakes/ppa && \
  apt install -y python3.10 python3-pip build-essential libssl-dev libffi-dev python3-venv \
|| exit 1

#--------------------------------------------------------------------------------
# Pytorch/CUDA
# https://pytorch.org/get-started/previous-versions/#linux-and-windows-9
#--------------------------------------------------------------------------------
RUN pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 torchaudio==0.13.1 \
  --extra-index-url https://download.pytorch.org/whl/cu117

#--------------------------------------------------------------------------------
# Group/User
#--------------------------------------------------------------------------------
#RUN groupadd -g 2000 eml && \
#    useradd -rm -d /home/eml -s /bin/bash -g eml -u 2001 eml && \
#    chown -R eml:eml /home/eml

#--------------------------------------------------------------------------------
# Non-root user
# Causes issues, e.g.
# - mounted volume access check with os/pathlib does not work.
# - torch.cuda.is_available() becomes False.
# Need to research how to handle file permissions and GPU access with a
# non-root docker user.
#--------------------------------------------------------------------------------
# USER eml
# ENV PATH="${PATH}:${HOME}/.local/bin"

#--------------------------------------------------------------------------------
# Packages
#--------------------------------------------------------------------------------
RUN pip install -r ./requirements.txt && \
    python3 -m spacy download en_core_web_trf && \
    python3 -m nltk.downloader words && \
    python3 -m nltk.downloader wordnet && \
    huggingface-cli download sentence-transformers/gtr-t5-large \
|| exit 1

#--------------------------------------------------------------------------------
# Run the application
# https://stackoverflow.com/a/46245972/4281353
# >  if you have a docker image where your script is the ENTRYPOINT, any arguments
# > you pass to the docker run command will be added to the entrypoint.
# > ```
# > docker run --rm <yourImageName> -a API_KEY -f FILENAME -o ORG_ID
# > ```
#--------------------------------------------------------------------------------
# Executable to run by this container is always Python3
ENTRYPOINT ["python3"]

Problem

When the Docker image is recreated, PyTorch cannot detect CUDA until sudo apt upgrade -y and a reboot have been completed.

  File "/usr/local/lib/python3.10/dist-packages/torch/storage.py", line 240, in _load_from_bytes
    return torch.load(io.BytesIO(b))
  File "/usr/local/lib/python3.10/dist-packages/torch/serialization.py", line 795, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/usr/local/lib/python3.10/dist-packages/torch/serialization.py", line 1012, in _legacy_load
    result = unpickler.load()
  File "/usr/local/lib/python3.10/dist-packages/torch/serialization.py", line 958, in persistent_load
    wrap_storage=restore_location(obj, location),
  File "/usr/local/lib/python3.10/dist-packages/torch/serialization.py", line 215, in default_restore_location
    result = fn(storage, location)
  File "/usr/local/lib/python3.10/dist-packages/torch/serialization.py", line 182, in _cuda_deserialize
    device = validate_cuda_device(location)
  File "/usr/local/lib/python3.10/dist-packages/torch/serialization.py", line 166, in validate_cuda_device
    raise RuntimeError('Attempting to deserialize object on a CUDA '
RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
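
The traceback above only surfaces once a model is deserialized. A minimal preflight sketch, assuming it runs inside the container before any model is loaded, could report what PyTorch sees up front. The function name cuda_preflight is hypothetical and not part of the original project; the torch attributes it reads (torch.version.cuda, torch.cuda.is_available(), torch.cuda.device_count()) are standard PyTorch calls.

```python
import importlib.util


def cuda_preflight():
    """Return a small report of what PyTorch can see.

    Hypothetical helper for illustration. Returns a one-key dict when
    torch is not installed instead of raising ImportError.
    """
    if importlib.util.find_spec("torch") is None:
        return {"torch_installed": False}

    import torch  # imported lazily so the script also runs where torch is missing

    return {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # CUDA version the wheel was built against (None for CPU-only builds)
        "built_for_cuda": torch.version.cuda,
        # False when the driver is not usable from inside the container
        "cuda_available": torch.cuda.is_available(),
        "device_count": torch.cuda.device_count(),
    }


if __name__ == "__main__":
    for key, value in cuda_preflight().items():
        print(f"{key}: {value}")
```

If built_for_cuda shows a CUDA version while cuda_available is False, the PyTorch build itself is probably fine and the container most likely cannot reach a working driver on the host.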

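It looks like a CUDA or NVIDIA driver change in the apt repository is causing the problem, because it leads to an incompatibility or mismatch between the NVIDIA driver on the host OS and the CUDA toolkit inside the Docker container. But why?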
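
One way to investigate the suspected host-side mismatch is to record the version of the loaded NVIDIA kernel module and the driver version reported by nvidia-smi, before and after the upgrade and reboot. The following is a rough sketch, assuming a Linux host where the driver exposes /proc/driver/nvidia/version and nvidia-smi is on PATH. The function names are illustrative, and the check does not by itself confirm the cause.

```python
import re
import shutil
import subprocess
from pathlib import Path

# Assumption: standard location used by the NVIDIA Linux driver.
PROC_VERSION = Path("/proc/driver/nvidia/version")


def parse_kernel_module_version(text):
    """Extract the dotted version after 'Kernel Module', or None if absent."""
    match = re.search(r"Kernel Module.*?\s(\d+\.\d+(?:\.\d+)*)", text)
    return match.group(1) if match else None


def loaded_kernel_module_version():
    """Version of the kernel module currently loaded, or None if not found."""
    if not PROC_VERSION.exists():
        return None
    return parse_kernel_module_version(PROC_VERSION.read_text())


def nvidia_smi_driver_version():
    """Driver version reported by nvidia-smi, or None if the call fails."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True, text=True, check=False,
    )
    if result.returncode != 0:
        return None  # nvidia-smi itself could not talk to the driver
    lines = result.stdout.strip().splitlines()
    return lines[0].strip() if lines else None


if __name__ == "__main__":
    print("kernel module:", loaded_kernel_module_version())
    print("nvidia-smi   :", nvidia_smi_driver_version())
```

If the kernel module version is present but nvidia-smi fails, or the values differ between the broken and working states, that would point toward driver packages on the host having changed without the running kernel module being reloaded.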
