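Question
Why is sudo apt upgrade on the host operating system required to make CUDA work inside a Docker container? The problem does not occur without Docker; it occurs only after the Docker image is rebuilt.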
Environment
Ubuntu 22.04 LTS
Docker version 26.0.1, build d260a54
Dockerfile
#--------------------------------------------------------------------------------
# Dockerfile to build the base image with requirements and models downloaded.
#
# CUDA is 11.7 and PyTorch is 1.13.1 due to the deepdoctection requirements.
# https://github.com/deepdoctection/deepdoctection#requirements
# Pytorch that satisfies 1.12 <= PyTorch < 2.0 is 1.13.1.
# https://pytorch.org/get-started/previous-versions/#v1130
#--------------------------------------------------------------------------------
FROM nvidia/cuda:11.7.1-devel-ubuntu22.04
# Create working directory
WORKDIR /home/eml
# Copy under code/python
COPY . .
# Note: every RUN command creates an image layer, increasing the image size.
#--------------------------------------------------------------------------------
# Ubuntu libs and Timezone (https://serverfault.com/q/949991).
#
# [deepdoctection dependency]
# - poppler
# https://pdf2image.readthedocs.io/en/latest/installation.html#installing-poppler
# - tesseract-ocr
# - qpdf for encrypted pdf. See AIML-130.
#--------------------------------------------------------------------------------
ARG DEBIAN_FRONTEND=noninteractive
ENV TZ=Australia/Sydney
RUN apt -y update && \
apt install -y tzdata \
software-properties-common git cmake wget pkg-config tree ffmpeg libsm6 libxext6 \
tesseract-ocr libtesseract-dev tesseract-ocr-eng poppler-utils qpdf jq gpustat \
|| exit 1
#--------------------------------------------------------------------------------
# Py3.10 libs
# https://launchpad.net/~deadsnakes/+archive/ubuntu/ppa
# https://askubuntu.com/a/1398569
# https://www.youtube.com/watch?v=Xe40amojaXE
#--------------------------------------------------------------------------------
RUN add-apt-repository --yes ppa:deadsnakes/ppa && \
apt install -y python3.10 python3-pip build-essential libssl-dev libffi-dev python3-venv \
|| exit 1
#--------------------------------------------------------------------------------
# Pytorch/CUDA
# https://pytorch.org/get-started/previous-versions/#linux-and-windows-9
#--------------------------------------------------------------------------------
RUN pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 torchaudio==0.13.1 \
--extra-index-url https://download.pytorch.org/whl/cu117
#--------------------------------------------------------------------------------
# Group/User
#--------------------------------------------------------------------------------
#RUN groupadd -g 2000 eml && \
# useradd -rm -d /home/eml -s /bin/bash -g eml -u 2001 eml && \
# chown -R eml:eml /home/eml
#--------------------------------------------------------------------------------
# Non-root user
# Causes issues, e.g.
# - mounted volume access check with os/pathlib does not work.
# - torch.cuda.is_available() becomes False.
# Need to research how to use a non-root docker user with file permissions
# and the GPU.
#--------------------------------------------------------------------------------
# USER eml
# ENV PATH="${PATH}:${HOME}/.local/bin"
#--------------------------------------------------------------------------------
# Packages
#--------------------------------------------------------------------------------
RUN pip install -r ./requirements.txt && \
python3 -m spacy download en_core_web_trf && \
python3 -m nltk.downloader words && \
python3 -m nltk.downloader wordnet && \
huggingface-cli download sentence-transformers/gtr-t5-large \
|| exit 1
#--------------------------------------------------------------------------------
# Run the application
# https://stackoverflow.com/a/46245972/4281353
# > if you have a docker image where your script is the ENTRYPOINT, any arguments
# > you pass to the docker run command will be added to the entrypoint.
# > ```
# > docker run --rm <yourImageName> -a API_KEY -f FILENAME -o ORG_ID
# > ```
#--------------------------------------------------------------------------------
# Executable to run by this container is always Python3
ENTRYPOINT ["python3"]
Problem
When the Docker image is rebuilt, PyTorch cannot detect CUDA until sudo apt upgrade -y has been run on the host and the host has been rebooted.
File "/usr/local/lib/python3.10/dist-packages/torch/storage.py", line 240, in _load_from_bytes
return torch.load(io.BytesIO(b))
File "/usr/local/lib/python3.10/dist-packages/torch/serialization.py", line 795, in load
return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
File "/usr/local/lib/python3.10/dist-packages/torch/serialization.py", line 1012, in _legacy_load
result = unpickler.load()
File "/usr/local/lib/python3.10/dist-packages/torch/serialization.py", line 958, in persistent_load
wrap_storage=restore_location(obj, location),
File "/usr/local/lib/python3.10/dist-packages/torch/serialization.py", line 215, in default_restore_location
result = fn(storage, location)
File "/usr/local/lib/python3.10/dist-packages/torch/serialization.py", line 182, in _cuda_deserialize
device = validate_cuda_device(location)
File "/usr/local/lib/python3.10/dist-packages/torch/serialization.py", line 166, in validate_cuda_device
raise RuntimeError('Attempting to deserialize object on a CUDA '
RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
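One way to narrow this down from inside the container is to record what PyTorch itself sees before any model is loaded. The following is a minimal sketch and not part of the original setup. The helper names cuda_report and map_location_for are hypothetical. It relies only on torch.version.cuda, torch.cuda.is_available(), and torch.cuda.device_count(), and it returns the CPU fallback for map_location that the error message itself suggests.

```python
import importlib.util


def cuda_report() -> dict:
    """Collect what PyTorch reports about CUDA (hypothetical helper)."""
    report = {
        "torch_installed": False,
        "torch_cuda_build": None,   # CUDA version the torch wheel was built for
        "cuda_available": False,
        "device_count": 0,
    }
    # Degrade gracefully when torch is not installed in this interpreter.
    if importlib.util.find_spec("torch") is None:
        return report

    import torch

    report["torch_installed"] = True
    report["torch_cuda_build"] = torch.version.cuda
    report["cuda_available"] = torch.cuda.is_available()
    report["device_count"] = torch.cuda.device_count()
    return report


def map_location_for(report: dict) -> str:
    """Choose a map_location value for torch.load from the report."""
    return "cuda" if report["cuda_available"] else "cpu"


if __name__ == "__main__":
    info = cuda_report()
    print(info)
    print("map_location:", map_location_for(info))
```

Running this at container start shows whether the failure is already visible before deserialization. A checkpoint could then be loaded with torch.load(path, map_location=map_location_for(info)), although that only works around the symptom on the CPU and does not explain why the GPU is not detected.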
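It looks like a CUDA or NVIDIA driver change in the apt repository causes the problem by introducing an incompatibility or version skew between the NVIDIA driver on the host OS and the CUDA toolkit inside the Docker container, but why?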
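To test the version-skew suspicion on the host, the driver version reported by nvidia-smi can be compared with the minimum the container's CUDA toolkit needs. The sketch below rests on stated assumptions: the function names are hypothetical, and the threshold 450.80.02 is assumed to be the minimum Linux driver that NVIDIA lists for CUDA 11.x minor-version compatibility. Check it against the CUDA release notes before relying on it.

```python
import shutil
import subprocess
from typing import Optional, Tuple

Version = Tuple[int, ...]

# ASSUMPTION: minimum Linux driver for CUDA 11.x minor-version compatibility.
# Verify this value against NVIDIA's CUDA release notes.
MIN_DRIVER_FOR_CUDA_11 = (450, 80, 2)


def parse_version(text: str) -> Version:
    """Turn a dotted version such as '515.48.07' into (515, 48, 7)."""
    return tuple(int(part) for part in text.strip().split("."))


def host_driver_version() -> Optional[Version]:
    """Ask nvidia-smi for the driver version; None if it cannot be read."""
    if shutil.which("nvidia-smi") is None:
        return None
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
            capture_output=True, text=True, check=False, timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    lines = result.stdout.strip().splitlines()
    if result.returncode != 0 or not lines:
        return None
    try:
        return parse_version(lines[0])  # first GPU is enough for this check
    except ValueError:
        return None


def driver_is_new_enough(driver: Version,
                         minimum: Version = MIN_DRIVER_FOR_CUDA_11) -> bool:
    """Compare versions component by component via tuple ordering."""
    return driver >= minimum


if __name__ == "__main__":
    version = host_driver_version()
    if version is None:
        print("Could not read the driver version from nvidia-smi.")
    else:
        verdict = "OK" if driver_is_new_enough(version) else "too old"
        print(".".join(map(str, version)), verdict)
```

If nvidia-smi is installed but the function returns None, nvidia-smi itself failed on the host. That would point at the host driver stack, not at the image.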