Although I have 4 GPUs with 24GB of VRAM each, Docker fails to start a container with the command below. How can I fix this?
[20:08:28] jalal@echo:~/research/code$ docker run --shm-size 2GB -it --gpus all docurdt/heal
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
ERRO[0000] error waiting for container: context canceled
[20:08:20] jalal@echo:~/research/code$ nvidia-smi
Fri Apr  1 20:08:28 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| 31%   41C    P8    23W / 350W |    301MiB / 24576MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:21:00.0 Off |                  N/A |
| 30%   39C    P8    18W / 350W |     14MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce ...  Off  | 00000000:4A:00.0 Off |                  N/A |
| 30%   32C    P8    23W / 350W |     14MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce ...  Off  | 00000000:4B:00.0 Off |                  N/A |
| 30%   40C    P8    18W / 350W |     14MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
I also have:
$ uname -a
Linux echo 5.4.0-99-generic #112-Ubuntu SMP Thu Feb 3 13:50:55 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
$ lsb_release -a
LSB Version: core-11.1.0ubuntu2-noarch:security-11.1.0ubuntu2-noarch
Distributor ID: Ubuntu
Description: Ubuntu 20.04.3 LTS
Release: 20.04
Codename: focal
$ docker -v
Docker version 20.10.7, build 20.10.7-0ubuntu5~20.04.2
Also:
$ df -h | grep /dev/shm
tmpfs 126G 199M 126G 1% /dev/shm
And:
$ cat /boot/config-$(uname -r) | grep -i seccomp
CONFIG_SECCOMP=y
CONFIG_HAVE_ARCH_SECCOMP_FILTER=y
CONFIG_SECCOMP_FILTER=y
And:
[20:33:17] (dpcc) jalal@echo:~$ lspci -vv | grep -i nvidia
01:00.0 VGA compatible controller: NVIDIA Corporation Device 2204 (rev a1) (prog-if 00 [VGA controller])
	Kernel driver in use: nvidia
	Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
01:00.1 Audio device: NVIDIA Corporation Device 1aef (rev a1)
21:00.0 VGA compatible controller: NVIDIA Corporation Device 2204 (rev a1) (prog-if 00 [VGA controller])
	Kernel driver in use: nvidia
	Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
21:00.1 Audio device: NVIDIA Corporation Device 1aef (rev a1)
4a:00.0 VGA compatible controller: NVIDIA Corporation Device 2204 (rev a1) (prog-if 00 [VGA controller])
	Kernel driver in use: nvidia
	Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
4a:00.1 Audio device: NVIDIA Corporation Device 1aef (rev a1)
4b:00.0 VGA compatible controller: NVIDIA Corporation Device 2204 (rev a1) (prog-if 00 [VGA controller])
	Kernel driver in use: nvidia
	Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
4b:00.1 Audio device: NVIDIA Corporation Device 1aef (rev a1)
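The checks above confirm that the host driver is loaded, but none of them looks for the NVIDIA Container Toolkit, which Docker's `--gpus` flag depends on. A minimal POSIX shell sketch of that missing check follows. The helper name `check_tool` is hypothetical, and the binary names suggested in the usage note (`nvidia-container-runtime-hook`, `nvidia-container-cli`) are assumptions based on what the toolkit packages typically install.

```shell
#!/bin/sh
# Hypothetical helper: report whether a command is on PATH.
# Prints "found: <path>" and returns 0, or prints "missing: <name>" and returns 1.
check_tool() {
    if found_path=$(command -v "$1" 2>/dev/null); then
        echo "found: $found_path"
    else
        echo "missing: $1"
        return 1
    fi
}
```

Running `check_tool nvidia-container-runtime-hook` and `check_tool nvidia-container-cli` on the host would, under these assumptions, print `missing:` when the toolkit is not installed, which would be consistent with the `could not select device driver ""` error.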
Answer 1
$ distribution=$(. /etc/os-release;echo $ID$VERSION_ID) && curl -s -L https://nvidia.github.io/libnvidia-container/gpgkey | sudo apt-key add - && curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
$ sudo apt-get update
$ sudo apt-get install -y nvidia-docker2
$ sudo systemctl restart docker
$ docker run --shm-size 2GB -it --gpus all docurdt/heal
(base) root@9f66ed7b7c1b:/Workspace#
Very grym. Thanks for the link.
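The first command in Answer 1 derives the repository path from `/etc/os-release`; on this host (`ID=ubuntu`, `VERSION_ID="20.04"`) it resolves to `ubuntu20.04`. The sketch below wraps that URL construction in a function so the result can be inspected before anything is piped into `sudo tee`. The function name `nvidia_list_url` and its optional file argument are hypothetical additions, not part of the answer.

```shell
#!/bin/sh
# Hypothetical helper: print the libnvidia-container apt list URL that Answer 1 fetches.
# $1: os-release file to read (defaults to /etc/os-release).
nvidia_list_url() {
    os_release="${1:-/etc/os-release}"
    # Source the file inside the command substitution (a subshell), so ID and
    # VERSION_ID are not left set in the calling shell.
    distribution=$(. "$os_release" && echo "$ID$VERSION_ID")
    echo "https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list"
}
```

On Ubuntu 20.04, `nvidia_list_url` with no argument should print the `ubuntu20.04` variant of the URL used in Answer 1.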
Answer 2
The following three commands solved my problem:
sudo apt install -y nvidia-docker2
sudo systemctl daemon-reload
sudo systemctl restart docker
Source: https://forums.developer.nvidia.com/t/could-not-select-device-driver-with-capabilities-gpu/80200
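After `nvidia-docker2` is installed and Docker is restarted, one rough way to confirm that the package's configuration is in place is to look for the `nvidia` runtime entry it writes to `/etc/docker/daemon.json` (`docker info` should also list `nvidia` under `Runtimes`). The grep-based check below is a heuristic sketch, not a JSON parser, and `has_nvidia_runtime` is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical heuristic: succeed if the Docker daemon config declares a runtime
# keyed "nvidia" (the entry nvidia-docker2 installs). Not a real JSON parser.
# $1: daemon config path (defaults to /etc/docker/daemon.json).
has_nvidia_runtime() {
    config="${1:-/etc/docker/daemon.json}"
    # A missing or unreadable file means the runtime is not configured.
    [ -r "$config" ] || return 1
    # Match "nvidia" used as an object key, i.e. followed by a colon.
    grep -q '"nvidia"[[:space:]]*:' "$config"
}
```

For example, `has_nvidia_runtime && echo configured || echo 'not configured'` gives a quick yes/no after `sudo systemctl restart docker`.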