I am trying to run a Docker container with the option --gpus all. It gives me this error:
docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown.
ERRO[0001] error waiting for container:
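What is working
- I also had problems with torch and conda. I solved them by reinstalling cuda and torch, and everything finally worked, so my guess is that the Docker problem is not related to the GPU driver.
What I tried
- Following this GitHub issue, I tried to put my GPU in persistence mode.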
nvidia-persistenced failed to initialize. Check syslog for more details.
~$ sudo -i
root@PORT-BONNAV-l:~# nvidia-persistenced --user h.bonnavaud
nvidia-persistenced failed to initialize. Check syslog for more details.
root@PORT-BONNAV-l:~# nvidia-smi -pm 1
Persistence mode is already Enabled for GPU 00000000:01:00.0.
All done.
root@PORT-BONNAV-l:~# logout
It looks like the GPU is already in persistence mode, so the problem remains.
- I checked the nvidia-persistenced status
~$ sudo systemctl status nvidia-persistenced
● nvidia-persistenced.service - NVIDIA Persistence Daemon
Loaded: loaded (/lib/systemd/system/nvidia-persistenced.service; enabled; vendor preset: enabled)
Drop-In: /etc/systemd/system/nvidia-persistenced.service.d
└─override.conf
Active: active (running) since Wed 2023-09-27 14:30:25 CEST; 7min ago
Process: 12354 ExecStart=/usr/bin/nvidia-persistenced --user nvidia-persistenced --persistence-mode --verbose (code=exited, status=0/SUCCESS)
Main PID: 12355 (nvidia-persiste)
Tasks: 1 (limit: 38097)
Memory: 304.0K
CGroup: /system.slice/nvidia-persistenced.service
└─12355 /usr/bin/nvidia-persistenced --user nvidia-persistenced --persistence-mode --verbose
Sep 27 14:30:25 PORT-BONNAV-l systemd[1]: Starting NVIDIA Persistence Daemon...
Sep 27 14:30:25 PORT-BONNAV-l nvidia-persistenced[12355]: Verbose syslog connection opened
Sep 27 14:30:25 PORT-BONNAV-l nvidia-persistenced[12355]: Now running with user ID 129 and group ID 137
Sep 27 14:30:25 PORT-BONNAV-l nvidia-persistenced[12355]: Started (12355)
Sep 27 14:30:25 PORT-BONNAV-l nvidia-persistenced[12355]: device 0000:01:00.0 - registered
Sep 27 14:30:25 PORT-BONNAV-l nvidia-persistenced[12355]: device 0000:01:00.0 - persistence mode enabled.
Sep 27 14:30:25 PORT-BONNAV-l nvidia-persistenced[12355]: device 0000:01:00.0 - NUMA memory onlined.
Sep 27 14:30:25 PORT-BONNAV-l nvidia-persistenced[12355]: Local RPC services initialized
Sep 27 14:30:25 PORT-BONNAV-l systemd[1]: Started NVIDIA Persistence Daemon.
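Current status
- It does not look like a GPU driver problem.
- It does not look like a persistence mode problem.
- I can run Docker without the --gpus all option.
- I can use my GPU outside Docker (torch through cuda).
Where could this problem come from?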
System details
- Ubuntu 20.04
- sudo lshw -C display output
~$ sudo lshw -C display
*-display
description: 3D controller
product: NVIDIA Corporation
vendor: NVIDIA Corporation
physical id: 0
bus info: pci@0000:01:00.0
logical name: /dev/fb0
version: a1
width: 64 bits
clock: 33MHz
capabilities: pm msi pciexpress bus_master cap_list rom fb
configuration: depth=32 driver=nvidia latency=0 mode=1920x1080 visual=truecolor xres=1920 yres=1080
resources: iomemory:600-5ff iomemory:600-5ff irq:185 memory:a3000000-a3ffffff memory:6000000000-600fffffff memory:6010000000-6011ffffff ioport:3000(size=128)
*-display
description: VGA compatible controller
product: Intel Corporation
vendor: Intel Corporation
physical id: 2
bus info: pci@0000:00:02.0
version: 01
width: 64 bits
clock: 33MHz
capabilities: pciexpress msi pm vga_controller bus_master cap_list rom
configuration: driver=i915 latency=0
resources: iomemory:600-5ff iomemory:400-3ff irq:184 memory:6072000000-6072ffffff memory:4000000000-400fffffff ioport:4000(size=64) memory:c0000-dffff memory:4010000000-4016ffffff memory:4020000000-40ffffffff
- nvidia-smi output
~$ nvidia-smi
Wed Sep 27 14:53:26 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.43.04 Driver Version: 515.43.04 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA T600 Lap... On | 00000000:01:00.0 Off | N/A |
| N/A 54C P8 N/A / N/A | 603MiB / 4096MiB | 6% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 2498 G /usr/lib/xorg/Xorg 86MiB |
| 0 N/A N/A 3590 G /usr/lib/xorg/Xorg 203MiB |
| 0 N/A N/A 3785 G /usr/bin/gnome-shell 92MiB |
| 0 N/A N/A 5163 G /usr/lib/firefox/firefox 145MiB |
| 0 N/A N/A 7384 G ...b/thunderbird/thunderbird 67MiB |
+-----------------------------------------------------------------------------+
- sudo lshw -c video | grep 'configuration' output
~$ sudo lshw -c video | grep 'configuration'
configuration: depth=32 driver=nvidia latency=0 mode=1920x1080 visual=truecolor xres=1920 yres=1080
configuration: driver=i915 latency=0
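Thanks in advance for any help, hints, or links.
Answer 1
This problem usually occurs when the NVIDIA driver was installed with Ubuntu's ubuntu-drivers install tool. To fix it, you may need to reinstall the driver. First, uninstall the existing driver (these commands are specific to Ubuntu; for other distributions, check here):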
sudo apt-get --purge remove "*cuda*" "*cublas*" "*cufft*" "*cufile*" "*curand*" \
"*cusolver*" "*cusparse*" "*gds-tools*" "*npp*" "*nvjpeg*" "nsight*" "*nvvm*"
sudo apt-get --purge remove "*nvidia*" "libxnvctrl*"
sudo apt-get autoremove
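Before reinstalling, it can be worth confirming that the purge really removed everything. The following is only a sketch: count_leftovers is a hypothetical helper name (not part of dpkg or any NVIDIA tool), and it assumes the usual `dpkg -l` layout, where the first column is the package state and the second is the package name.

```shell
#!/bin/sh
# count_leftovers is a hypothetical helper, not part of dpkg.
# It reads `dpkg -l` output on stdin and counts packages that are still
# installed (state "ii") and whose name mentions nvidia or cuda.
count_leftovers() {
    awk '$1 == "ii" && tolower($2) ~ /nvidia|cuda/ { n++ } END { print n + 0 }'
}

# Only run the check on systems that actually have dpkg.
if command -v dpkg >/dev/null 2>&1; then
    echo "remaining NVIDIA/CUDA packages: $(dpkg -l | count_leftovers)"
fi
```

If the count is not 0, listing the matches with `dpkg -l | grep -Ei 'nvidia|cuda'` shows which packages are still present.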
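After that, it is strongly recommended to reinstall the driver with the apt package manager. Here are the instructions (these are for Ubuntu 22.04; check here for more platforms):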
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
# To install the legacy kernel module flavor
sudo apt-get install -y cuda-drivers
# To install the open kernel module flavor of specific version
# sudo apt-get install -y nvidia-driver-550-open
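The error in the question says that libnvidia-ml.so.1 could not be opened, so once the driver is reinstalled it may help to confirm that the dynamic linker can see that library before retrying Docker. This is a minimal sketch: lib_in_cache is a hypothetical helper name, and the check assumes that the linker cache printed by `ldconfig -p` reflects the libraries the container hook can find.

```shell
#!/bin/sh
# lib_in_cache is a hypothetical helper, not part of any NVIDIA tool.
# It reads a linker-cache listing (output of `ldconfig -p`) on stdin and
# prints "found" if the given library name appears in it, else "missing".
lib_in_cache() {
    if grep -F "$1" >/dev/null; then
        echo "found"
    else
        echo "missing"
    fi
}

# Check the library named in the error message, if ldconfig is available.
if command -v ldconfig >/dev/null 2>&1; then
    echo "libnvidia-ml.so.1: $(ldconfig -p | lib_in_cache 'libnvidia-ml.so.1')"
fi
```

If this prints "missing" while nvidia-smi still works on the host, the library is probably installed somewhere outside the linker cache, which would be consistent with the hook failure shown in the question.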
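Docker support
Note that the NVIDIA Container Toolkit was also uninstalled by the earlier apt-get --purge commands. You can reinstall it by following these steps.
For Ubuntu Server
It is better to switch your server to the HWE kernel: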
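As a rough outline of what reinstalling the toolkit can look like, the sketch below prints the usual commands and only executes them when APPLY=1 is set. It assumes the NVIDIA Container Toolkit apt repository is already configured on the machine, which the purge above may or may not have preserved. The step and reinstall_toolkit names, the APPLY variable, and the choice of the ubuntu image for the smoke test are illustrative choices, not something taken from the steps referenced above.

```shell
#!/bin/sh
# step is a hypothetical wrapper: it prints a command and runs it only
# when APPLY=1 is set in the environment (default is a dry run).
step() {
    echo "+ $*"
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    fi
}

# Assumes the NVIDIA Container Toolkit apt repository is already set up.
reinstall_toolkit() {
    step sudo apt-get update
    step sudo apt-get install -y nvidia-container-toolkit
    # Register the NVIDIA runtime in Docker's daemon configuration.
    step sudo nvidia-ctk runtime configure --runtime=docker
    step sudo systemctl restart docker
    # Smoke test: the container should print the same table as nvidia-smi on the host.
    step docker run --rm --gpus all ubuntu nvidia-smi
}

reinstall_toolkit
```

Saved as a script, running it as-is only lists the steps; running it with APPLY=1 in the environment executes them. The last step is the same kind of command that failed in the question, so it doubles as a check that the fix worked.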
sudo apt-get install --install-recommends linux-generic-hwe-22.04
By default, the driver also installs X11 components for you. If you do not need a desktop, you can install the headless version of the driver:
sudo apt-get install nvidia-headless-550