Cannot run a docker container with --gpus all

I am trying to run a docker container with the --gpus all option. It gives me this error:

docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown.
ERRO[0001] error waiting for container:

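What works

  • I also had problems with torch and conda. I first fixed them by reinstalling cuda and torch, and that eventually worked, so I assume the docker problem is not related to the GPU driver.

What I tried

  • Following this GitHub issue, I tried to put my GPU in persistence mode.
    • Following the first answer here, I tried editing the /lib/systemd/system/nvidia-persistenced.service file, but it changed nothing.
    • I tried another way, from the NVIDIA documentation: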
nvidia-persistenced failed to initialize. Check syslog for more details.

~$ sudo -i
root@PORT-BONNAV-l:~# nvidia-persistenced --user h.bonnavaud
nvidia-persistenced failed to initialize. Check syslog for more details.
root@PORT-BONNAV-l:~# nvidia-smi -pm 1
Persistence mode is already Enabled for GPU 00000000:01:00.0.
All done.
root@PORT-BONNAV-l:~# logout

It looks like the GPU is already in persistence mode, so the problem remains.

  • I checked the nvidia-persistenced status:
~$ sudo systemctl status nvidia-persistenced
● nvidia-persistenced.service - NVIDIA Persistence Daemon
     Loaded: loaded (/lib/systemd/system/nvidia-persistenced.service; enabled; vendor preset: enabled)
    Drop-In: /etc/systemd/system/nvidia-persistenced.service.d
             └─override.conf
     Active: active (running) since Wed 2023-09-27 14:30:25 CEST; 7min ago
    Process: 12354 ExecStart=/usr/bin/nvidia-persistenced --user nvidia-persistenced --persistence-mode --verbose (code=exited, status=0/SUCCESS)
   Main PID: 12355 (nvidia-persiste)
      Tasks: 1 (limit: 38097)
     Memory: 304.0K
     CGroup: /system.slice/nvidia-persistenced.service
             └─12355 /usr/bin/nvidia-persistenced --user nvidia-persistenced --persistence-mode --verbose

Sep 27 14:30:25 PORT-BONNAV-l systemd[1]: Starting NVIDIA Persistence Daemon...
Sep 27 14:30:25 PORT-BONNAV-l nvidia-persistenced[12355]: Verbose syslog connection opened
Sep 27 14:30:25 PORT-BONNAV-l nvidia-persistenced[12355]: Now running with user ID 129 and group ID 137
Sep 27 14:30:25 PORT-BONNAV-l nvidia-persistenced[12355]: Started (12355)
Sep 27 14:30:25 PORT-BONNAV-l nvidia-persistenced[12355]: device 0000:01:00.0 - registered
Sep 27 14:30:25 PORT-BONNAV-l nvidia-persistenced[12355]: device 0000:01:00.0 - persistence mode enabled.
Sep 27 14:30:25 PORT-BONNAV-l nvidia-persistenced[12355]: device 0000:01:00.0 - NUMA memory onlined.
Sep 27 14:30:25 PORT-BONNAV-l nvidia-persistenced[12355]: Local RPC services initialized
Sep 27 14:30:25 PORT-BONNAV-l systemd[1]: Started NVIDIA Persistence Daemon.

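Current state

  • This does not look like a GPU driver problem.
  • This does not look like a persistence mode problem.
  • I can run docker without the --gpus all option.
  • I can use my GPU outside docker (torch through cuda).

Where could this problem come from?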
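
One way to narrow this down: the error says the container hook could not load libnvidia-ml.so.1, so it is worth checking whether the host's dynamic linker knows about that library at all. The following is only a minimal sketch; the helper name has_nvml is made up for illustration, and a missing entry is a hint, not a full diagnosis.

```shell
#!/bin/sh
# has_nvml (hypothetical helper): reads `ldconfig -p` style text on stdin
# and succeeds if libnvidia-ml.so.1 is listed.
has_nvml() {
    grep -q 'libnvidia-ml\.so\.1'
}

if ! command -v ldconfig >/dev/null 2>&1; then
    # On some systems ldconfig lives in /sbin and is not in a user's PATH.
    echo "ldconfig not found in PATH; try /sbin/ldconfig -p instead"
elif ldconfig -p | has_nvml; then
    echo "libnvidia-ml.so.1 is registered with the dynamic linker"
else
    echo "libnvidia-ml.so.1 is missing from the linker cache"
fi
```

If the library is missing from the cache even though nvidia-smi works, that points toward how the driver packages were installed rather than toward persistence mode.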

System details

  • Ubuntu 20.04
  • sudo lshw -C display output
~$ sudo lshw -C display
  *-display                 
       description: 3D controller
       product: NVIDIA Corporation
       vendor: NVIDIA Corporation
       physical id: 0
       bus info: pci@0000:01:00.0
       logical name: /dev/fb0
       version: a1
       width: 64 bits
       clock: 33MHz
       capabilities: pm msi pciexpress bus_master cap_list rom fb
       configuration: depth=32 driver=nvidia latency=0 mode=1920x1080 visual=truecolor xres=1920 yres=1080
       resources: iomemory:600-5ff iomemory:600-5ff irq:185 memory:a3000000-a3ffffff memory:6000000000-600fffffff memory:6010000000-6011ffffff ioport:3000(size=128)
  *-display
       description: VGA compatible controller
       product: Intel Corporation
       vendor: Intel Corporation
       physical id: 2
       bus info: pci@0000:00:02.0
       version: 01
       width: 64 bits
       clock: 33MHz
       capabilities: pciexpress msi pm vga_controller bus_master cap_list rom
       configuration: driver=i915 latency=0
       resources: iomemory:600-5ff iomemory:400-3ff irq:184 memory:6072000000-6072ffffff memory:4000000000-400fffffff ioport:4000(size=64) memory:c0000-dffff memory:4010000000-4016ffffff memory:4020000000-40ffffffff
  • nvidia-smi output
~$ nvidia-smi
Wed Sep 27 14:53:26 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.43.04    Driver Version: 515.43.04    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA T600 Lap...  On   | 00000000:01:00.0 Off |                  N/A |
| N/A   54C    P8    N/A /  N/A |    603MiB /  4096MiB |      6%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2498      G   /usr/lib/xorg/Xorg                 86MiB |
|    0   N/A  N/A      3590      G   /usr/lib/xorg/Xorg                203MiB |
|    0   N/A  N/A      3785      G   /usr/bin/gnome-shell               92MiB |
|    0   N/A  N/A      5163      G   /usr/lib/firefox/firefox          145MiB |
|    0   N/A  N/A      7384      G   ...b/thunderbird/thunderbird       67MiB |
+-----------------------------------------------------------------------------+
  • sudo lshw -c video | grep 'configuration' output
~$ sudo lshw -c video | grep 'configuration'
configuration: depth=32 driver=nvidia latency=0 mode=1920x1080 visual=truecolor xres=1920 yres=1080
configuration: driver=i915 latency=0

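Thanks in advance for any help, hints, or links.

Answer 1

This problem usually occurs when the NVIDIA driver was installed with Ubuntu's ubuntu-drivers install tool. To fix it, you may need to reinstall the driver. First, uninstall the existing driver (these commands are specific to Ubuntu; for other distributions, check here):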

sudo apt-get --purge remove "*cuda*" "*cublas*" "*cufft*" "*cufile*" "*curand*" \
 "*cusolver*" "*cusparse*" "*gds-tools*" "*npp*" "*nvjpeg*" "nsight*" "*nvvm*"
sudo apt-get --purge remove "*nvidia*" "libxnvctrl*"
sudo apt-get autoremove
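
Since the purge patterns above are broad, it may help to list which matching packages dpkg knows about before removing anything. This is a rough sketch: the helper name filter_nvidia_pkgs is hypothetical, and its pattern covers only a few of the globs used above, so treat the output as a preview rather than an exact list of what apt-get will remove.

```shell
#!/bin/sh
# filter_nvidia_pkgs (hypothetical helper): keeps lines on stdin that look
# like NVIDIA/CUDA packages. `|| true` keeps the exit status 0 when nothing matches.
filter_nvidia_pkgs() {
    grep -E 'nvidia|cuda|cublas|libxnvctrl' || true
}

if command -v dpkg-query >/dev/null 2>&1; then
    # Prints every package name known to dpkg, including removed packages
    # that still have leftover configuration files.
    dpkg-query -W -f='${Package}\n' | filter_nvidia_pkgs
fi
```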

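After that, it is strongly recommended to reinstall the driver with the apt package manager. Here are the instructions (these are again for Ubuntu, version 22.04; see here for more platforms):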

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
# To install the legacy kernel module flavor
sudo apt-get install -y cuda-drivers
# To install the open kernel module flavor of specific version
# sudo apt-get install -y nvidia-driver-550-open
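
The URL above hard-codes ubuntu2204, while the question reports Ubuntu 20.04. A small sketch for deriving the repository directory from the running release follows. It assumes the directory name is "ubuntu" followed by the version without the dot, which matches ubuntu2204 above; that ubuntu2004 follows the same pattern is an assumption worth confirming in a browser first. The function name cuda_repo_slug is made up for this example.

```shell
#!/bin/sh
# cuda_repo_slug (hypothetical helper): "22.04" -> "ubuntu2204"
cuda_repo_slug() {
    printf 'ubuntu%s\n' "$(printf '%s' "$1" | tr -d '.')"
}

# Print the repository URL for this machine, but only on Ubuntu.
# The subshell keeps the variables from /etc/os-release out of the caller's environment.
if [ -r /etc/os-release ]; then
    (
        . /etc/os-release
        if [ "${ID:-}" = "ubuntu" ]; then
            echo "https://developer.download.nvidia.com/compute/cuda/repos/$(cuda_repo_slug "${VERSION_ID:-}")/x86_64/"
        fi
    )
fi
```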

Docker support

Note that the NVIDIA Container Toolkit was also uninstalled by the earlier apt-get --purge command. You can reinstall it by following these steps.
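
As a rough sketch of what that reinstall typically looks like, assuming the NVIDIA Container Toolkit apt repository is already configured on the machine: the script below defaults to a dry run that only prints the commands, and the helper names run and reinstall_toolkit are hypothetical. The last command is a smoke test using the plain ubuntu image.

```shell
#!/bin/sh
# DRY_RUN=1 (the default here) prints each command instead of executing it.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

reinstall_toolkit() {
    # Assumes the toolkit's apt repository has already been added.
    run sudo apt-get install -y nvidia-container-toolkit
    # Register the NVIDIA runtime with Docker, then restart the daemon.
    run sudo nvidia-ctk runtime configure --runtime=docker
    run sudo systemctl restart docker
    # Smoke test: nvidia-smi should print the GPU table from inside a container.
    run sudo docker run --rm --gpus all ubuntu nvidia-smi
}

reinstall_toolkit
```

Running the script with DRY_RUN=0 executes the commands for real.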

For Ubuntu Server

It is best to switch your server to the HWE kernel:

sudo apt-get install --install-recommends linux-generic-hwe-22.04

By default, the driver also installs X11 components. If you do not need a desktop, you can install the headless version of the driver:

sudo apt-get install nvidia-headless-550
