Containerd fails to start after Nvidia config

I followed this official tutorial to give a bare-metal k8s cluster GPU access. However, I ran into an error while doing so.

Environment: Kubernetes 1.21, containerd 1.4.11, and Ubuntu 20.04.3 LTS (GNU/Linux 5.4.0-91-generic x86_64).

The Nvidia driver is preinstalled on the host OS, version 495 Headless.

After pasting the following config into /etc/containerd/config.toml and restarting the service, containerd fails to start with exit 1.

Containerd config.toml

systemd log here

# persistent data location
root = "/var/lib/containerd"
# runtime state information
state = "/run/containerd"

# Kubernetes doesn't use containerd restart manager.
disabled_plugins = ["restart"]

# NVIDIA CONFIG START HERE

version = 2
[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "nvidia"

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"

# NVIDIA CONFIG ENDS HERE

[debug]
  level = ""

[grpc]
  max_recv_message_size = 16777216
  max_send_message_size = 16777216

[plugins.linux]
  shim = "/usr/bin/containerd-shim"
  runtime = "/usr/bin/runc"

I can confirm that the Nvidia driver does detect the GPU (Nvidia GTX 750Ti): running nvidia-smi gives the following output

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.44       Driver Version: 495.44       CUDA Version: 11.5     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:02:00.0 Off |                  N/A |
| 34%   34C    P8     1W /  38W |      0MiB /  2000MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

The modified config.toml that got it working.

Answer 1

As best I can tell, this is the cause:

Dec 02 03:15:36 k8s-node0 containerd[2179737]: containerd: invalid disabled plugin URI "restart" expect io.containerd.x.vx

Dec 02 03:15:36 k8s-node0 systemd[1]: containerd.service: Main process exited, code=exited, status=1/FAILURE

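So if you know that the restart-ish plugin is in fact enabled, you will need to track down its new URI syntax. I would actually recommend just commenting out that stanza, or using disabled_plugins = [], since the ansible role for containerd that we use doesn't mention anything about "restart" and does have the = [] flavor.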
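A minimal sketch of how the top of /etc/containerd/config.toml might look with that change applied, assuming the rest of the file stays as posted in the question. The commented-out alternative uses a fully qualified plugin URI; that name is an assumption based on the io.containerd.x.vx pattern in the error message and has not been verified against this setup.

```toml
# Top-level keys must appear before the first [table] header.
version = 2

# persistent data location
root = "/var/lib/containerd"
# runtime state information
state = "/run/containerd"

# An empty list disables no plugins, so the short name "restart"
# is no longer rejected when the config is loaded.
disabled_plugins = []

# Assumption (unverified): if the restart plugin really has to stay
# disabled, a version 2 config expects its full URI instead, e.g.
# disabled_plugins = ["io.containerd.internal.v1.restart"]
```

After editing the file, restarting the service and checking its status should show whether containerd now starts cleanly.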


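Tangentially, you may want to restrict your future journalctl invocations to containerd.service only, since the unfiltered output contains a lot of distracting text: journalctl -u containerd.service. You can even limit it to the last few lines, which can sometimes help further: journalctl -u containerd.service --lines=250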
