Xorg fails to start with a GPU in GKE: (EE) no screens found(EE)

I am trying to run a GPU-backed Xorg server in Google Kubernetes Engine.

  1. I followed this guide (https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#ubuntu) to set up a GKE cluster with an NVIDIA Tesla T4 GPU. The nodes use the Ubuntu node image (Docker).

  2. I deployed a Pod:

apiVersion: v1
kind: Pod
metadata:
  name: my-gpu-pod
spec:
  containers:
  - name: my-gpu-container
    image: nvidia/cuda:10.0-runtime-ubuntu18.04
    command: ["/bin/bash", "-c", "--"]
    args: ["while true; do sleep 60000; done;"]
    securityContext:
      privileged: true
    resources:
      limits:
        nvidia.com/gpu: 1
  3. I installed the NVIDIA driver in the container, following the instructions.

I can verify that the NVIDIA GPU is available inside the container:

root@my-gpu-pod:/# nvidia-smi 
Wed May 26 10:31:43 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.102.04   Driver Version: 450.102.04   CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   37C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
  4. I installed the Xorg packages in the container.

  5. When I start the Xorg server, I get this error:

Fatal server error:
(EE) no screens found(EE)

Here is the full output:

root@my-gpu-pod:/# /usr/bin/Xorg -verbose 3 -novtswitch -keeptty -verbose -allowMouseOpenFail

X.Org X Server 1.19.6
Release Date: 2017-12-20
X Protocol Version 11, Revision 0
Build Operating System: Linux 4.15.0-140-generic x86_64 Ubuntu
Current Operating System: Linux my-gpu-pod 5.4.0-1039-gke #41~18.04.1-Ubuntu SMP Sat Mar 20 14:57:07 UTC 2021 x86_64
Kernel command line: BOOT_IMAGE=/boot/vmlinuz-5.4.0-1039-gke root=PARTUUID=e0001179-e649-415c-890d-562fb24bb2eb ro console=ttyS0 net.ifnames=0
Build Date: 08 April 2021  01:57:21PM
xorg-server 2:1.19.6-1ubuntu4.9 (For technical support please see http://www.ubuntu.com/support) 
Current version of pixman: 0.34.0
        Before reporting problems, check http://wiki.x.org
        to make sure that you have the latest version.
Markers: (--) probed, (**) from config file, (==) default setting,
        (++) from command line, (!!) notice, (II) informational,
        (WW) warning, (EE) error, (NI) not implemented, (??) unknown.
(==) Log file: "/var/log/Xorg.0.log", Time: Wed May 26 09:47:27 2021
(==) Using config file: "/etc/X11/xorg.conf"
(==) Using system config directory "/usr/share/X11/xorg.conf.d"
(==) ServerLayout "Layout0"
(**) |-->Screen "Screen0" (0)
(**) |   |-->Monitor "Monitor0"
(**) |   |-->Device "Device0"
(**) |-->Input Device "Keyboard0"
(**) |-->Input Device "Mouse0"
(==) Automatically adding devices
(==) Automatically enabling devices
(==) Automatically adding GPU devices
(==) Automatically binding GPU devices
(==) Max clients allowed: 256, resource mask: 0x1fffff
(WW) The directory "/usr/share/fonts/X11/cyrillic" does not exist.
        Entry deleted from font path.
(WW) The directory "/usr/share/fonts/X11/100dpi/" does not exist.
        Entry deleted from font path.
(WW) The directory "/usr/share/fonts/X11/75dpi/" does not exist.
        Entry deleted from font path.
(WW) The directory "/usr/share/fonts/X11/100dpi" does not exist.
        Entry deleted from font path.
(WW) The directory "/usr/share/fonts/X11/75dpi" does not exist.
        Entry deleted from font path.
(==) FontPath set to:
        /usr/share/fonts/X11/misc,
        /usr/share/fonts/X11/Type1,
        built-ins
(==) ModulePath set to "/usr/lib/xorg/modules"
(WW) Hotplugging is on, devices using drivers 'kbd', 'mouse' or 'vmmouse' will be disabled.
(WW) Disabling Keyboard0
(WW) Disabling Mouse0
(II) Loader magic: 0x564fdc83b020
(II) Module ABI versions:
        X.Org ANSI C Emulation: 0.4
        X.Org Video Driver: 23.0
        X.Org XInput driver : 24.1
        X.Org Server Extension : 10.0
(EE) dbus-core: error connecting to system bus: org.freedesktop.DBus.Error.FileNotFound (Failed to connect to socket /var/run/dbus/system_bus_socket: No such file or directory)
(--) using VT number 3

(II) xfree86: Adding drm device (/dev/dri/card0)
(**) OutputClass "nvidia" ModulePath extended to "/usr/lib/x86_64-linux-gnu/nvidia/xorg,/usr/lib/xorg/modules"
(--) PCI: (0:0:4:0) 10de:1eb8:10de:12a2 rev 161, Mem @ 0xc0000000/16777216, 0x7c0000000/268435456, 0x7d0000000/33554432
(II) no primary bus or device found
        falling back to /sys/devices/pci0000:00/0000:00:04.0/drm/card0
(II) LoadModule: "glx"
(II) Loading /usr/lib/xorg/modules/extensions/libglx.so
(II) Module glx: vendor="X.Org Foundation"
        compiled for 1.19.6, module version = 1.0.0
        ABI class: X.Org Server Extension, version 10.0
(II) LoadModule: "nvidia"
(II) Loading /usr/lib/x86_64-linux-gnu/nvidia/xorg/nvidia_drv.so
(II) Module nvidia: vendor="NVIDIA Corporation"
        compiled for 1.6.99.901, module version = 1.0.0
        Module class: X.Org Video Driver
(II) NVIDIA dlloader X Driver  465.19.01  Fri Mar 19 07:56:53 UTC 2021
(II) NVIDIA Unified Driver for all Supported NVIDIA GPUs
(II) Loading sub module "fb"
(II) LoadModule: "fb"
(II) Loading /usr/lib/xorg/modules/libfb.so
(II) Module fb: vendor="X.Org Foundation"
        compiled for 1.19.6, module version = 1.0.0
        ABI class: X.Org ANSI C Emulation, version 0.4
(II) Loading sub module "wfb"
(II) LoadModule: "wfb"
(II) Loading /usr/lib/xorg/modules/libwfb.so
(II) Module wfb: vendor="X.Org Foundation"
        compiled for 1.19.6, module version = 1.0.0
        ABI class: X.Org ANSI C Emulation, version 0.4
(II) Loading sub module "ramdac"
(II) LoadModule: "ramdac"
(II) Module "ramdac" already built-in
(EE) NVIDIA: Failed to initialize the NVIDIA kernel module. Please see the
(EE) NVIDIA:     system's kernel log for additional error messages and
(EE) NVIDIA:     consult the NVIDIA README for details.
(EE) NVIDIA: Failed to initialize the NVIDIA kernel module. Please see the
(EE) NVIDIA:     system's kernel log for additional error messages and
(EE) NVIDIA:     consult the NVIDIA README for details.
(EE) NVIDIA: Failed to initialize the NVIDIA kernel module. Please see the
(EE) NVIDIA:     system's kernel log for additional error messages and
(EE) NVIDIA:     consult the NVIDIA README for details.
(EE) No devices detected.
(II) Applying OutputClass "nvidia" to /dev/dri/card0
        loading driver: nvidia
(==) Matched nvidia as autoconfigured driver 0
(==) Matched nouveau as autoconfigured driver 1
(==) Matched nouveau as autoconfigured driver 2
(==) Matched modesetting as autoconfigured driver 3
(==) Matched fbdev as autoconfigured driver 4
(==) Matched vesa as autoconfigured driver 5
(==) Assigned the driver to the xf86ConfigLayout
(II) LoadModule: "nvidia"
(II) Loading /usr/lib/x86_64-linux-gnu/nvidia/xorg/nvidia_drv.so
(II) Module nvidia: vendor="NVIDIA Corporation"
        compiled for 1.6.99.901, module version = 1.0.0
        Module class: X.Org Video Driver
(II) UnloadModule: "nvidia"
(II) Unloading nvidia
(II) Failed to load module "nvidia" (already loaded, 22095)
(II) LoadModule: "nouveau"
(II) Loading /usr/lib/xorg/modules/drivers/nouveau_drv.so
(II) Module nouveau: vendor="X.Org Foundation"
        compiled for 1.19.3, module version = 1.0.15
        Module class: X.Org Video Driver
        ABI class: X.Org Video Driver, version 23.0
(II) LoadModule: "modesetting"
(II) Loading /usr/lib/xorg/modules/drivers/modesetting_drv.so
(II) Module modesetting: vendor="X.Org Foundation"
        compiled for 1.19.6, module version = 1.19.6
        Module class: X.Org Video Driver
        ABI class: X.Org Video Driver, version 23.0
(II) LoadModule: "fbdev"
(II) Loading /usr/lib/xorg/modules/drivers/fbdev_drv.so
(II) Module fbdev: vendor="X.Org Foundation"
        compiled for 1.19.3, module version = 0.4.4
        Module class: X.Org Video Driver
        ABI class: X.Org Video Driver, version 23.0
(II) LoadModule: "vesa"
(II) Loading /usr/lib/xorg/modules/drivers/vesa_drv.so
(II) Module vesa: vendor="X.Org Foundation"
        compiled for 1.19.3, module version = 2.3.4
        Module class: X.Org Video Driver
        ABI class: X.Org Video Driver, version 23.0
(II) NVIDIA dlloader X Driver  465.19.01  Fri Mar 19 07:56:53 UTC 2021
(II) NVIDIA Unified Driver for all Supported NVIDIA GPUs
(II) NOUVEAU driver Date:   Fri Apr 21 14:41:17 2017 -0400
(II) NOUVEAU driver for NVIDIA chipset families :
        RIVA TNT        (NV04)
        RIVA TNT2       (NV05)
        GeForce 256     (NV10)
        GeForce 2       (NV11, NV15)
        GeForce 4MX     (NV17, NV18)
        GeForce 3       (NV20)
        GeForce 4Ti     (NV25, NV28)
        GeForce FX      (NV3x)
        GeForce 6       (NV4x)
        GeForce 7       (G7x)
        GeForce 8       (G8x)
        GeForce GTX 200 (NVA0)
        GeForce GTX 400 (NVC0)
(II) modesetting: Driver for Modesetting Kernel Drivers: kms
(II) FBDEV: driver for framebuffer: fbdev
(II) VESA: driver for VESA chipsets: vesa
(EE) NVIDIA: Failed to initialize the NVIDIA kernel module. Please see the
(EE) NVIDIA:     system's kernel log for additional error messages and
(EE) NVIDIA:     consult the NVIDIA README for details.
(EE) NVIDIA: Failed to initialize the NVIDIA kernel module. Please see the
(EE) NVIDIA:     system's kernel log for additional error messages and
(EE) NVIDIA:     consult the NVIDIA README for details.
(EE) [drm] Failed to open DRM device for (null): -2
(EE) [drm] Failed to open DRM device for (null): -2
(EE) [drm] Failed to open DRM device for pci:0000:00:04.0: -2
(EE) [drm] Failed to open DRM device for pci:0000:00:04.0: -2
(WW) Falling back to old probe method for modesetting
(WW) Falling back to old probe method for fbdev
(II) Loading sub module "fbdevhw"
(II) LoadModule: "fbdevhw"
(II) Loading /usr/lib/xorg/modules/libfbdevhw.so
(II) Module fbdevhw: vendor="X.Org Foundation"
        compiled for 1.19.6, module version = 0.0.2
        ABI class: X.Org Video Driver, version 23.0
(EE) open /dev/fb0: No such file or directory
(WW) Falling back to old probe method for vesa
(EE) [drm] Failed to open DRM device for (null): -2
(EE) Screen 0 deleted because of no matching config section.
(II) UnloadModule: "modesetting"
(EE) Device(s) detected, but none match those in the config file.
(EE) 
Fatal server error:
(EE) no screens found(EE) 
(EE) 
Please consult the The X.Org Foundation support 
         at http://wiki.x.org
 for help. 
(EE) Please also check the log file at "/var/log/Xorg.0.log" for additional information.
(EE) 
(EE) Server terminated with error (1). Closing log file.

I regenerated xorg.conf with nvidia-xconfig, but it made no difference.
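The log above is long, so a short triage script can help when reading it. The sketch below is illustrative only: it collects the distinct (EE) lines and compares the NVIDIA X driver version printed in the Xorg log with the driver version printed by nvidia-smi. The function names are hypothetical, and the input strings are excerpts copied from the output in this post. In that output the two versions differ (465.19.01 versus 450.102.04); whether this mismatch is what makes the kernel module fail to initialize is an assumption, not something the log confirms.

```python
import re


def summarize_xorg_log(log_text):
    """Return the distinct (EE) messages and any NVIDIA X driver versions seen."""
    errors = []
    for line in log_text.splitlines():
        line = line.strip()
        # Skip bare "(EE)" separator lines and messages already recorded.
        if line.startswith("(EE)") and line != "(EE)" and line not in errors:
            errors.append(line)
    x_driver_versions = sorted(set(
        re.findall(r"NVIDIA dlloader X Driver\s+(\d+(?:\.\d+)+)", log_text)))
    return errors, x_driver_versions


def kernel_driver_version(nvidia_smi_text):
    """Extract the 'Driver Version' field from the nvidia-smi banner, or None."""
    match = re.search(r"Driver Version:\s*(\d+(?:\.\d+)+)", nvidia_smi_text)
    return match.group(1) if match else None


# Excerpts copied from the output shown in this post.
SMI = "| NVIDIA-SMI 450.102.04   Driver Version: 450.102.04   CUDA Version: 11.0     |"
LOG = """\
(II) NVIDIA dlloader X Driver  465.19.01  Fri Mar 19 07:56:53 UTC 2021
(EE) NVIDIA: Failed to initialize the NVIDIA kernel module. Please see the
(EE) NVIDIA: Failed to initialize the NVIDIA kernel module. Please see the
(EE) No devices detected.
(EE) open /dev/fb0: No such file or directory
(EE) 
(EE) no screens found(EE) 
"""

errors, x_versions = summarize_xorg_log(LOG)
kernel_version = kernel_driver_version(SMI)

for message in errors:
    print(message)
print("X driver:", x_versions, "| driver reported by nvidia-smi:", kernel_version)
if kernel_version is not None and x_versions and kernel_version not in x_versions:
    print("Note: the X driver and nvidia-smi report different driver versions.")
```

To run it against a real session, replace LOG with the contents of /var/log/Xorg.0.log and SMI with the output of nvidia-smi.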

Answer 1

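To run a headless Xorg (that is, without a monitor) on Ubuntu 18.04, you need to install the dummy video driver. A related question covers this in more detail.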

In short, install the xserver-xorg-video-dummy package and use the dummy driver in your xorg.conf:

Section "Device"
    Identifier  "Configured Video Device"
    Driver      "dummy"
EndSection

Section "Monitor"
    Identifier  "Configured Monitor"
    HorizSync 31.5-48.5
    VertRefresh 50-70
EndSection

Section "Screen"
    Identifier  "Default Screen"
    Monitor     "Configured Monitor"
    Device      "Configured Video Device"
    DefaultDepth 24
    SubSection "Display"
        Depth 24
        Modes "1024x800"
    EndSubSection
EndSection
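If the resolution or colour depth needs to vary between containers, the configuration above can be generated instead of hand-edited. This is a minimal sketch, assuming the same section layout as the answer; `render_dummy_xorg_conf` is a hypothetical helper, and the defaults simply mirror the values shown above.

```python
# Template mirroring the dummy-driver xorg.conf from the answer.
# Only the mode (width x height) and the colour depth are parameterized.
TEMPLATE = """\
Section "Device"
    Identifier  "Configured Video Device"
    Driver      "dummy"
EndSection

Section "Monitor"
    Identifier  "Configured Monitor"
    HorizSync   31.5-48.5
    VertRefresh 50-70
EndSection

Section "Screen"
    Identifier   "Default Screen"
    Monitor      "Configured Monitor"
    Device       "Configured Video Device"
    DefaultDepth {depth}
    SubSection "Display"
        Depth {depth}
        Modes "{width}x{height}"
    EndSubSection
EndSection
"""


def render_dummy_xorg_conf(width=1024, height=800, depth=24):
    """Return xorg.conf text for a headless X server using the dummy driver."""
    return TEMPLATE.format(width=width, height=height, depth=depth)


# Print the rendered config; saving it to /etc/X11/xorg.conf (the path the
# log in the question shows Xorg reading) is left to the caller.
print(render_dummy_xorg_conf())
```

The rendered text is printed rather than written to disk, so the script is safe to run anywhere before the result is copied into the container.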
