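I am trying to run an Xorg server that uses the GPU in Google Kubernetes Engine.
I followed this guide (https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#ubuntu) to set up a GKE cluster with an Nvidia Tesla T4 GPU. The nodes are based on the Ubuntu image (Docker).
I deployed this Pod: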
apiVersion: v1
kind: Pod
metadata:
  name: my-gpu-pod
spec:
  containers:
  - name: my-gpu-container
    image: nvidia/cuda:10.0-runtime-ubuntu18.04
    command: ["/bin/bash", "-c", "--"]
    args: ["while true; do sleep 60000; done;"]
    securityContext:
      privileged: true
    resources:
      limits:
        nvidia.com/gpu: 1
I installed the Nvidia driver in the container, following the instructions.
I can verify that the Nvidia GPU is available in the container:
root@my-gpu-pod:/# nvidia-smi
Wed May 26 10:31:43 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.102.04   Driver Version: 450.102.04   CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   37C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
I installed the Xorg packages in the container.
When I start the Xorg server, I get this error:
Fatal server error:
(EE) no screens found(EE)
Here is the full output:
root@my-gpu-pod:/# /usr/bin/Xorg -verbose 3 -novtswitch -keeptty -verbose -allowMouseOpenFail
X.Org X Server 1.19.6
Release Date: 2017-12-20
X Protocol Version 11, Revision 0
Build Operating System: Linux 4.15.0-140-generic x86_64 Ubuntu
Current Operating System: Linux my-gpu-pod 5.4.0-1039-gke #41~18.04.1-Ubuntu SMP Sat Mar 20 14:57:07 UTC 2021 x86_64
Kernel command line: BOOT_IMAGE=/boot/vmlinuz-5.4.0-1039-gke root=PARTUUID=e0001179-e649-415c-890d-562fb24bb2eb ro console=ttyS0 net.ifnames=0
Build Date: 08 April 2021 01:57:21PM
xorg-server 2:1.19.6-1ubuntu4.9 (For technical support please see http://www.ubuntu.com/support)
Current version of pixman: 0.34.0
Before reporting problems, check http://wiki.x.org
to make sure that you have the latest version.
Markers: (--) probed, (**) from config file, (==) default setting,
(++) from command line, (!!) notice, (II) informational,
(WW) warning, (EE) error, (NI) not implemented, (??) unknown.
(==) Log file: "/var/log/Xorg.0.log", Time: Wed May 26 09:47:27 2021
(==) Using config file: "/etc/X11/xorg.conf"
(==) Using system config directory "/usr/share/X11/xorg.conf.d"
(==) ServerLayout "Layout0"
(**) |-->Screen "Screen0" (0)
(**) | |-->Monitor "Monitor0"
(**) | |-->Device "Device0"
(**) |-->Input Device "Keyboard0"
(**) |-->Input Device "Mouse0"
(==) Automatically adding devices
(==) Automatically enabling devices
(==) Automatically adding GPU devices
(==) Automatically binding GPU devices
(==) Max clients allowed: 256, resource mask: 0x1fffff
(WW) The directory "/usr/share/fonts/X11/cyrillic" does not exist.
Entry deleted from font path.
(WW) The directory "/usr/share/fonts/X11/100dpi/" does not exist.
Entry deleted from font path.
(WW) The directory "/usr/share/fonts/X11/75dpi/" does not exist.
Entry deleted from font path.
(WW) The directory "/usr/share/fonts/X11/100dpi" does not exist.
Entry deleted from font path.
(WW) The directory "/usr/share/fonts/X11/75dpi" does not exist.
Entry deleted from font path.
(==) FontPath set to:
/usr/share/fonts/X11/misc,
/usr/share/fonts/X11/Type1,
built-ins
(==) ModulePath set to "/usr/lib/xorg/modules"
(WW) Hotplugging is on, devices using drivers 'kbd', 'mouse' or 'vmmouse' will be disabled.
(WW) Disabling Keyboard0
(WW) Disabling Mouse0
(II) Loader magic: 0x564fdc83b020
(II) Module ABI versions:
X.Org ANSI C Emulation: 0.4
X.Org Video Driver: 23.0
X.Org XInput driver : 24.1
X.Org Server Extension : 10.0
(EE) dbus-core: error connecting to system bus: org.freedesktop.DBus.Error.FileNotFound (Failed to connect to socket /var/run/dbus/system_bus_socket: No such file or directory)
(--) using VT number 3
(II) xfree86: Adding drm device (/dev/dri/card0)
(**) OutputClass "nvidia" ModulePath extended to "/usr/lib/x86_64-linux-gnu/nvidia/xorg,/usr/lib/xorg/modules"
(--) PCI: (0:0:4:0) 10de:1eb8:10de:12a2 rev 161, Mem @ 0xc0000000/16777216, 0x7c0000000/268435456, 0x7d0000000/33554432
(II) no primary bus or device found
falling back to /sys/devices/pci0000:00/0000:00:04.0/drm/card0
(II) LoadModule: "glx"
(II) Loading /usr/lib/xorg/modules/extensions/libglx.so
(II) Module glx: vendor="X.Org Foundation"
compiled for 1.19.6, module version = 1.0.0
ABI class: X.Org Server Extension, version 10.0
(II) LoadModule: "nvidia"
(II) Loading /usr/lib/x86_64-linux-gnu/nvidia/xorg/nvidia_drv.so
(II) Module nvidia: vendor="NVIDIA Corporation"
compiled for 1.6.99.901, module version = 1.0.0
Module class: X.Org Video Driver
(II) NVIDIA dlloader X Driver 465.19.01 Fri Mar 19 07:56:53 UTC 2021
(II) NVIDIA Unified Driver for all Supported NVIDIA GPUs
(II) Loading sub module "fb"
(II) LoadModule: "fb"
(II) Loading /usr/lib/xorg/modules/libfb.so
(II) Module fb: vendor="X.Org Foundation"
compiled for 1.19.6, module version = 1.0.0
ABI class: X.Org ANSI C Emulation, version 0.4
(II) Loading sub module "wfb"
(II) LoadModule: "wfb"
(II) Loading /usr/lib/xorg/modules/libwfb.so
(II) Module wfb: vendor="X.Org Foundation"
compiled for 1.19.6, module version = 1.0.0
ABI class: X.Org ANSI C Emulation, version 0.4
(II) Loading sub module "ramdac"
(II) LoadModule: "ramdac"
(II) Module "ramdac" already built-in
(EE) NVIDIA: Failed to initialize the NVIDIA kernel module. Please see the
(EE) NVIDIA: system's kernel log for additional error messages and
(EE) NVIDIA: consult the NVIDIA README for details.
(EE) NVIDIA: Failed to initialize the NVIDIA kernel module. Please see the
(EE) NVIDIA: system's kernel log for additional error messages and
(EE) NVIDIA: consult the NVIDIA README for details.
(EE) NVIDIA: Failed to initialize the NVIDIA kernel module. Please see the
(EE) NVIDIA: system's kernel log for additional error messages and
(EE) NVIDIA: consult the NVIDIA README for details.
(EE) No devices detected.
(II) Applying OutputClass "nvidia" to /dev/dri/card0
loading driver: nvidia
(==) Matched nvidia as autoconfigured driver 0
(==) Matched nouveau as autoconfigured driver 1
(==) Matched nouveau as autoconfigured driver 2
(==) Matched modesetting as autoconfigured driver 3
(==) Matched fbdev as autoconfigured driver 4
(==) Matched vesa as autoconfigured driver 5
(==) Assigned the driver to the xf86ConfigLayout
(II) LoadModule: "nvidia"
(II) Loading /usr/lib/x86_64-linux-gnu/nvidia/xorg/nvidia_drv.so
(II) Module nvidia: vendor="NVIDIA Corporation"
compiled for 1.6.99.901, module version = 1.0.0
Module class: X.Org Video Driver
(II) UnloadModule: "nvidia"
(II) Unloading nvidia
(II) Failed to load module "nvidia" (already loaded, 22095)
(II) LoadModule: "nouveau"
(II) Loading /usr/lib/xorg/modules/drivers/nouveau_drv.so
(II) Module nouveau: vendor="X.Org Foundation"
compiled for 1.19.3, module version = 1.0.15
Module class: X.Org Video Driver
ABI class: X.Org Video Driver, version 23.0
(II) LoadModule: "modesetting"
(II) Loading /usr/lib/xorg/modules/drivers/modesetting_drv.so
(II) Module modesetting: vendor="X.Org Foundation"
compiled for 1.19.6, module version = 1.19.6
Module class: X.Org Video Driver
ABI class: X.Org Video Driver, version 23.0
(II) LoadModule: "fbdev"
(II) Loading /usr/lib/xorg/modules/drivers/fbdev_drv.so
(II) Module fbdev: vendor="X.Org Foundation"
compiled for 1.19.3, module version = 0.4.4
Module class: X.Org Video Driver
ABI class: X.Org Video Driver, version 23.0
(II) LoadModule: "vesa"
(II) Loading /usr/lib/xorg/modules/drivers/vesa_drv.so
(II) Module vesa: vendor="X.Org Foundation"
compiled for 1.19.3, module version = 2.3.4
Module class: X.Org Video Driver
ABI class: X.Org Video Driver, version 23.0
(II) NVIDIA dlloader X Driver 465.19.01 Fri Mar 19 07:56:53 UTC 2021
(II) NVIDIA Unified Driver for all Supported NVIDIA GPUs
(II) NOUVEAU driver Date: Fri Apr 21 14:41:17 2017 -0400
(II) NOUVEAU driver for NVIDIA chipset families :
RIVA TNT (NV04)
RIVA TNT2 (NV05)
GeForce 256 (NV10)
GeForce 2 (NV11, NV15)
GeForce 4MX (NV17, NV18)
GeForce 3 (NV20)
GeForce 4Ti (NV25, NV28)
GeForce FX (NV3x)
GeForce 6 (NV4x)
GeForce 7 (G7x)
GeForce 8 (G8x)
GeForce GTX 200 (NVA0)
GeForce GTX 400 (NVC0)
(II) modesetting: Driver for Modesetting Kernel Drivers: kms
(II) FBDEV: driver for framebuffer: fbdev
(II) VESA: driver for VESA chipsets: vesa
(EE) NVIDIA: Failed to initialize the NVIDIA kernel module. Please see the
(EE) NVIDIA: system's kernel log for additional error messages and
(EE) NVIDIA: consult the NVIDIA README for details.
(EE) NVIDIA: Failed to initialize the NVIDIA kernel module. Please see the
(EE) NVIDIA: system's kernel log for additional error messages and
(EE) NVIDIA: consult the NVIDIA README for details.
(EE) [drm] Failed to open DRM device for (null): -2
(EE) [drm] Failed to open DRM device for (null): -2
(EE) [drm] Failed to open DRM device for pci:0000:00:04.0: -2
(EE) [drm] Failed to open DRM device for pci:0000:00:04.0: -2
(WW) Falling back to old probe method for modesetting
(WW) Falling back to old probe method for fbdev
(II) Loading sub module "fbdevhw"
(II) LoadModule: "fbdevhw"
(II) Loading /usr/lib/xorg/modules/libfbdevhw.so
(II) Module fbdevhw: vendor="X.Org Foundation"
compiled for 1.19.6, module version = 0.0.2
ABI class: X.Org Video Driver, version 23.0
(EE) open /dev/fb0: No such file or directory
(WW) Falling back to old probe method for vesa
(EE) [drm] Failed to open DRM device for (null): -2
(EE) Screen 0 deleted because of no matching config section.
(II) UnloadModule: "modesetting"
(EE) Device(s) detected, but none match those in the config file.
(EE)
Fatal server error:
(EE) no screens found(EE)
(EE)
Please consult the The X.Org Foundation support
at http://wiki.x.org
for help.
(EE) Please also check the log file at "/var/log/Xorg.0.log" for additional information.
(EE)
(EE) Server terminated with error (1). Closing log file.
I regenerated xorg.conf with nvidia-xconfig, but it made no difference.
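Because the full output is long, a small filter that keeps only the error lines and the NVIDIA driver banner can make it easier to see what fails first. The sketch below is a hypothetical helper: the function names `xorg_errors` and `xorg_nvidia_version` are made up for illustration and are not part of Xorg or the NVIDIA tools. It assumes either terminal output like the one above or the log file /var/log/Xorg.0.log, whose lines may carry a leading `[ timestamp]` prefix.

```shell
#!/bin/sh
# Hypothetical helpers for reading Xorg output; the names are illustrative only.

# Print only the error lines: those whose marker is "(EE)", optionally
# preceded by the "[ 12.345] " timestamp that Xorg writes in its log file.
xorg_errors() {
    grep -E '^(\[[ 0-9.]+\] )?\(EE\)' "$1"
}

# Print the version of the NVIDIA X driver module, taken from the first
# "NVIDIA dlloader X Driver <version>" banner line in the file.
xorg_nvidia_version() {
    sed -n 's/^.*NVIDIA dlloader X Driver  *\([0-9.]*\).*/\1/p' "$1" | head -n 1
}

# For comparison, the driver version seen by nvidia-smi can be queried with:
#   nvidia-smi --query-gpu=driver_version --format=csv,noheader
```

In the output above, the X driver banner reports 465.19.01 while nvidia-smi reports driver version 450.102.04. Whether that difference is related to the "Failed to initialize the NVIDIA kernel module" messages is an assumption, not something the log states, but it may be worth checking.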
Answer 1
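To use headless Xorg (i.e. without a monitor) on Ubuntu 18.04, you need to install the dummy video driver. There is a related question about this.
In short, install the xserver-xorg-video-dummy package and include the dummy driver in your xorg.conf: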
Section "Device"
    Identifier "Configured Video Device"
    Driver "dummy"
EndSection

Section "Monitor"
    Identifier "Configured Monitor"
    HorizSync 31.5-48.5
    VertRefresh 50-70
EndSection

Section "Screen"
    Identifier "Default Screen"
    Monitor "Configured Monitor"
    Device "Configured Video Device"
    DefaultDepth 24
    SubSection "Display"
        Depth 24
        Modes "1024x800"
    EndSubSection
EndSection
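The steps above could be scripted roughly as follows. This is a minimal sketch, assuming a Debian/Ubuntu based container with apt-get available and a root shell. The function name `write_dummy_xorg_conf`, the default config path and the display number `:0` are illustrative choices, not something taken from the answer.

```shell
#!/bin/sh
# Sketch only: write the dummy-driver configuration shown above to a file.
# The function name and the default path are assumptions for illustration.
write_dummy_xorg_conf() {
    conf_path="${1:-/etc/X11/xorg.conf}"
    cat > "$conf_path" <<'EOF'
Section "Device"
    Identifier "Configured Video Device"
    Driver "dummy"
EndSection

Section "Monitor"
    Identifier "Configured Monitor"
    HorizSync 31.5-48.5
    VertRefresh 50-70
EndSection

Section "Screen"
    Identifier "Default Screen"
    Monitor "Configured Monitor"
    Device "Configured Video Device"
    DefaultDepth 24
    SubSection "Display"
        Depth 24
        Modes "1024x800"
    EndSubSection
EndSection
EOF
}

# Possible use inside the container (left commented out so that sourcing
# this file has no side effects):
#   apt-get update && apt-get install -y xserver-xorg-video-dummy
#   write_dummy_xorg_conf /etc/X11/xorg.conf
#   Xorg :0 -config /etc/X11/xorg.conf &
```

Writing to a path passed as an argument makes it possible to generate the file somewhere else first and inspect it before replacing an existing /etc/X11/xorg.conf, such as the one produced by nvidia-xconfig.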