Ubuntu 22.04.3: NVIDIA H100 driver not working properly

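The H100 is installed in an HPE DL560 Gen11 server. I am trying to install the driver, but the GPU is not recognized correctly. Below is what I have tried so far. Please help.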

  1. Driver from the NVIDIA website (.run file): fails with a kernel module 'nvidia.ko' error.

  2. Installing through CUDA (12.3), using the driver bundled with CUDA: fails with the same 'nvidia.ko' error.


 - nvidia-installer.log

 make[1]: Leaving directory '/usr/src/linux-headers-6.2.0-39-generic'
-> done.
-> Kernel module compilation complete.
ERROR: Unable to load the kernel module 'nvidia.ko'.  This happens most frequently when this kernel module was built against the wrong or improperly configured kernel sources, with a version of gcc that differs from the one used to build the target kernel, or if another driver, such as nouveau, is present and prevents the NVIDIA kernel module from obtaining ownership of the NVIDIA device(s), or no NVIDIA device installed in this system is supported by this NVIDIA Linux graphics driver release.

Please see the log entries 'Kernel module load error' and 'Kernel messages' at the end of the file '/var/log/nvidia-installer.log' for more information.
-> Kernel module load error: No such device
-> Kernel messages:
[65404.138694] audit: type=1400 audit(1703032481.789:74): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="nvidia_modprobe" pid=226006 comm="apparmor_parser"
[65404.179546] audit: type=1400 audit(1703032481.833:75): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="nvidia_modprobe//kmod" pid=226006 comm="apparmor_parser"
[65404.179565] audit: type=1400 audit(1703032481.833:76): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="libreoffice-xpdfimport" pid=226014 comm="apparmor_parser"
[65404.179727] audit: type=1400 audit(1703032481.833:77): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="libreoffice-senddoc" pid=226012 comm="apparmor_parser"
[65404.179741] audit: type=1400 audit(1703032481.833:78): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="libreoffice-oosplash" pid=226011 comm="apparmor_parser"
[65404.182328] audit: type=1400 audit(1703032481.833:79): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="/usr/bin/man" pid=226009 comm="apparmor_parser"
[65412.032963] kauditd_printk_skb: 20 callbacks suppressed
[65412.032967] audit: type=1400 audit(1703032489.686:100): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="/usr/bin/evince" pid=226384 comm="apparmor_parser"
[65412.052046] audit: type=1400 audit(1703032489.706:101): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="/usr/bin/evince//sanitized_helper" pid=226384 comm="apparmor_parser"
[65412.052495] audit: type=1400 audit(1703032489.706:102): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/bin/evince//snap_browsers" pid=226384 comm="apparmor_parser"
[65412.054707] audit: type=1400 audit(1703032489.706:103): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="/usr/bin/evince-previewer" pid=226384 comm="apparmor_parser"
[65412.055108] audit: type=1400 audit(1703032489.706:104): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="/usr/bin/evince-previewer//sanitized_helper" pid=226384 comm="apparmor_parser"
[65412.056295] audit: type=1400 audit(1703032489.710:105): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="/usr/bin/evince-thumbnailer" pid=226384 comm="apparmor_parser"
[65781.378193] VFIO - User Level meta-driver version: 0.3
[65781.561750] nvidia-nvlink: Nvlink Core is being initialized, major device number 510

[65781.563296] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
               NVRM: BAR0 is 0M @ 0x0 (PCI:0000:1a:00.0)
[65781.563309] nvidia: probe of 0000:1a:00.0 failed with error -1
[65781.563378] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
               NVRM: BAR0 is 0M @ 0x0 (PCI:0000:49:00.0)
[65781.563395] nvidia: probe of 0000:49:00.0 failed with error -1
[65781.563419] NVRM: The NVIDIA probe routine failed for 2 device(s).
[65781.563420] NVRM: None of the NVIDIA devices were initialized.
[65781.563669] nvidia-nvlink: Unregistered Nvlink Core, major device number 510
ERROR: Installation has failed.  Please see the file '/var/log/nvidia-installer.log' for details.  You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.

  3. Automatic installation with ubuntu-drivers

The installation appears to succeed, but nvidia-smi and nvidia-settings do not work.
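
The kernel messages in nvidia-installer.log above show the probe failing for both GPUs with "BAR0 is 0M @ 0x0", that is, the driver sees BAR0 with zero size at address 0x0. As a minimal sketch (the function name invalid_bar_devices is hypothetical, and the pattern assumes the exact message wording in that log), the affected PCI addresses can be pulled out of the kernel log like this:

```shell
# Print the PCI address of every device for which the NVIDIA driver
# reported "BAR0 is 0M @ 0x0". Reads kernel messages on stdin and
# prints each address once, sorted.
invalid_bar_devices() {
    sed -n 's/.*NVRM: BAR0 is 0M @ 0x0 (PCI:\([0-9a-fA-F:.]*\)).*/\1/p' | sort -u
}
```

For example, `sudo dmesg | invalid_bar_devices` should list 0000:1a:00.0 and 0000:49:00.0 on this system, provided the messages are still in the kernel ring buffer. `sudo lspci -vv -s 0000:1a:00.0` then shows the memory regions the kernel assigned to that device.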


root@fnhai1-ProLiant-DL560-Gen11:/home/fnh-ai1/Desktop# dpkg -l |grep nvidia
ii  libnvidia-cfg1-535:amd64                   535.129.03-0ubuntu0.22.04.1             amd64        NVIDIA binary OpenGL/GLX configuration library
ii  libnvidia-common-535                       535.129.03-0ubuntu0.22.04.1             all          Shared files used by the NVIDIA libraries
ii  libnvidia-compute-535:amd64                535.129.03-0ubuntu0.22.04.1             amd64        NVIDIA libcompute package
ii  libnvidia-compute-535:i386                 535.129.03-0ubuntu0.22.04.1             i386         NVIDIA libcompute package
ii  libnvidia-decode-535:amd64                 535.129.03-0ubuntu0.22.04.1             amd64        NVIDIA Video Decoding runtime libraries
ii  libnvidia-decode-535:i386                  535.129.03-0ubuntu0.22.04.1             i386         NVIDIA Video Decoding runtime libraries
ii  libnvidia-encode-535:amd64                 535.129.03-0ubuntu0.22.04.1             amd64        NVENC Video Encoding runtime library
ii  libnvidia-encode-535:i386                  535.129.03-0ubuntu0.22.04.1             i386         NVENC Video Encoding runtime library
ii  libnvidia-extra-535:amd64                  535.129.03-0ubuntu0.22.04.1             amd64        Extra libraries for the NVIDIA driver
ii  libnvidia-fbc1-535:amd64                   535.129.03-0ubuntu0.22.04.1             amd64        NVIDIA OpenGL-based Framebuffer Capture runtime library
ii  libnvidia-fbc1-535:i386                    535.129.03-0ubuntu0.22.04.1             i386         NVIDIA OpenGL-based Framebuffer Capture runtime library
ii  libnvidia-gl-535:amd64                     535.129.03-0ubuntu0.22.04.1             amd64        NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
ii  libnvidia-gl-535:i386                      535.129.03-0ubuntu0.22.04.1             i386         NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
ii  linux-modules-nvidia-535-6.2.0-39-generic  6.2.0-39.40~22.04.1                     amd64        Linux kernel nvidia modules for version 6.2.0-39
ii  linux-modules-nvidia-535-generic-hwe-22.04 6.2.0-39.40~22.04.1                     amd64        Extra drivers for nvidia-535 for the generic-hwe-22.04 flavour
ii  linux-objects-nvidia-535-6.2.0-39-generic  6.2.0-39.40~22.04.1                     amd64        Linux kernel nvidia modules for version 6.2.0-39 (objects)
ii  linux-signatures-nvidia-6.2.0-39-generic   6.2.0-39.40~22.04.1                     amd64        Linux kernel signatures for nvidia modules for version 6.2.0-39-generic
ii  nvidia-compute-utils-535                   535.129.03-0ubuntu0.22.04.1             amd64        NVIDIA compute utilities
ii  nvidia-driver-535                          535.129.03-0ubuntu0.22.04.1             amd64        NVIDIA driver metapackage
ii  nvidia-firmware-535-535.129.03             535.129.03-0ubuntu0.22.04.1             amd64        Firmware files used by the kernel module
ii  nvidia-kernel-common-535                   535.129.03-0ubuntu0.22.04.1             amd64        Shared files used with the kernel module
ii  nvidia-kernel-source-535                   535.129.03-0ubuntu0.22.04.1             amd64        NVIDIA kernel source package
ii  nvidia-prime                               0.8.17.1                                all          Tools to enable NVIDIA's Prime
ii  nvidia-settings                            510.47.03-0ubuntu1                      amd64        Tool for configuring the NVIDIA graphics driver
ii  nvidia-utils-535                           535.129.03-0ubuntu0.22.04.1             amd64        NVIDIA driver support binaries
ii  screen-resolution-extra                    0.18.2                                  all          Extension for the nvidia-settings control panel
ii  xserver-xorg-video-nvidia-535              535.129.03-0ubuntu0.22.04.1             amd64        NVIDIA binary Xorg driver

root@fnhai1-ProLiant-DL560-Gen11:/# nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

root@fnhai1-ProLiant-DL560-Gen11:~# lshw -C display
  *-display                 
       description: VGA compatible controller
       product: MGA G200eH3
       vendor: Matrox Electronics Systems Ltd.
       physical id: 0.1
       bus info: pci@0000:01:00.1
       logical name: /dev/fb0
       version: 03
       width: 32 bits
       clock: 33MHz
       capabilities: pm msi pciexpress vga_controller bus_master cap_list rom fb
       configuration: depth=32 driver=mgag200 latency=0 resolution=1024,768
       resources: irq:17 memory:98000000-98ffffff memory:99b98000-99b9bfff memory:99000000-997fffff memory:c0000-dffff
  *-display UNCLAIMED
       description: 3D controller
       product: GH100 [H100 PCIe]
       vendor: NVIDIA Corporation
       physical id: 0
       bus info: pci@0000:1a:00.0
       version: a1
       width: 64 bits
       clock: 33MHz
       capabilities: pm msi pciexpress msix cap_list
       configuration: latency=0
       resources: iomemory:20a00-209ff iomemory:20800-207ff iomemory:20a00-209ff
  *-display UNCLAIMED
       description: 3D controller
       product: GH100 [H100 PCIe]
       vendor: NVIDIA Corporation
       physical id: 0
       bus info: pci@0000:49:00.0
       version: a1
       width: 64 bits
       clock: 33MHz
       capabilities: pm msi pciexpress msix cap_list
       configuration: latency=0
       resources: iomemory:22200-221ff iomemory:22000-21fff iomemory:22200-221ff
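
lshw marks both H100 entries as UNCLAIMED, which means no kernel driver is bound to them. A sketch for checking the same thing directly from sysfs, assuming the standard /sys/bus/pci/devices layout (the function name pci_bound_driver and its optional second argument are hypothetical):

```shell
# Print the kernel driver bound to a PCI device, or "none" if it is unclaimed.
# $1: PCI address, e.g. 0000:1a:00.0
# $2: sysfs devices directory (defaults to the real one; overridable for testing)
pci_bound_driver() {
    local addr="$1"
    local root="${2:-/sys/bus/pci/devices}"
    local link="$root/$addr/driver"
    if [ -L "$link" ]; then
        # The "driver" entry is a symlink whose last path component is the driver name.
        basename "$(readlink "$link")"
    else
        echo "none"
    fi
}
```

Running `pci_bound_driver 0000:1a:00.0` and `pci_bound_driver 0000:49:00.0` after each installation attempt gives a quick way to see whether the nvidia module has actually claimed the cards.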

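I have also tried reinstalling the kernel headers for the running kernel (uname -r), blacklisting nouveau, and removing and reinstalling all installed drivers, but none of it helped.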
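
The post does not show the exact commands used for these steps. A minimal sketch of how the header reinstall and the nouveau blacklist are commonly done on Ubuntu follows; the function name write_nouveau_blacklist and the file name blacklist-nouveau.conf are assumptions, not taken from the post:

```shell
# Write a modprobe configuration that prevents the nouveau driver from loading.
# $1: output file (defaults to a conventional name under /etc/modprobe.d;
#     the file name is an assumption, any *.conf name in that directory works)
write_nouveau_blacklist() {
    local conf="${1:-/etc/modprobe.d/blacklist-nouveau.conf}"
    printf '%s\n' 'blacklist nouveau' 'options nouveau modeset=0' > "$conf"
}

# Typical sequence (run as root; shown as comments, not executed here):
#   apt install --reinstall "linux-headers-$(uname -r)"
#   write_nouveau_blacklist
#   update-initramfs -u
#   reboot
```

After the reboot, `lsmod | grep nouveau` printing nothing indicates that the blacklist took effect.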

I am happy to answer any questions you may have.

Answer 1

Try booting a different version of the Linux kernel.

At system startup, press ESC, choose "Advanced options for Ubuntu", and select another kernel, for example 6.2.0-26.
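
Only kernels that are actually installed appear in that menu. As a rough sketch, assuming kernel images live in /boot as vmlinuz-<version> (the Ubuntu default) and using a hypothetical function name installed_kernels, the available versions can be listed before rebooting:

```shell
# List installed kernel versions by scanning for vmlinuz-<version> images.
# $1: boot directory (defaults to /boot; overridable for testing)
installed_kernels() {
    local bootdir="${1:-/boot}"
    local image
    for image in "$bootdir"/vmlinuz-*; do
        # Skip the unexpanded pattern when no image matches.
        [ -e "$image" ] || continue
        # Strip everything up to and including "vmlinuz-" to leave the version.
        echo "${image##*/vmlinuz-}"
    done | sort -V
}
```

After rebooting into the chosen entry, `uname -r` confirms which kernel is running. If the desired version is not in the list, it has to be installed before it can be selected.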
