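Two H100 cards are installed in an HPE DL560 Gen11 server. I am trying to install the driver, but the GPUs are not recognized correctly. Below is what I have tried so far. Please help.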
- Driver from the NVIDIA website (.run file): fails with a kernel module 'nvidia.ko' error.
- Installation through CUDA (12.3), using the driver bundled with CUDA: fails with the same 'nvidia.ko' error.
- nvidia-installer.log
make[1]: Leaving directory '/usr/src/linux-headers-6.2.0-39-generic'
-> done.
-> Kernel module compilation complete.
ERROR: Unable to load the kernel module 'nvidia.ko'. This happens most frequently when this kernel module was built against the wrong or improperly configured kernel sources, with a version of gcc that differs from the one used to build the target kernel, or if another driver, such as nouveau, is present and prevents the NVIDIA kernel module from obtaining ownership of the NVIDIA device(s), or no NVIDIA device installed in this system is supported by this NVIDIA Linux graphics driver release.
Please see the log entries 'Kernel module load error' and 'Kernel messages' at the end of the file '/var/log/nvidia-installer.log' for more information.
-> Kernel module load error: No such device
-> Kernel messages:
[65404.138694] audit: type=1400 audit(1703032481.789:74): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="nvidia_modprobe" pid=226006 comm="apparmor_parser"
[65404.179546] audit: type=1400 audit(1703032481.833:75): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="nvidia_modprobe//kmod" pid=226006 comm="apparmor_parser"
[65404.179565] audit: type=1400 audit(1703032481.833:76): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="libreoffice-xpdfimport" pid=226014 comm="apparmor_parser"
[65404.179727] audit: type=1400 audit(1703032481.833:77): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="libreoffice-senddoc" pid=226012 comm="apparmor_parser"
[65404.179741] audit: type=1400 audit(1703032481.833:78): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="libreoffice-oosplash" pid=226011 comm="apparmor_parser"
[65404.182328] audit: type=1400 audit(1703032481.833:79): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="/usr/bin/man" pid=226009 comm="apparmor_parser"
[65412.032963] kauditd_printk_skb: 20 callbacks suppressed
[65412.032967] audit: type=1400 audit(1703032489.686:100): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="/usr/bin/evince" pid=226384 comm="apparmor_parser"
[65412.052046] audit: type=1400 audit(1703032489.706:101): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="/usr/bin/evince//sanitized_helper" pid=226384 comm="apparmor_parser"
[65412.052495] audit: type=1400 audit(1703032489.706:102): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/bin/evince//snap_browsers" pid=226384 comm="apparmor_parser"
[65412.054707] audit: type=1400 audit(1703032489.706:103): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="/usr/bin/evince-previewer" pid=226384 comm="apparmor_parser"
[65412.055108] audit: type=1400 audit(1703032489.706:104): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="/usr/bin/evince-previewer//sanitized_helper" pid=226384 comm="apparmor_parser"
[65412.056295] audit: type=1400 audit(1703032489.710:105): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="/usr/bin/evince-thumbnailer" pid=226384 comm="apparmor_parser"
[65781.378193] VFIO - User Level meta-driver version: 0.3
[65781.561750] nvidia-nvlink: Nvlink Core is being initialized, major device number 510
[65781.563296] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR0 is 0M @ 0x0 (PCI:0000:1a:00.0)
[65781.563309] nvidia: probe of 0000:1a:00.0 failed with error -1
[65781.563378] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR0 is 0M @ 0x0 (PCI:0000:49:00.0)
[65781.563395] nvidia: probe of 0000:49:00.0 failed with error -1
[65781.563419] NVRM: The NVIDIA probe routine failed for 2 device(s).
[65781.563420] NVRM: None of the NVIDIA devices were initialized.
[65781.563669] nvidia-nvlink: Unregistered Nvlink Core, major device number 510
ERROR: Installation has failed. Please see the file '/var/log/nvidia-installer.log' for details. You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.
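The kernel messages in this log report "BAR0 is 0M @ 0x0" for both devices, which reads as if no memory region was assigned to either card. As a rough sketch for pulling the affected PCI addresses out of a log, something like the following could be used. The function name `invalid_bar_devices` is made up for illustration, and the pattern assumes the message format shown above.

```shell
# Hypothetical helper: print the PCI address of every device for which the
# NVIDIA driver reported a zero-sized BAR at address 0x0.
# Reads kernel messages on stdin and prints each address once.
invalid_bar_devices() {
    sed -n 's/.*NVRM: BAR[0-9] is 0M @ 0x0 (PCI:\([0-9a-fA-F:.]*\)).*/\1/p' | sort -u
}

# Example usage on the affected machine (left commented out so that
# sourcing this snippet has no side effects):
#   dmesg | invalid_bar_devices
#   invalid_bar_devices < /var/log/nvidia-installer.log
```

With the addresses in hand, `lspci -vv -s 0000:1a:00.0` should show how the firmware and kernel set up the memory regions of that device.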
- Automatic installation with ubuntu-drivers
The installation appeared to succeed, but nvidia-smi and nvidia-settings do not work.
root@fnhai1-ProLiant-DL560-Gen11:/home/fnh-ai1/Desktop# dpkg -l |grep nvidia
ii libnvidia-cfg1-535:amd64 535.129.03-0ubuntu0.22.04.1 amd64 NVIDIA binary OpenGL/GLX configuration library
ii libnvidia-common-535 535.129.03-0ubuntu0.22.04.1 all Shared files used by the NVIDIA libraries
ii libnvidia-compute-535:amd64 535.129.03-0ubuntu0.22.04.1 amd64 NVIDIA libcompute package
ii libnvidia-compute-535:i386 535.129.03-0ubuntu0.22.04.1 i386 NVIDIA libcompute package
ii libnvidia-decode-535:amd64 535.129.03-0ubuntu0.22.04.1 amd64 NVIDIA Video Decoding runtime libraries
ii libnvidia-decode-535:i386 535.129.03-0ubuntu0.22.04.1 i386 NVIDIA Video Decoding runtime libraries
ii libnvidia-encode-535:amd64 535.129.03-0ubuntu0.22.04.1 amd64 NVENC Video Encoding runtime library
ii libnvidia-encode-535:i386 535.129.03-0ubuntu0.22.04.1 i386 NVENC Video Encoding runtime library
ii libnvidia-extra-535:amd64 535.129.03-0ubuntu0.22.04.1 amd64 Extra libraries for the NVIDIA driver
ii libnvidia-fbc1-535:amd64 535.129.03-0ubuntu0.22.04.1 amd64 NVIDIA OpenGL-based Framebuffer Capture runtime library
ii libnvidia-fbc1-535:i386 535.129.03-0ubuntu0.22.04.1 i386 NVIDIA OpenGL-based Framebuffer Capture runtime library
ii libnvidia-gl-535:amd64 535.129.03-0ubuntu0.22.04.1 amd64 NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
ii libnvidia-gl-535:i386 535.129.03-0ubuntu0.22.04.1 i386 NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
ii linux-modules-nvidia-535-6.2.0-39-generic 6.2.0-39.40~22.04.1 amd64 Linux kernel nvidia modules for version 6.2.0-39
ii linux-modules-nvidia-535-generic-hwe-22.04 6.2.0-39.40~22.04.1 amd64 Extra drivers for nvidia-535 for the generic-hwe-22.04 flavour
ii linux-objects-nvidia-535-6.2.0-39-generic 6.2.0-39.40~22.04.1 amd64 Linux kernel nvidia modules for version 6.2.0-39 (objects)
ii linux-signatures-nvidia-6.2.0-39-generic 6.2.0-39.40~22.04.1 amd64 Linux kernel signatures for nvidia modules for version 6.2.0-39-generic
ii nvidia-compute-utils-535 535.129.03-0ubuntu0.22.04.1 amd64 NVIDIA compute utilities
ii nvidia-driver-535 535.129.03-0ubuntu0.22.04.1 amd64 NVIDIA driver metapackage
ii nvidia-firmware-535-535.129.03 535.129.03-0ubuntu0.22.04.1 amd64 Firmware files used by the kernel module
ii nvidia-kernel-common-535 535.129.03-0ubuntu0.22.04.1 amd64 Shared files used with the kernel module
ii nvidia-kernel-source-535 535.129.03-0ubuntu0.22.04.1 amd64 NVIDIA kernel source package
ii nvidia-prime 0.8.17.1 all Tools to enable NVIDIA's Prime
ii nvidia-settings 510.47.03-0ubuntu1 amd64 Tool for configuring the NVIDIA graphics driver
ii nvidia-utils-535 535.129.03-0ubuntu0.22.04.1 amd64 NVIDIA driver support binaries
ii screen-resolution-extra 0.18.2 all Extension for the nvidia-settings control panel
ii xserver-xorg-video-nvidia-535 535.129.03-0ubuntu0.22.04.1 amd64 NVIDIA binary Xorg driver
root@fnhai1-ProLiant-DL560-Gen11:/# nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
root@fnhai1-ProLiant-DL560-Gen11:~# lshw -C display
*-display
description: VGA compatible controller
product: MGA G200eH3
vendor: Matrox Electronics Systems Ltd.
physical id: 0.1
bus info: pci@0000:01:00.1
logical name: /dev/fb0
version: 03
width: 32 bits
clock: 33MHz
capabilities: pm msi pciexpress vga_controller bus_master cap_list rom fb
configuration: depth=32 driver=mgag200 latency=0 resolution=1024,768
resources: irq:17 memory:98000000-98ffffff memory:99b98000-99b9bfff memory:99000000-997fffff memory:c0000-dffff
*-display UNCLAIMED
description: 3D controller
product: GH100 [H100 PCIe]
vendor: NVIDIA Corporation
physical id: 0
bus info: pci@0000:1a:00.0
version: a1
width: 64 bits
clock: 33MHz
capabilities: pm msi pciexpress msix cap_list
configuration: latency=0
resources: iomemory:20a00-209ff iomemory:20800-207ff iomemory:20a00-209ff
*-display UNCLAIMED
description: 3D controller
product: GH100 [H100 PCIe]
vendor: NVIDIA Corporation
physical id: 0
bus info: pci@0000:49:00.0
version: a1
width: 64 bits
clock: 33MHz
capabilities: pm msi pciexpress msix cap_list
configuration: latency=0
resources: iomemory:22200-221ff iomemory:22000-21fff iomemory:22200-221ff
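In the lshw output above, both H100 entries are marked UNCLAIMED, meaning no kernel driver is bound to them, while the Matrox adapter has `driver=mgag200`. A small sketch for listing only the unclaimed display devices is below. `unclaimed_displays` is a hypothetical name, and the parsing assumes the `lshw -C display` layout shown here.

```shell
# Hypothetical helper: read `lshw -C display` output on stdin and print the
# PCI address of every display device that lshw marks as UNCLAIMED.
unclaimed_displays() {
    awk '
        # A new "*-display" section starts; remember whether it is unclaimed.
        /^[[:space:]]*\*-display/ { unclaimed = ($0 ~ /UNCLAIMED/) }
        # Inside an unclaimed section, print the bus address without "pci@".
        unclaimed && /bus info:/ { sub(/^pci@/, "", $NF); print $NF }
    '
}

# Example usage (commented out so that sourcing this snippet is harmless):
#   lshw -C display | unclaimed_displays
```

After a successful driver load, the same command would be expected to print nothing for the two GPUs.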
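I have also tried reinstalling the kernel headers for the running kernel (`uname -r`), blacklisting nouveau, and removing and reinstalling all installed drivers, but none of that helped.
I am happy to answer any questions you may have.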
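For anyone who wants to double-check the steps mentioned above, here is a minimal sketch for verifying the nouveau blacklist. The function name `nouveau_blacklisted` is made up, and it assumes the blacklist lives in a `*.conf` file under `/etc/modprobe.d`, which is the usual location on Ubuntu.

```shell
# Hypothetical helper: succeed (exit status 0) if any *.conf file in the
# given modprobe configuration directory contains "blacklist nouveau".
# $1: directory to search (assumed default: /etc/modprobe.d)
nouveau_blacklisted() {
    grep -Eqs '^[[:space:]]*blacklist[[:space:]]+nouveau([[:space:]]|$)' "${1:-/etc/modprobe.d}"/*.conf
}

# Example usage (commented out so that sourcing this snippet is harmless):
#   nouveau_blacklisted && echo "nouveau is blacklisted"
#   lsmod | grep nouveau                          # no output if nouveau is not loaded
#   ls -d "/usr/src/linux-headers-$(uname -r)"    # headers for the running kernel
```

Note that the installer log shows the module compiling successfully and then failing at probe time, so these checks only rule out the more common causes.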
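Answer 1
Try booting a different version of the Linux kernel.
While the system is starting, press ESC, select "Advanced options for Ubuntu", and pick another kernel, for example 6.2.0-26.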
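Before rebooting, it may help to see which kernels are actually installed, since only those appear in the boot menu. A small sketch, assuming Ubuntu's `linux-image-<version>` package naming; `installed_kernels` is a hypothetical name:

```shell
# Hypothetical helper: read `dpkg -l` output on stdin and print the version
# suffix of every fully installed ("ii") versioned kernel image package.
installed_kernels() {
    awk '$1 == "ii" && $2 ~ /^linux-image-[0-9]/ {
        sub(/^linux-image-/, "", $2)   # keep only e.g. "6.2.0-39-generic"
        print $2
    }'
}

# Example usage (commented out so that sourcing this snippet is harmless):
#   dpkg -l 'linux-image-*' | installed_kernels
#   uname -r    # the kernel that is currently running
```

If the kernel you want to try is not listed, it would need to be installed first, together with its matching headers and NVIDIA module packages.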