GPU (RTX 3090) passthrough does not work on Ubuntu 18.04 under VMware ESXi

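An RTX 2080 or RTX 3070 works fine on VMware, and GPU passthrough works with those cards. However, I have spent a week trying to get my graphics card (RTX 3090) working on the VMware host.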

Specs:

  • Dell Precision 3650 Tower
  • Core i9-10900K
  • 48GB RAM
  • NVIDIA GeForce RTX 3090 (24GB)
  • VMware ESXi 7.0 U2 (vSphere)
  • Ubuntu 18.04 (guest OS)

I noticed that the RTX 3090 has 24GB of memory, so the VM needs EFI boot and some additional parameters.
https://kb.vmware.com/s/article/2142307
https://blogs.vmware.com/apps/2018/10/how-to-enable-nvidia-v100-gpu-in-passthrough-mode-on-vsphere-for-machine-learning-and-other-hpc-workloads.html

Additional parameters (VMware)
hypervisor.cpuid.v0=FALSE
pciPassthru.use64bitMMIO=TRUE
pciPassthru.64bitMMIOSizeGB=64
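
For reference, this is how I understand the sizing of pciPassthru.64bitMMIOSizeGB. It is only a sketch under an assumption: as far as I can tell from the VMware blog post linked above, the rule of thumb is to add up the memory of all passed-through GPUs (in GB) and round up to the next power of two. The helper name mmio_size_gb is my own and not part of any VMware tool.

```python
def mmio_size_gb(gpu_mem_gb):
    """Rule-of-thumb value for pciPassthru.64bitMMIOSizeGB (hypothetical helper).

    Assumption: sum the memory of all passed-through GPUs in GB, then take
    the next power of two strictly greater than that sum.
    """
    total = sum(gpu_mem_gb)
    size = 1
    while size <= total:  # keep doubling until strictly above the total
        size *= 2
    return size


if __name__ == "__main__":
    # A single RTX 3090 with 24GB, as in this VM.
    size = mmio_size_gb([24])
    print("pciPassthru.use64bitMMIO=TRUE")
    print(f"pciPassthru.64bitMMIOSizeGB={size}")
```

By this rule a single 24GB card would need 32, so the 64 I set should already be on the generous side, assuming a larger value does no harm.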

Host BIOS
memory mapped I/O above 4GB=Enabled

But it had no effect. By the way, the graphics card works fine under Windows (though not in a VM).

# journalctl -ex
Oct 15 15:45:08 ubuntu kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 242
Oct 15 15:45:08 ubuntu kernel: nvidia 0000:0b:00.0: vgaarb: changed VGA decodes: olddecodes=none,decodes=none:owns=none
Oct 15 15:45:08 ubuntu kernel: NVRM: The NVIDIA GPU 0000:0b:00.0 (PCI ID: 10de:2204)
                               NVRM: installed in this system is not supported by the
                               NVRM: NVIDIA 470.63.01 driver release.
                               NVRM: Please see 'Appendix A - Supported NVIDIA GPU Products'
                               NVRM: in this release's README, available on the operating system
                               NVRM: specific graphics driver download page at www.nvidia.com.
Oct 15 15:45:08 ubuntu kernel: nvidia: probe of 0000:0b:00.0 failed with error -1
Oct 15 15:45:08 ubuntu kernel: NVRM: The NVIDIA probe routine failed for 1 device(s).
Oct 15 15:45:08 ubuntu kernel: NVRM: None of the NVIDIA devices were initialized.
Oct 15 15:45:08 ubuntu kernel: nvidia-nvlink: Unregistered the Nvlink Core, major device number 242
Oct 15 15:45:08 ubuntu kernel: PKCS#7 signature not signed with a trusted key

# apt-get install nvidia-driver-470
Reading package lists... Done
Building dependency tree
Reading state information... Done
nvidia-driver-470 is already the newest version (470.63.01-0ubuntu0.18.04.2).

# uname -a
Linux ubuntu 4.15.0-159-generic #167-Ubuntu SMP Tue Sep 21 08:55:05 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

# prime-select query
nvidia

# lspci | grep -i nvidia
0b:00.0 VGA compatible controller: NVIDIA Corporation Device 2204 (rev a1)

# nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

I do not know why I cannot use the RTX 3090.


Additional information:

# grep -E "vmx|svm" /proc/cpuinfo
Nothing is displayed.

# cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-4.15.0-159-generic root=/dev/mapper/ubuntu--vg-ubuntu--lv ro

# dmesg | grep -e DMAR
Nothing is displayed.

# journalctl -b -k | grep DMAR
Nothing is displayed.

# lspci -vvv
0b:00.0 VGA compatible controller: NVIDIA Corporation Device 2204 (rev a1) (prog-if 00 [VGA controller])
        Subsystem: Dell Device 3880
        Physical Slot: 192
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 64, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 19
        Region 0: Memory at fb000000 (32-bit, non-prefetchable) [size=16M]
        Region 1: Memory at c0000000 (64-bit, prefetchable) [size=256M]
        Region 3: Memory at d0000000 (64-bit, prefetchable) [size=32M]
        Region 5: I/O ports at 3000 [size=128]
        Capabilities: [60] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold-)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
                Address: 0000000000000000  Data: 0000
        Capabilities: [78] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
                        ExtTag- AttnBtn- AttnInd- PwrInd- RBE- FLReset- SlotPowerLimit 0.000W
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                        RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
                        MaxPayload 128 bytes, MaxReadReq 128 bytes
                DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
                LnkCap: Port #0, Speed 5GT/s, Width x32, ASPM L0s, Exit Latency L0s <64ns, L1 <1us
                        ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp-
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 5GT/s, Width x32, TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Not Supported, TimeoutDis-, LTR-, OBFF Not Supported
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
                LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
                         EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
        Capabilities: [b4] Vendor Specific Information: Len=14 <?>
        Capabilities: [100 v1] Virtual Channel
                Caps:   LPEVC=0 RefClk=100ns PATEntryBits=1
                Arb:    Fixed- WRR32- WRR64- WRR128-
                Ctrl:   ArbSelect=Fixed
                Status: InProgress-
                VC0:    Caps:   PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
                        Arb:    Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
                        Ctrl:   Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
                        Status: NegoPending- InProgress-
        Capabilities: [250 v1] Latency Tolerance Reporting
                Max snoop latency: 71680ns
                Max no snoop latency: 71680ns
        Capabilities: [258 v1] L1 PM Substates
                L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
                          PortCommonModeRestoreTime=255us PortTPowerOnTime=10us
                L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
                           T_CommonMode=0us LTR1.2_Threshold=0ns
                L1SubCtl2: T_PwrOn=10us
        Capabilities: [128 v1] Power Budgeting <?>
        Capabilities: [420 v2] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr+ BadTLP+ BadDLLP+ Rollover- Timeout+ NonFatalErr+
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
        Capabilities: [600 v1] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
        Capabilities: [900 v1] #19
        Capabilities: [bb0 v1] #15
        Capabilities: [c1c v1] #26
        Capabilities: [d00 v1] #27
        Capabilities: [e00 v1] #25
        Kernel driver in use: nvidia
        Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
