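For some reason, I can only enable the IOMMU on my arm64 board with iommu.passthrough. It looks like I successfully passed the Tesla P4 through to my guest.
However, after installing the Nvidia vGPU driver, nvidia-smi on the guest reports "No devices were found".
In addition, while the guest boots, dmesg on the host reports SMMU errors such as "arm-smmu-v3 arm-smmu-v3.0.auto: CMD_SYNC timeout at ***" (full output attached at the end of this message).
I also installed the Nvidia driver on the host, and there nvidia-smi prints the GPU information correctly.
My host system is Ubuntu 22.04, and I have tried Ubuntu 20.04 and Ubuntu 22.10 as guest systems.
What is going wrong in my guest, and what should I do to fix it?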
root@thinkforce:~# dmesg | tail -50
[ 54.124325] input: OpenBMC virtual_input as /devices/platform/PNP0D10:00/usb1/1-1/1-1.2/1-1.2.4/1-1.2.4:1.0/0003:1D6B:0104.0001/input/input0
[ 54.184912] hid-generic 0003:1D6B:0104.0001: input,hidraw0: USB HID v1.01 Keyboard [OpenBMC virtual_input] on usb-PNP0D10:00-1.2.4/input0
[ 54.185004] input: OpenBMC virtual_input as /devices/platform/PNP0D10:00/usb1/1-1/1-1.2/1-1.2.4/1-1.2.4:1.1/0003:1D6B:0104.0002/input/input1
[ 54.185073] hid-generic 0003:1D6B:0104.0002: input,hidraw1: USB HID v1.01 Mouse [OpenBMC virtual_input] on usb-PNP0D10:00-1.2.4/input1
[ 54.208841] input: OpenBMC virtual_input as /devices/platform/PNP0D10:00/usb1/1-1/1-1.2/1-1.2.4/1-1.2.4:1.2/0003:1D6B:0104.0003/input/input2
[ 54.268850] hid-generic 0003:1D6B:0104.0003: input,hidraw2: USB HID v1.01 Device [OpenBMC virtual_input] on usb-PNP0D10:00-1.2.4/input2
[ 127.924120] kauditd_printk_skb: 38 callbacks suppressed
[ 127.924125] audit: type=1400 audit(1700737029.351:50): apparmor="DENIED" operation="capable" profile="/usr/sbin/cupsd" pid=1170 comm="cupsd" capability=12 capname="net_admin"
[ 128.058433] audit: type=1400 audit(1700737029.484:51): apparmor="DENIED" operation="capable" profile="/usr/sbin/cups-browsed" pid=1239 comm="cups-browsed" capability=23 capname="sys_nice"
[ 128.089347] bridge: filtering via arp/ip/ip6tables is no longer available by default. Update your scripts to load br_netfilter if you need this.
[ 128.171676] audit: type=1400 audit(1700737029.600:52): apparmor="STATUS" operation="profile_load" profile="unconfined" name="docker-default" pid=1346 comm="apparmor_parser"
[ 128.219844] Bridge firewalling registered
[ 128.289482] Initializing XFRM netlink socket
[ 139.590270] loop5: detected capacity change from 0 to 8
[ 141.835477] rfkill: input handler disabled
[ 159.003540] systemd-journald[723]: File /var/log/journal/916b199219734366901b8d667f583037/user-1000.journal corrupted or uncleanly shut down, renaming and replacing.
[ 159.136629] rfkill: input handler enabled
[ 160.644572] rfkill: input handler disabled
[ 224.527998] audit: type=1400 audit(1700737125.972:53): apparmor="STATUS" operation="profile_load" profile="unconfined" name="libvirt-b85ad22b-4908-49d6-a080-3d6e61e077b1" pid=3029 comm="apparmor_parser"
[ 224.700728] audit: type=1400 audit(1700737126.144:54): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="libvirt-b85ad22b-4908-49d6-a080-3d6e61e077b1" pid=3032 comm="apparmor_parser"
[ 224.865398] audit: type=1400 audit(1700737126.308:55): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="libvirt-b85ad22b-4908-49d6-a080-3d6e61e077b1" pid=3036 comm="apparmor_parser"
[ 225.046268] audit: type=1400 audit(1700737126.488:56): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="libvirt-b85ad22b-4908-49d6-a080-3d6e61e077b1" pid=3040 comm="apparmor_parser"
[ 225.069258] virbr0: port 1(vnet0) entered blocking state
[ 225.069263] virbr0: port 1(vnet0) entered disabled state
[ 225.069329] device vnet0 entered promiscuous mode
[ 225.069529] virbr0: port 1(vnet0) entered blocking state
[ 225.069532] virbr0: port 1(vnet0) entered listening state
[ 225.248074] audit: type=1400 audit(1700737126.692:57): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="libvirt-b85ad22b-4908-49d6-a080-3d6e61e077b1" pid=3065 comm="apparmor_parser"
[ 226.602689] arm-smmu-v3 arm-smmu-v3.0.auto: CMD_SYNC timeout at 0x00000018 [hwprod 0x00000019, hwcons 0x00000017]
[ 227.094879] virbr0: port 1(vnet0) entered learning state
[ 227.667321] arm-smmu-v3 arm-smmu-v3.0.auto: CMD_SYNC timeout at 0x0000001a [hwprod 0x0000001b, hwcons 0x00000017]
[ 228.731951] arm-smmu-v3 arm-smmu-v3.0.auto: CMD_SYNC timeout at 0x0000001c [hwprod 0x0000001d, hwcons 0x00000017]
[ 229.110812] virbr0: port 1(vnet0) entered forwarding state
[ 229.110816] virbr0: topology change detected, propagating
[ 229.796583] arm-smmu-v3 arm-smmu-v3.0.auto: CMD_SYNC timeout at 0x0000001e [hwprod 0x0000001f, hwcons 0x00000017]
[ 230.861217] arm-smmu-v3 arm-smmu-v3.0.auto: CMD_SYNC timeout at 0x00000021 [hwprod 0x00000022, hwcons 0x00000017]
[ 231.925848] arm-smmu-v3 arm-smmu-v3.0.auto: CMD_SYNC timeout at 0x00000023 [hwprod 0x00000024, hwcons 0x00000017]
[ 232.990480] arm-smmu-v3 arm-smmu-v3.0.auto: CMD_SYNC timeout at 0x00000025 [hwprod 0x00000026, hwcons 0x00000017]
[ 234.055113] arm-smmu-v3 arm-smmu-v3.0.auto: CMD_SYNC timeout at 0x00000027 [hwprod 0x00000028, hwcons 0x00000017]
[ 234.910430] vfio-pci 0000:01:00.0: enabling device (0000 -> 0002)
[ 236.050405] vfio-pci 0000:01:00.0: vfio_ecap_init: hiding ecap 0x19@0x900
[ 238.762895] arm-smmu-v3 arm-smmu-v3.0.auto: CMD_SYNC timeout at 0x000000a8 [hwprod 0x000000a9, hwcons 0x00000017]
[ 239.827524] arm-smmu-v3 arm-smmu-v3.0.auto: CMD_SYNC timeout at 0x000000aa [hwprod 0x000000ab, hwcons 0x00000017]
[ 240.960070] arm-smmu-v3 arm-smmu-v3.0.auto: CMD_SYNC timeout at 0x000000b3 [hwprod 0x000000b4, hwcons 0x00000017]
[ 242.024700] arm-smmu-v3 arm-smmu-v3.0.auto: CMD_SYNC timeout at 0x000000b5 [hwprod 0x000000b6, hwcons 0x00000017]
[ 243.090891] arm-smmu-v3 arm-smmu-v3.0.auto: CMD_SYNC timeout at 0x00000136 [hwprod 0x00000137, hwcons 0x00000017]
[ 244.155521] arm-smmu-v3 arm-smmu-v3.0.auto: CMD_SYNC timeout at 0x00000138 [hwprod 0x00000139, hwcons 0x00000017]
[ 245.220811] arm-smmu-v3 arm-smmu-v3.0.auto: CMD_SYNC timeout at 0x00000149 [hwprod 0x0000014a, hwcons 0x00000017]
[ 246.285441] arm-smmu-v3 arm-smmu-v3.0.auto: CMD_SYNC timeout at 0x0000014b [hwprod 0x0000014c, hwcons 0x00000017]
[ 293.889100] hrtimer: interrupt took 2921 ns
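To make the pattern in the log above easier to see, here is a minimal Python sketch that extracts the hwprod and hwcons values from the CMD_SYNC timeout lines. The function name summarize_cmd_sync and the three sample lines (copied from the log) are only illustrative. In the log, hwcons is 0x00000017 in every timeout line while hwprod keeps growing, which would be consistent with the SMMU no longer consuming commands from its queue; that reading is an assumption, not something the log proves.

```python
import re

# Matches the arm-smmu-v3 timeout lines in the format shown in the dmesg output above.
CMD_SYNC_RE = re.compile(
    r"CMD_SYNC timeout at (0x[0-9a-fA-F]+) "
    r"\[hwprod (0x[0-9a-fA-F]+), hwcons (0x[0-9a-fA-F]+)\]"
)

def summarize_cmd_sync(log_text):
    """Return (count, set of hwcons values, first hwprod, last hwprod)."""
    prods, cons = [], []
    for line in log_text.splitlines():
        m = CMD_SYNC_RE.search(line)
        if m:
            prods.append(int(m.group(2), 16))
            cons.append(int(m.group(3), 16))
    if not prods:
        return 0, set(), None, None
    return len(prods), set(cons), prods[0], prods[-1]

# Three lines copied from the log above, used as sample input.
sample = """\
[ 226.602689] arm-smmu-v3 arm-smmu-v3.0.auto: CMD_SYNC timeout at 0x00000018 [hwprod 0x00000019, hwcons 0x00000017]
[ 227.667321] arm-smmu-v3 arm-smmu-v3.0.auto: CMD_SYNC timeout at 0x0000001a [hwprod 0x0000001b, hwcons 0x00000017]
[ 246.285441] arm-smmu-v3 arm-smmu-v3.0.auto: CMD_SYNC timeout at 0x0000014b [hwprod 0x0000014c, hwcons 0x00000017]
"""

count, cons_values, first_prod, last_prod = summarize_cmd_sync(sample)
print(count, sorted(hex(c) for c in cons_values), hex(first_prod), hex(last_prod))
# prints: 3 ['0x17'] 0x19 0x14c
```

Feeding it the full output of dmesg instead of the sample shows whether hwcons ever changes across the whole boot.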
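Update 1:
I have a question about a statement I have seen in many tutorials:
"An IOMMU group is the smallest set of physical devices that can be passed to a virtual machine."
In my case, however, you can see that the PCI bridge and the Nvidia card are in the same IOMMU group: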
IOMMU Group 0:
00:00.0 PCI bridge [0604]: Device [1ee5:0100] (rev 01)
01:00.0 3D controller [0302]: NVIDIA Corporation GP104GL [Tesla P4] [10de:1bb3] (rev a1)
IOMMU Group 1:
a0:00.0 PCI bridge [0604]: Device [1ee5:0100] (rev 01)
a1:00.0 PCI bridge [0604]: ASMedia Technology Inc. Device [1b21:1806] (rev 01)
a2:00.0 PCI bridge [0604]: ASMedia Technology Inc. Device [1b21:1806] (rev 01)
a2:06.0 PCI bridge [0604]: ASMedia Technology Inc. Device [1b21:1806] (rev 01)
a2:0e.0 PCI bridge [0604]: ASMedia Technology Inc. Device [1b21:1806] (rev 01)
a3:00.0 Ethernet controller [0200]: Intel Corporation I350 Gigabit Network Connection [8086:1521] (rev 01)
a3:00.1 Ethernet controller [0200]: Intel Corporation I350 Gigabit Network Connection [8086:1521] (rev 01)
a4:00.0 PCI bridge [0604]: ASPEED Technology, Inc. AST1150 PCI-to-PCI Bridge [1a03:1150] (rev 04)
IOMMU Group 2:
c0:00.0 PCI bridge [0604]: Device [1ee5:0100] (rev 01)
c1:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd NVMe SSD Controller PM9A1/PM9A3/980PRO [144d:a80a]
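Does this mean I should pass both the PCI bridge and the GPU through to the virtual machine? It seems, though, that non-endpoint devices cannot be passed through to KVM. Could this be related to the problem I hit when passing through the GPU?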
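For reference, a grouping like the one listed above can be reproduced by walking /sys/kernel/iommu_groups on the host. The following is a minimal Python sketch under that assumption; list_iommu_groups is a hypothetical helper name, and it prints only PCI addresses, not the device descriptions that lspci adds.

```python
import os

def list_iommu_groups(root="/sys/kernel/iommu_groups"):
    """Map each IOMMU group number to the sorted device addresses it contains."""
    groups = {}
    if not os.path.isdir(root):
        return groups  # IOMMU disabled, or sysfs layout not available
    for name in os.listdir(root):
        devices_dir = os.path.join(root, name, "devices")
        # Group directories are named by number; each has a "devices" subdirectory.
        if name.isdigit() and os.path.isdir(devices_dir):
            groups[int(name)] = sorted(os.listdir(devices_dir))
    return groups

if __name__ == "__main__":
    for group, devices in sorted(list_iommu_groups().items()):
        print(f"IOMMU Group {group}:")
        for dev in devices:
            print(f"  {dev}")
```

Any group that contains more than one address, like Group 0 here with the bridge at 00:00.0 and the GPU at 01:00.0, is the case the question is about.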