Strange power consumption behaviour of a Quadro card when removing "vfio-pci" and re-attaching "nvidia"

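I built a system with a GeForce GTX 960 and a Quadro M4000 graphics card. The Quadro is usually attached to a virtual machine; the GTX 960 is used only by the host.

Normally the host cannot use the Quadro card because the kernel driver vfio-pci claims it. However, when I am not using the card in a virtual machine, I would like it to be accessible from the host, e.g. for some computations.

However, there is some very strange behaviour regarding power consumption and fan speed... How can I lower the power consumption and fan speed without needing nvidia-settings to be open all the time?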

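From my notes:

Reuse a passthrough-ready device on the host

Suppose a secondary graphics card that has been prepared for passthrough to a guest should be used on the host. The device is normally not usable on the host because the wrong driver is bound to it. Here the Quadro M4000 is using the vfio-pci driver, but it should be using the nvidia driver.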

sudo lspci -nnk | grep -E -A3 "VGA|Display|3D"
  # 0b:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM206 [GeForce GTX 960] [10de:1401] (rev a1)
  # Subsystem: Gigabyte Technology Co., Ltd Device [1458:36ac]
  # Kernel driver in use: nvidia
  # Kernel modules: nouveau, nvidia_drm, nvidia
  # --
  # 0c:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM204GL [Quadro M4000] [10de:13f1] (rev a1)
  # Subsystem: Hewlett-Packard Company Device [103c:1153]
  # Kernel driver in use: vfio-pci
  # Kernel modules: nouveau, nvidia_drm, nvidia
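
The same information can be read for a single device straight from sysfs, where a bound device has a `driver` symlink pointing at its driver. A minimal sketch, assuming a POSIX shell; the function name `pci_driver_of` and its optional sysfs-root argument are made up for illustration and are not part of any tool:

```shell
# Print the kernel driver bound to a PCI device.
# $1: full PCI address, e.g. 0000:0c:00.0
# $2: sysfs root, defaults to /sys (hypothetical parameter, only to ease testing)
# Prints "absent" if the device node does not exist, "none" if no driver is bound.
# The ( ... ) body runs in a subshell so the variable stays local.
pci_driver_of() (
    dev="${2:-/sys}/bus/pci/devices/$1"
    if [ ! -e "$dev" ]; then
        echo absent
    elif [ -L "$dev/driver" ]; then
        basename "$(readlink "$dev/driver")"
    else
        echo none
    fi
)

# Both functions of the Quadro M4000 from the listing above
pci_driver_of 0000:0c:00.0
pci_driver_of 0000:0c:00.1
```

On the system described here, the first call should print vfio-pci at this stage, matching the lspci output.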

Unload the vfio-pci driver and check the device status again. No kernel driver should be in use now, so the line Kernel driver in use: ... has disappeared.

sudo modprobe -r vfio-pci
sudo lspci -nnk | grep -E -A3 "VGA|Display|3D"
  # 0b:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM206 [GeForce GTX 960] [10de:1401] (rev a1)
  # Subsystem: Gigabyte Technology Co., Ltd Device [1458:36ac]
  # Kernel driver in use: nvidia
  # Kernel modules: nouveau, nvidia_drm, nvidia
  # --
  # 0c:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM204GL [Quadro M4000] [10de:13f1] (rev a1)
  # Subsystem: Hewlett-Packard Company Device [103c:1153]
  # Kernel modules: nouveau, nvidia_drm, nvidia
  # 0c:00.1 Audio device [0403]: NVIDIA Corporation GM204 High Definition Audio Controller [10de:0fbb] (rev a1)
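
Note that modprobe -r vfio-pci unloads the module for every device bound to it, not only the Quadro. Before running it, it may be worth listing what is currently bound. A rough sketch, assuming the usual sysfs layout in which bound devices appear as symlinks named after their PCI address inside the driver directory; `devices_bound_to` and its sysfs-root argument are hypothetical names:

```shell
# List the PCI addresses currently bound to a driver.
# $1: driver name, e.g. vfio-pci
# $2: sysfs root, defaults to /sys (hypothetical parameter, only to ease testing)
devices_bound_to() (
    dir="${2:-/sys}/bus/pci/drivers/$1"
    # The glob matches only names shaped like 0000:0c:00.0, so other
    # entries in the driver directory (bind, unbind, module, ...) are skipped.
    for entry in "$dir"/????:??:??.?; do
        if [ -L "$entry" ]; then
            basename "$entry"
        fi
    done
)

# Usage: devices_bound_to vfio-pci
```

If the list contains anything besides 0c:00.0 and 0c:00.1, unloading the module would also detach those other devices.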

Also check the output of the NVIDIA driver tool nvidia-smi. It should list only one graphics card (the GTX 960, which is not passed through).

sudo nvidia-smi 
  # Tue Sep 28 18:19:36 2021       
  # +-----------------------------------------------------------------------------+
  # | NVIDIA-SMI 470.74       Driver Version: 470.74       CUDA Version: 11.4     |
  # |-------------------------------+----------------------+----------------------+
  # | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
  # | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
  # |                               |                      |               MIG M. |
  # |===============================+======================+======================|
  # |   0  NVIDIA GeForce ...  Off  | 00000000:0B:00.0  On |                  N/A |
  # |  0%   51C    P8    19W / 160W |    477MiB /  4040MiB |      0%      Default |
  # |                               |                      |                  N/A |
  # +-------------------------------+----------------------+----------------------+
  # ...

Remove all associated PCI devices from the system. In this case they are 0c:00.0 and 0c:00.1. Then check that they are really gone.

echo 1 | sudo tee /sys/bus/pci/devices/0000\:0c\:00.0/remove
echo 1 | sudo tee /sys/bus/pci/devices/0000\:0c\:00.1/remove
sudo ls /sys/bus/pci/devices/ | grep 0c:00.
  # nothing...

Then rescan for PCI devices and check that the devices are present and enabled again. Also check which kernel driver is in use and what nvidia-smi reports.

echo 1 | sudo tee /sys/bus/pci/rescan
sudo ls /sys/bus/pci/devices/ | grep 0c:00.
sudo cat /sys/bus/pci/devices/0000\:0c\:00.?/enable
  # 1
  # 1
sudo lspci -nnk | grep -E -A3 "VGA|Display|3D"
  # 0b:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM206 [GeForce GTX 960] [10de:1401] (rev a1)
  # Subsystem: Gigabyte Technology Co., Ltd Device [1458:36ac]
  # Kernel driver in use: nvidia
  # Kernel modules: nouveau, nvidia_drm, nvidia
  # --
  # 0c:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM204GL [Quadro M4000] [10de:13f1] (rev a1)
  # Subsystem: Hewlett-Packard Company Device [103c:1153]
  # Kernel driver in use: nvidia      # <-- here!
  # Kernel modules: nouveau, nvidia_drm, nvidia
sudo nvidia-smi 
  # Tue Sep 28 18:26:16 2021       
  # +-----------------------------------------------------------------------------+
  # | NVIDIA-SMI 470.74       Driver Version: 470.74       CUDA Version: 11.4     |
  # |-------------------------------+----------------------+----------------------+
  # | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
  # | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
  # |                               |                      |               MIG M. |
  # |===============================+======================+======================|
  # |   0  NVIDIA GeForce ...  Off  | 00000000:0B:00.0  On |                  N/A |
  # |  0%   47C    P8    19W / 160W |    479MiB /  4040MiB |      0%      Default |
  # |                               |                      |                  N/A |
  # +-------------------------------+----------------------+----------------------+
  # |   1  Quadro M4000        Off  | 00000000:0C:00.0 Off |                  N/A |
  # | 45%   37C    P0    42W / 120W |      0MiB /  8127MiB |      2%      Default |
  # |                               |                      |                  N/A |
  # +-------------------------------+----------------------+----------------------+
  # ...
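
The remove and rescan steps above could be wrapped into one helper so they do not have to be retyped each time. This is only a sketch under the assumption that it is run from a root shell; `reprobe_pci` and the `SYSFS_ROOT` override are hypothetical names introduced here, and the override exists purely so the function can be tried against a scratch directory:

```shell
# Remove the given PCI functions, then rescan the bus so the kernel
# probes them again and lets a matching driver (here: nvidia) bind.
# Arguments: full PCI addresses, e.g. 0000:0c:00.0 0000:0c:00.1
# SYSFS_ROOT: hypothetical override of /sys, useful only for dry runs.
reprobe_pci() (
    root="${SYSFS_ROOT:-/sys}"
    for addr in "$@"; do
        node="$root/bus/pci/devices/$addr"
        if [ -e "$node" ]; then
            echo 1 > "$node/remove"
        else
            echo "skip $addr: not present" >&2
        fi
    done
    echo 1 > "$root/bus/pci/rescan"
)

# Usage (as root):
#   reprobe_pci 0000:0c:00.0 0000:0c:00.1
```

The function only automates the sysfs writes; checking the result with lspci and nvidia-smi as shown above is still needed.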

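Interestingly, the Quadro M4000 draws about 42 W with no load at all. I guess this is due to a driver issue...

However, as soon as the graphical program nvidia-settings is started, the power draw drops to roughly 12 W.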

# Terminal A
watch -d -n 1 sudo nvidia-smi
# Terminal B
nvidia-settings

Watch nvidia-smi and listen to the fan noise when the magic happens...

watch -d -n 1 sudo nvidia-smi
  # ...
  # +-------------------------------+----------------------+----------------------+
  # |   1  Quadro M4000        Off  | 00000000:0C:00.0 Off |                  N/A |
  # | 46%   38C    P0    10W / 120W |      0MiB /  8127MiB |      0%      Default |
  # |                               |                      |                  N/A |
  # +-------------------------------+----------------------+----------------------+
  # ...
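
To capture the change without reading the full table, the relevant fields can be queried as CSV and filtered. A small sketch, assuming rows in the order index, pstate, utilization.gpu, power.draw as produced by nvidia-smi with --format=csv,noheader,nounits; the function name `flag_idle_draw` and the 5 % utilisation cut-off are arbitrary choices made for this example:

```shell
# Read "index, pstate, utilization.gpu, power.draw" CSV rows on stdin and
# print the GPUs that draw more than $1 watts at 5 % utilisation or less.
flag_idle_draw() {
    awk -F', *' -v limit="$1" '$3 + 0 <= 5 && $4 + 0 > limit + 0 {
        printf "GPU %s: %s, %s W at %s%% load\n", $1, $2, $4, $3
    }'
}

# Usage on the real system:
#   nvidia-smi --query-gpu=index,pstate,utilization.gpu,power.draw \
#              --format=csv,noheader,nounits | flag_idle_draw 30
```

With a limit of 30 W, the idle Quadro would be reported while it sits at about 42 W and would no longer appear once nvidia-settings is running.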

On top of that, nvidia-settings does not even list my Quadro card...

[Screenshot: no Quadro card in nvidia-settings]
