Multiple GPUs and fans on Linux

I have two GTX 1080 Ti cards, both Founders Edition, installed in an Ubuntu 18.04 machine. I mainly use them to train neural networks.

Right now I have two main problems:

  1. Setting coolbits (even with --enable-all-gpus) lets me set fan speed and clocks only for the GPU that is connected to the monitor.

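  2. I don't want to set the fan speed statically. Instead, I want a dynamic profile that maps temperature to fan speed (%). Note that in automatic mode, under load, one of the 1080 Tis routinely reaches 89-90C, despite throttling and a roomy case... (the other 1080 Ti runs cooler... I guess not all GPUs are created equal).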

Some information about my configuration:

inxi -b
System:    Host: nimrod Kernel: 4.15.0-46-generic x86_64 bits: 64
           Desktop: Xfce 4.12.3 Distro: Ubuntu 18.04.2 LTS
Machine:   Device: desktop Mobo: FUJITSU model: D3128-B2 v: S26361-D3128-B2 serial: N/A
           UEFI: FUJITSU // American Megatrends v: V4.6.5.4 R1.8.0 for D3128-B2x date: 06/28/2018
CPU:       10 core Intel Xeon E5-2680 v2 (-MT-MCP-) speed/max: 2269/3600 MHz
Graphics:  Card-1: Advanced Micro Devices [AMD/ATI] Park [Mobility Radeon HD 5430]
           Card-2: NVIDIA GP102 [GeForce GTX 1080 Ti]
           Card-3: NVIDIA GP102 [GeForce GTX 1080 Ti]
           Display Server: x11 (X.Org 1.19.6 )
           drivers: modesetting,nvidia,ati,radeon,nouveau (unloaded: fbdev,vesa)
           Resolution: [email protected]
           OpenGL: renderer: GeForce GTX 1080 Ti/PCIe/SSE2
           version: 4.6.0 NVIDIA 415.27
Network:   Card: Intel 82579LM Gigabit Network Connection (Lewisville)
           driver: e1000e
Drives:    HDD Total Size: 2262.5GB (9.5% used)
Info:      Processes: 413 Uptime: 10 min Memory: 3677.2/96560.4MB
           Client: Shell (bash) inxi: 2.3.56 

Nvidia-smi:

Mon Mar 25 04:19:30 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 415.27       Driver Version: 415.27       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:03:00.0 Off |                  N/A |
| 23%   39C    P8    10W / 250W |      2MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:04:00.0  On |                  N/A |
| 31%   57C    P0    69W / 250W |    204MiB / 11176MiB |      2%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    1      1465      G   /usr/lib/xorg/Xorg                           201MiB |
+-----------------------------------------------------------------------------+

And finally, my xorg.conf:

# nvidia-xconfig: X configuration file generated by nvidia-xconfig
# nvidia-xconfig:  version 415.27

Section "ServerLayout"
    Identifier     "Layout0"
    Screen      0  "Screen0"
    Screen      1  "Screen1" RightOf "Screen0"
    InputDevice    "Keyboard0" "CoreKeyboard"
    InputDevice    "Mouse0" "CorePointer"
EndSection

Section "Files"
EndSection

Section "InputDevice"
    # generated from default
    Identifier     "Mouse0"
    Driver         "mouse"
    Option         "Protocol" "auto"
    Option         "Device" "/dev/psaux"
    Option         "Emulate3Buttons" "no"
    Option         "ZAxisMapping" "4 5"
EndSection

Section "InputDevice"
    # generated from default
    Identifier     "Keyboard0"
    Driver         "kbd"
EndSection

Section "Monitor"
    Identifier     "Monitor0"
    VendorName     "Unknown"
    ModelName      "Unknown"
    HorizSync       28.0 - 33.0
    VertRefresh     43.0 - 72.0
    Option         "DPMS"
EndSection

Section "Monitor"
    Identifier     "Monitor1"
    VendorName     "Unknown"
    ModelName      "Unknown"
    HorizSync       28.0 - 33.0
    VertRefresh     43.0 - 72.0
    Option         "DPMS"
EndSection

Section "Device"
    Identifier     "Device0"
    Driver         "nvidia"
    VendorName     "NVIDIA Corporation"
    BoardName      "GeForce GTX 1080 Ti"
    BusID          "PCI:3:0:0"
EndSection

Section "Device"
    Identifier     "Device1"
    Driver         "nvidia"
    VendorName     "NVIDIA Corporation"
    BoardName      "GeForce GTX 1080 Ti"
    BusID          "PCI:4:0:0"
EndSection

Section "Screen"
    Identifier     "Screen0"
    Device         "Device0"
    Monitor        "Monitor0"
    DefaultDepth    24
    Option         "AllowEmptyInitialConfiguration" "True"
    Option         "Coolbits" "31"
    SubSection     "Display"
        Depth       24
    EndSubSection
EndSection

Section "Screen"
    Identifier     "Screen1"
    Device         "Device1"
    Monitor        "Monitor1"
    DefaultDepth    24
    Option         "AllowEmptyInitialConfiguration" "True"
    Option         "Coolbits" "31"
    SubSection     "Display"
        Depth       24
    EndSubSection
EndSection

Note that coolbits is set for both of them.

Can you help me?

Thanks! :)

Answer 1

I ran into the same thing last week. It is a driver problem. Try version 390 or 430; I have confirmed that both work on Arch with two 1080 Tis.

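The problem was hard to track down. At first I thought my motherboard's lack of SLI support was the cause, so I swapped in another motherboard and enabled SLI, and then I could adjust the fan speed on both cards. But with SLI enabled, the two cards effectively offer only one card's worth of memory, which forces a smaller batch size, and that is unacceptable. So I turned SLI off, and again I could not adjust the fan speed on both cards. Then I changed the NVIDIA driver, and everything worked. Damn NVIDIA: while swapping motherboards I damaged the LGA socket on the first board, and that damaged socket then fried an i5-9400F. I know I was careless, but if it weren't for the bug in the NVIDIA driver, I wouldn't have gone through all this. (rant)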
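
As for the second point in the question (a fan speed that follows temperature): once manual fan control works on both cards, a small polling script can stand in for a fan profile. Below is a minimal Python sketch, not a tested tool. The curve points, the 5-second interval, and the function names are made up for illustration. It assumes that fan N belongs to GPU N and that nvidia-smi and nvidia-settings number the GPUs in the same order, which may not hold on every system. It also needs a running X session with Coolbits enabled, as in the xorg.conf from the question.

```python
import subprocess
import time

# Hypothetical curve: (temperature in C, fan speed in %). Tune for your cards.
CURVE = [(40, 30), (60, 50), (75, 75), (85, 100)]


def fan_speed_for(temp_c, curve=CURVE):
    """Linearly interpolate a fan percentage from the curve points."""
    if temp_c <= curve[0][0]:
        return curve[0][1]
    for (t0, s0), (t1, s1) in zip(curve, curve[1:]):
        if temp_c <= t1:
            return round(s0 + (s1 - s0) * (temp_c - t0) / (t1 - t0))
    # Hotter than the last point: run the fan at the last point's speed.
    return curve[-1][1]


def read_temperatures():
    """Return one temperature per GPU, in nvidia-smi index order."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=temperature.gpu",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    return [int(value) for value in out.split()]


def fan_command(gpu, speed):
    """Build the nvidia-settings call. Assumes fan N belongs to GPU N."""
    return [
        "nvidia-settings",
        "-a", f"[gpu:{gpu}]/GPUFanControlState=1",
        "-a", f"[fan:{gpu}]/GPUTargetFanSpeed={speed}",
    ]


def main(interval_s=5):
    """Poll temperatures forever and push a new fan target for each GPU."""
    while True:
        for gpu, temp in enumerate(read_temperatures()):
            subprocess.run(fan_command(gpu, fan_speed_for(temp)), check=False)
        time.sleep(interval_s)


if __name__ == "__main__":
    main()
```

Run it from inside the X session (or with DISPLAY pointing at it). If the fan numbering does not match the GPU numbering on your machine, `nvidia-settings -q fans` should list the fans so you can adjust `fan_command` accordingly.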
