Is NVLINK behaving badly on 2080 ti cards?

I'm having problems with a pair of RTX cards using NVLink, and I'm wondering whether anyone with more experience with this technology could look at the output below and tell me whether something is wrong.

The setup is a pair of MSI 2080 ti cards with an ASUS RTX NVLINK bridge in a Ryzen/X370 system, running Ubuntu 18.04 Linux with several versions of the Nvidia drivers.

nvidia-smi calls run very slowly, and caffe and the CUDA sample programs behave erratically.

Programs like Caffe behave badly when run on both GPUs (i.e. with caffe --gpu 0,1). Setup can take 20 minutes to complete (something that takes only seconds for googlenet on a single GPU), and then training sometimes proceeds as expected and sometimes freezes after a few iterations.

The output below doesn't look right to me. Is it actually strange, or is my understanding off? Any help is much appreciated!

I'm running nvidia-persistenced as a daemon under my own user account ID.

Details...

$ nvidia-smi -L    # Takes over a minute to finish running.
GPU 0: GeForce RTX 2080 Ti (UUID: GPU-dd1093e0-466f-7322-e214-351b015045d9)
GPU 1: GeForce RTX 2080 Ti (UUID: GPU-2a386612-018c-e3fe-3fd4-1dde588af45d)
$ nvidia-smi nvlink --status    # Takes over a minute to finish running.
GPU 0: GeForce RTX 2080 Ti (UUID: GPU-dd1093e0-466f-7322-e214-351b015045d9)
         Link 0: 25.781 GB/s
         Link 1: <inactive>
GPU 1: GeForce RTX 2080 Ti (UUID: GPU-2a386612-018c-e3fe-3fd4-1dde588af45d)
         Link 0: 25.781 GB/s
         Link 1: <inactive>

Shouldn't both links (0 and 1) be active here?
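For reference, the same per-link state can also be queried through NVML, which is where nvidia-smi gets it. Below is a minimal sketch of that query (my own test program, not one of the toolkit samples; the file name and build line are assumptions):

// nvlink_state.cu - minimal NVML sketch: print the NVLink state of every link on every GPU.
// Build (assumption): nvcc nvlink_state.cu -o nvlink_state -lnvidia-ml
#include <stdio.h>
#include <nvml.h>

int main(void)
{
    if (nvmlInit() != NVML_SUCCESS) {
        fprintf(stderr, "nvmlInit failed\n");
        return 1;
    }

    unsigned int deviceCount = 0;
    nvmlDeviceGetCount(&deviceCount);

    for (unsigned int i = 0; i < deviceCount; ++i) {
        nvmlDevice_t dev;
        char name[NVML_DEVICE_NAME_BUFFER_SIZE];
        nvmlDeviceGetHandleByIndex(i, &dev);
        nvmlDeviceGetName(dev, name, sizeof(name));
        printf("GPU %u: %s\n", i, name);

        // Walk all possible link indices; unpopulated links return an error and are skipped.
        for (unsigned int link = 0; link < NVML_NVLINK_MAX_LINKS; ++link) {
            nvmlEnableState_t active;
            if (nvmlDeviceGetNvLinkState(dev, link, &active) != NVML_SUCCESS)
                continue;
            printf("  Link %u: %s\n", link,
                   active == NVML_FEATURE_ENABLED ? "active" : "inactive");
        }
    }

    nvmlShutdown();
    return 0;
}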

$ nvidia-smi nvlink -c     # Takes several minutes to finish running.
GPU 0: GeForce RTX 2080 Ti (UUID: GPU-dd1093e0-466f-7322-e214-351b015045d9)
         Link 0, P2P is supported: true
         Link 0, Access to system memory supported: true
         Link 0, P2P atomics supported: true
         Link 0, System memory atomics supported: true
         Link 0, SLI is supported: true
         Link 0, Link is supported: false
GPU 1: GeForce RTX 2080 Ti (UUID: GPU-2a386612-018c-e3fe-3fd4-1dde588af45d)
         Link 0, P2P is supported: true
         Link 0, Access to system memory supported: true
         Link 0, P2P atomics supported: true
         Link 0, System memory atomics supported: true
         Link 0, SLI is supported: true
         Link 0, Link is supported: false

Shouldn't both Link 0 and Link 1 show up here?
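The per-link capability flags come from the same NVML interface; a similar minimal sketch (again my own, with the same caveats about file name and build line) walks the capability enum for each populated link:

// nvlink_caps.cu - minimal NVML sketch: print the NVLink capability flags per link.
// Build (assumption): nvcc nvlink_caps.cu -o nvlink_caps -lnvidia-ml
#include <stdio.h>
#include <nvml.h>

static const char *capNames[NVML_NVLINK_CAP_COUNT] = {
    "P2P is supported", "Access to system memory supported", "P2P atomics supported",
    "System memory atomics supported", "SLI is supported", "Link is supported"
};

int main(void)
{
    nvmlInit();

    unsigned int deviceCount = 0;
    nvmlDeviceGetCount(&deviceCount);

    for (unsigned int i = 0; i < deviceCount; ++i) {
        nvmlDevice_t dev;
        nvmlDeviceGetHandleByIndex(i, &dev);
        printf("GPU %u:\n", i);

        for (unsigned int link = 0; link < NVML_NVLINK_MAX_LINKS; ++link) {
            nvmlEnableState_t active;
            // Skip link indices that are not populated on this board.
            if (nvmlDeviceGetNvLinkState(dev, link, &active) != NVML_SUCCESS)
                continue;

            for (int cap = 0; cap < NVML_NVLINK_CAP_COUNT; ++cap) {
                unsigned int result = 0;
                if (nvmlDeviceGetNvLinkCapability(dev, link,
                        (nvmlNvLinkCapability_t)cap, &result) == NVML_SUCCESS)
                    printf("  Link %u, %s: %s\n", link, capNames[cap],
                           result ? "true" : "false");
            }
        }
    }

    nvmlShutdown();
    return 0;
}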


$ nvidia-smi topo --matrix    # Takes over a minute to finish running.

        GPU0    GPU1    CPU Affinity
GPU0     X      NV1     0-11
GPU1    NV1      X      0-11

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing a single PCIe switch
  NV#  = Connection traversing a bonded set of # NVLinks

Shouldn't we see NV2 here (i.e., the 2080 ti bridge provides a pair of NVLink 'links')?
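Separately from the topology matrix, the CUDA runtime can be asked directly what it thinks of the GPU0 <-> GPU1 peer path. A minimal sketch (my own; it assumes the two 2080 Ti cards are devices 0 and 1):

// p2p_attrs.cu - minimal sketch: query the CUDA runtime's view of the GPU0 -> GPU1 peer path.
// Build (assumption): nvcc p2p_attrs.cu -o p2p_attrs
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const int dev0 = 0, dev1 = 1;   // assumed to be the two 2080 Ti cards
    int canAccess = 0, rank = 0, atomics = 0;

    cudaDeviceCanAccessPeer(&canAccess, dev0, dev1);
    cudaDeviceGetP2PAttribute(&rank, cudaDevP2PAttrPerformanceRank, dev0, dev1);
    cudaDeviceGetP2PAttribute(&atomics, cudaDevP2PAttrNativeAtomicSupported, dev0, dev1);

    printf("GPU%d -> GPU%d peer access supported: %d\n", dev0, dev1, canAccess);
    printf("GPU%d -> GPU%d performance rank     : %d\n", dev0, dev1, rank);
    printf("GPU%d -> GPU%d native atomics       : %d\n", dev0, dev1, atomics);
    return 0;
}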

$ simpleP2P
[./simpleP2P] - Starting...
Checking for multiple GPUs...
CUDA-capable device count: 2
> GPU0 = "GeForce RTX 2080 Ti" IS  capable of Peer-to-Peer (P2P)
> GPU1 = "GeForce RTX 2080 Ti" IS  capable of Peer-to-Peer (P2P)

Checking GPU(s) for support of peer to peer memory access...
> Peer access from GeForce RTX 2080 Ti (GPU0) -> GeForce RTX 2080 Ti (GPU1) : Yes
> Peer access from GeForce RTX 2080 Ti (GPU1) -> GeForce RTX 2080 Ti (GPU0) : Yes
Enabling peer access between GPU0 and GPU1...
Checking GPU0 and GPU1 for UVA capabilities...
> GeForce RTX 2080 Ti (GPU0) supports UVA: Yes
> GeForce RTX 2080 Ti (GPU1) supports UVA: Yes
Both GPUs can support UVA, enabling...
Allocating buffers (64MB on GPU0, GPU1 and CPU Host)...
Creating event handles...
cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 22.52GB/s
Preparing host buffer and memcpy to GPU0...
Run kernel on GPU1, taking source data from GPU0 and writing to GPU1...
Run kernel on GPU0, taking source data from GPU1 and writing to GPU0...
Copy data back to host from GPU0 and verify results...
Disabling peer access...
Shutting down...
Test passed

As far as I can tell, cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1 should manage something like 44 GB/s, not 22 GB/s?
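For context, the simpleP2P number boils down to timing cudaMemcpyPeer between the two devices. This is my own stripped-down sketch of that measurement, not the sample itself; the 64 MB buffer and 100 iterations are arbitrary choices:

// peer_bw.cu - minimal sketch: time repeated cudaMemcpyPeer copies from GPU0 to GPU1.
// Build (assumption): nvcc peer_bw.cu -o peer_bw
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 64u << 20;   // 64 MB per copy, the same size simpleP2P uses
    const int    iters = 100;

    void *buf0 = nullptr, *buf1 = nullptr;
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);
    cudaMalloc(&buf0, bytes);
    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);
    cudaMalloc(&buf1, bytes);

    // Time the copies with events recorded on device 0's default stream.
    cudaSetDevice(0);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        cudaMemcpyPeer(buf1, 1, buf0, 0, bytes);   // GPU0 -> GPU1
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("GPU0 -> GPU1 cudaMemcpyPeer: %.2f GB/s\n",
           (double)bytes * iters / (ms / 1e3) / 1e9);
    return 0;
}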

$ p2pBandwidthLatencyTest

[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, GeForce RTX 2080 Ti, pciBusID: a, pciDeviceID: 0, pciDomainID:0
Device: 1, GeForce RTX 2080 Ti, pciBusID: b, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=1 CAN Access Peer Device=0

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
     D\D     0     1
     0       1     1
     1       1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 529.48   3.20
     1   3.19 532.01
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D     0      1
     0 531.71  24.23
     1  24.23 530.74
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 533.58   6.30
     1   6.31 526.98
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 525.50  48.37
     1  48.41 523.52
P2P=Disabled Latency Matrix (us)
   GPU     0      1
     0   1.26  12.74
     1  15.19   1.44

   CPU     0      1
     0   3.92   9.03
     1   8.86   3.82
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU     0      1
     0   1.25   0.92
     1   0.96   1.44

   CPU     0      1
     0   4.22   2.78
     1   2.78   3.76

But here, shouldn't the "Bidirectional P2P=Enabled Bandwidth Matrix" show something like 96 GB/s rather than 48.41?
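The bidirectional figure comes from issuing the two copy directions concurrently on separate streams, one per device. A rough sketch of that (again my own simplification of what p2pBandwidthLatencyTest does, with arbitrary sizes):

// peer_bw_bidir.cu - minimal sketch: concurrent GPU0->GPU1 and GPU1->GPU0 peer copies.
// Build (assumption): nvcc peer_bw_bidir.cu -o peer_bw_bidir
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 64u << 20;
    const int    iters = 100;

    void *a0, *a1, *b0, *b1;
    cudaStream_t s0, s1;

    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);
    cudaMalloc(&a0, bytes);
    cudaMalloc(&b0, bytes);
    cudaStreamCreate(&s0);

    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);
    cudaMalloc(&a1, bytes);
    cudaMalloc(&b1, bytes);
    cudaStreamCreate(&s1);

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) {
        cudaMemcpyPeerAsync(a1, 1, a0, 0, bytes, s0);  // GPU0 -> GPU1 on device 0's stream
        cudaMemcpyPeerAsync(b0, 0, b1, 1, bytes, s1);  // GPU1 -> GPU0 on device 1's stream
    }
    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);
    auto t1 = std::chrono::steady_clock::now();

    double secs = std::chrono::duration<double>(t1 - t0).count();
    printf("Bidirectional GPU0 <-> GPU1: %.2f GB/s\n",
           2.0 * bytes * iters / secs / 1e9);           // count both directions
    return 0;
}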

Answer 1

I had a very similar experience, and the root cause was an improperly seated NVLink bridge. After removing the NVLink bridge, you can double-check whether the nvidia-smi commands respond at the expected speed. The bridge has two side connectors, and both need to make proper contact with the GPUs; otherwise nvidia-smi nvlink --status reports the link as inactive, which can lead to the slow responses or hangs. With only one of the two links active you would also expect roughly half the bandwidth, about 25 GB/s unidirectional and 50 GB/s bidirectional instead of around 50 / 100 GB/s, which matches the numbers you posted.
