Why does peer-to-peer (P2P) access between two Tesla K40c GPUs fail in CUDA?

I want to run a CUDA C program on two Tesla K40 devices with peer-to-peer (P2P) access enabled between them, because my data will be shared across the devices. My machine's deviceQuery summary and nvidia-smi output are below (OS: Windows 10).

deviceQuery

Device 0: "Tesla K40c"
CUDA Driver Version / Runtime Version          10.2 / 10.2
CUDA Device Driver Mode (TCC or WDDM):         TCC
Device PCI Domain ID / Bus ID / location ID:   0 / 21 / 0

Device 1: "Tesla K40c"
CUDA Driver Version / Runtime Version          10.2 / 10.2
CUDA Device Driver Mode (TCC or WDDM):         TCC
Device PCI Domain ID / Bus ID / location ID:   0 / 45 / 0

Device 2: "Quadro P400"
CUDA Driver Version / Runtime Version          10.2 / 10.2
CUDA Device Driver Mode (TCC or WDDM):         WDDM
Device PCI Domain ID / Bus ID / location ID:   0 / 153 / 0

nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 441.22       Driver Version: 441.22       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K40c          TCC  | 00000000:15:00.0 Off |                    0 |
| 23%   36C    P8    24W / 235W |    809MiB / 11448MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K40c          TCC  | 00000000:2D:00.0 Off |                  Off |
| 23%   43C    P8    24W / 235W |    809MiB / 12215MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Quadro P400        WDDM  | 00000000:99:00.0  On |                  N/A |
| 34%   35C    P8    N/A /  N/A |    449MiB /  2048MiB |     14%      Default |
+-------------------------------+----------------------+----------------------+

I hid the Quadro P400 from my CUDA program with `set CUDA_VISIBLE_DEVICES=0,1` and then ran the simpleP2P sample. The sample runs to completion, but its results point to a P2P problem: even though both Tesla devices are installed in PCIe3 x16 slots on CPU0, the reported memcpy speed is only about 0.2 GB/s:

Checking for multiple GPUs...
CUDA-capable device count: 2
Checking GPU(s) for support of peer to peer memory access...
> Peer access from Tesla K40c (GPU0) -> Tesla K40c (GPU1) : Yes
> Peer access from Tesla K40c (GPU1) -> Tesla K40c (GPU0) : Yes
Enabling peer access between GPU0 and GPU1...
Allocating buffers (64MB on GPU0, GPU1 and CPU Host)...
Creating event handles...
cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 0.19GB/s
Preparing host buffer and memcpy to GPU0...
Run kernel on GPU1, taking source data from GPU0 and writing to GPU1...
Run kernel on GPU0, taking source data from GPU1 and writing to GPU0...
Copy data back to host from GPU0 and verify results...
Disabling peer access...
Shutting down...
Test passed
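For context, the bandwidth figure above comes from timing peer copies between the two devices. A minimal sketch of that kind of measurement (not the exact simpleP2P source; the 64 MB buffer size and repeat count are assumptions) looks like this:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 64 * 1024 * 1024;   // 64 MB, as in the sample output
    const int reps = 100;                    // assumed repeat count

    // Check P2P capability in both directions before enabling it.
    int canAccess01 = 0, canAccess10 = 0;
    cudaDeviceCanAccessPeer(&canAccess01, 0, 1);
    cudaDeviceCanAccessPeer(&canAccess10, 1, 0);
    if (!canAccess01 || !canAccess10) { printf("P2P not supported\n"); return 1; }

    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);
    float *d0; cudaMalloc(&d0, bytes);

    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);
    float *d1; cudaMalloc(&d1, bytes);

    // Time repeated GPU0 -> GPU1 peer copies with CUDA events.
    cudaSetDevice(0);
    cudaEvent_t start, stop;
    cudaEventCreate(&start); cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < reps; ++i)
        cudaMemcpyPeer(d1, 1, d0, 0, bytes);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("GPU0 -> GPU1: %.2f GB/s\n",
           (bytes * (double)reps) / (ms / 1000.0) / 1e9);
    return 0;
}
```

On two x16 PCIe3 slots attached to the same CPU, a copy like this should report several GB/s, which is why ~0.2 GB/s stands out.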

When I modified the code slightly to examine P2P performance more closely, the simpleP2P sample failed outright (see this question for the programming details, if you think they are relevant). The tests I have done, and comments from some experts on my post here, suggest this is a system/platform issue. My motherboard is an HP 81C7 with BIOS version v02.47 (the latest as of April 11, 2020). I have also reinstalled the NVIDIA driver and CUDA several times, and tried CUDA 10.1, with no success. Can anyone tell me how to dig into this problem and find its root cause?
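One way to dig further from the CUDA side (before suspecting the BIOS or chipset) is to ask the runtime directly what it knows about the link between the two devices, and compare that with the driver's topology view. A small diagnostic sketch, assuming devices 0 and 1 are the two K40c cards:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Query the CUDA runtime's view of the link between device 0 and device 1.
int main() {
    int access = 0, rank = 0, atomics = 0;
    cudaDeviceGetP2PAttribute(&access,  cudaDevP2PAttrAccessSupported,       0, 1);
    cudaDeviceGetP2PAttribute(&rank,    cudaDevP2PAttrPerformanceRank,       0, 1);
    cudaDeviceGetP2PAttribute(&atomics, cudaDevP2PAttrNativeAtomicSupported, 0, 1);
    printf("access supported: %d, performance rank: %d, native atomics: %d\n",
           access, rank, atomics);

    // Also compare against the driver's PCIe topology view:
    //   nvidia-smi topo -m    (link matrix: PIX / PXB / PHB / NODE / SYS)
    // If the matrix shows the K40c pair connected only through the host
    // bridge (PHB) or across sockets (SYS), P2P traffic is routed through
    // the chipset, which on some OEM boards is exactly this kind of
    // bottleneck.
    return 0;
}
```

If the runtime reports access as supported but `nvidia-smi topo -m` shows a PHB or SYS path between the two Teslas, the slow copies are a property of the board's PCIe routing rather than of the driver or CUDA installation.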
