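I want to run a CUDA C program on two Tesla K40 devices with peer-to-peer (P2P) access enabled between them, because my data will be shared across the devices. My machine (OS: Windows 10) reports the following deviceQuery summary and nvidia-smi output.
deviceQuery: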
Device 0: "Tesla K40c"
CUDA Driver Version / Runtime Version 10.2 / 10.2
CUDA Device Driver Mode (TCC or WDDM): TCC
Device PCI Domain ID / Bus ID / location ID: 0 / 21 / 0
Device 1: "Tesla K40c"
CUDA Driver Version / Runtime Version 10.2 / 10.2
CUDA Device Driver Mode (TCC or WDDM): TCC
Device PCI Domain ID / Bus ID / location ID: 0 / 45 / 0
Device 2: "Quadro P400"
CUDA Driver Version / Runtime Version 10.2 / 10.2
CUDA Device Driver Mode (TCC or WDDM): WDDM
Device PCI Domain ID / Bus ID / location ID: 0 / 153 / 0
nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 441.22 Driver Version: 441.22 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K40c TCC | 00000000:15:00.0 Off | 0 |
| 23% 36C P8 24W / 235W | 809MiB / 11448MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K40c TCC | 00000000:2D:00.0 Off | Off |
| 23% 43C P8 24W / 235W | 809MiB / 12215MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Quadro P400 WDDM | 00000000:99:00.0 On | N/A |
| 34% 35C P8 N/A / N/A | 449MiB / 2048MiB | 14% Default |
+-------------------------------+----------------------+----------------------+
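I hid the Quadro P400 from my CUDA programs with set CUDA_VISIBLE_DEVICES=0,1
and then ran the simpleP2P sample. The sample runs to completion, but its output points to a P2P problem: although both Tesla devices sit in PCIe 3.0 x16 slots attached to CPU0, the reported memcpy bandwidth is only about 0.2 GB/s: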
Checking for multiple GPUs...
CUDA-capable device count: 2
Checking GPU(s) for support of peer to peer memory access...
> Peer access from Tesla K40c (GPU0) -> Tesla K40c (GPU1) : Yes
> Peer access from Tesla K40c (GPU1) -> Tesla K40c (GPU0) : Yes
Enabling peer access between GPU0 and GPU1...
Allocating buffers (64MB on GPU0, GPU1 and CPU Host)...
Creating event handles...
cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 0.19GB/s
Preparing host buffer and memcpy to GPU0...
Run kernel on GPU1, taking source data from GPU0 and writing to GPU1...
Run kernel on GPU0, taking source data from GPU1 and writing to GPU0...
Copy data back to host from GPU0 and verify results...
Disabling peer access...
Shutting down...
Test passed
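For reference, the bandwidth figure above can be cross-checked outside the sample. The following is a minimal sketch, not the simpleP2P source: it times repeated `cudaMemcpyPeerAsync` copies from device 0 to device 1, first without peer access enabled (where the runtime is expected to stage the copy through host memory) and then with it enabled. The helper names `toGBps` and `timePeerCopy`, the `CHECK` macro, and the repetition count are assumptions made for illustration; the 64 MB buffer size mirrors the sample output.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a readable message if a CUDA runtime call fails.
#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,  \
                         cudaGetErrorString(err_));                  \
            std::exit(EXIT_FAILURE);                                 \
        }                                                            \
    } while (0)

// Convert bytes moved and elapsed milliseconds to GB/s (1 GB = 1e9 bytes).
static double toGBps(double bytes, double ms) {
    return (bytes / 1e9) / (ms / 1e3);
}

// Time `reps` copies of `bytes` from src (on srcDev) to dst (on dstDev).
static double timePeerCopy(void* dst, int dstDev, const void* src, int srcDev,
                           size_t bytes, int reps) {
    cudaEvent_t start, stop;
    CHECK(cudaSetDevice(srcDev));
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));
    // One untimed warm-up copy so setup cost is not measured.
    CHECK(cudaMemcpyPeerAsync(dst, dstDev, src, srcDev, bytes, 0));
    CHECK(cudaEventRecord(start, 0));
    for (int i = 0; i < reps; ++i)
        CHECK(cudaMemcpyPeerAsync(dst, dstDev, src, srcDev, bytes, 0));
    CHECK(cudaEventRecord(stop, 0));
    CHECK(cudaEventSynchronize(stop));
    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    return toGBps(static_cast<double>(bytes) * reps, ms);
}

int main() {
    int count = 0;
    CHECK(cudaGetDeviceCount(&count));
    if (count < 2) {
        std::printf("Need at least two visible CUDA devices.\n");
        return 0;
    }

    const size_t bytes = 64u << 20;  // 64 MB, as in the sample output
    const int reps = 20;             // assumed repetition count
    void *buf0 = nullptr, *buf1 = nullptr;
    CHECK(cudaSetDevice(0));
    CHECK(cudaMalloc(&buf0, bytes));
    CHECK(cudaSetDevice(1));
    CHECK(cudaMalloc(&buf1, bytes));

    // Baseline: peer access has not been enabled yet.
    std::printf("GPU0 -> GPU1 without P2P: %.2f GB/s\n",
                timePeerCopy(buf1, 1, buf0, 0, bytes, reps));

    int can01 = 0, can10 = 0;
    CHECK(cudaDeviceCanAccessPeer(&can01, 0, 1));
    CHECK(cudaDeviceCanAccessPeer(&can10, 1, 0));
    if (can01 && can10) {
        CHECK(cudaSetDevice(0));
        CHECK(cudaDeviceEnablePeerAccess(1, 0));
        CHECK(cudaSetDevice(1));
        CHECK(cudaDeviceEnablePeerAccess(0, 0));
        std::printf("GPU0 -> GPU1 with P2P:    %.2f GB/s\n",
                    timePeerCopy(buf1, 1, buf0, 0, bytes, reps));
        CHECK(cudaSetDevice(0));
        CHECK(cudaDeviceDisablePeerAccess(1));
        CHECK(cudaSetDevice(1));
        CHECK(cudaDeviceDisablePeerAccess(0));
    } else {
        std::printf("Peer access is not reported as supported.\n");
    }

    CHECK(cudaSetDevice(0));
    CHECK(cudaFree(buf0));
    CHECK(cudaSetDevice(1));
    CHECK(cudaFree(buf1));
    return 0;
}
```

If the two numbers come out similar, or the P2P figure is the lower one, that would be consistent with the direct path between the GPUs not performing as expected on this platform, though it does not by itself identify the cause.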
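When I modify the code slightly to examine P2P performance more closely, the simpleP2P sample also fails (see this question for the programming details, if you think they are relevant). The tests I have run, together with comments some experts left on my post here, suggest that this is a system/platform issue. My motherboard is an HP 81C7 with BIOS v02.47 (the latest version as of April 11, 2020). I have reinstalled the NVIDIA driver and CUDA several times, and also tried CUDA 10.1, without success. Can anyone tell me how to dig into this and find the root cause?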
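One possible first diagnostic step is to print what the CUDA runtime itself reports for each visible device and each device pair. The sketch below is only an illustration and assumes nothing beyond the CUDA runtime API: it lists the name, PCI location, driver mode, and ECC state of each device (the ECC column appears to differ between the two K40c boards in the nvidia-smi output), then queries `cudaDeviceGetP2PAttribute` for every ordered pair. The helper `yesNo` is a hypothetical name introduced here.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: render a flag as text.
static const char* yesNo(int flag) { return flag ? "Yes" : "No"; }

int main() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceCount failed\n");
        return 1;
    }

    // Per-device properties as seen after CUDA_VISIBLE_DEVICES filtering.
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, i) != cudaSuccess) continue;
        std::printf("Device %d: %s, PCI %04x:%02x:%02x, TCC: %s, ECC: %s\n",
                    i, prop.name, prop.pciDomainID, prop.pciBusID,
                    prop.pciDeviceID, yesNo(prop.tccDriver),
                    yesNo(prop.ECCEnabled));
    }

    // Per-pair P2P attributes reported by the runtime.
    for (int src = 0; src < count; ++src) {
        for (int dst = 0; dst < count; ++dst) {
            if (src == dst) continue;
            int access = 0, rank = 0, atomics = 0;
            cudaDeviceGetP2PAttribute(&access, cudaDevP2PAttrAccessSupported,
                                      src, dst);
            cudaDeviceGetP2PAttribute(&rank, cudaDevP2PAttrPerformanceRank,
                                      src, dst);
            cudaDeviceGetP2PAttribute(&atomics,
                                      cudaDevP2PAttrNativeAtomicSupported,
                                      src, dst);
            std::printf("GPU%d -> GPU%d: access %s, performance rank %d, "
                        "native atomics %s\n",
                        src, dst, yesNo(access), rank, yesNo(atomics));
        }
    }
    return 0;
}
```

Running it with and without `CUDA_VISIBLE_DEVICES=0,1` set should show whether the device numbering and the reported peer attributes match what deviceQuery and simpleP2P print.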