How do I persistently set the NUMA node of an (NVIDIA) GPU?

2024-6-5

I am running a workstation with an AMD CPU (EPYC 7H12) and an Nvidia GPU (RTX 3090) on Ubuntu 20.04. When using tensorflow, I repeatedly get the following warning, as described in a related question.

I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:939] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero

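One answer suggests identifying the GPU's PCI bus ID and then setting that device's numa_node to 0. In my case, the following worked, once the PCI ID has been identified with lspci | grep NVIDIA (add -D to include the domain ID):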

# 1) Identify the PCI-ID of the GPU (with domain ID)
#    In my case: PCI_ID="0000:81:00.0"
lspci -D | grep NVIDIA
# 2) Write the NUMA affinity to the device's numa_node file.
echo 0 | sudo tee -a "/sys/bus/pci/devices/<PCI_ID>/numa_node"

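However, this is only a superficial fix. First, the numa_node setting is reset (to -1) every time the system reboots. Second, the Nvidia driver seems to ignore this flag, because nvidia-smi (Nvidia's driver management tool) still shows: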

nvidia-smi topo -m
#
#       GPU0  CPU Affinity    NUMA Affinity
# GPU0     X  0-127           N/A
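For comparison, the kernel's own view can be read straight from sysfs. The following is only a minimal sketch: list_nvidia_numa is a hypothetical helper name, and it assumes NVIDIA devices can be recognised by the PCI vendor ID 0x10de in each device's vendor file. The optional argument (a different devices directory) exists only so the function can be tried against a fake tree.

```shell
#!/bin/sh
# Hypothetical helper: print "<PCI_ID> <numa_node>" for every NVIDIA PCI device.
# $1 (optional): PCI devices directory, defaults to /sys/bus/pci/devices.
list_nvidia_numa() {
    root="${1:-/sys/bus/pci/devices}"
    for dev in "$root"/*; do
        # Skip entries without a readable vendor file (also covers an unmatched glob).
        [ -r "$dev/vendor" ] || continue
        # 0x10de is NVIDIA's PCI vendor ID.
        [ "$(cat "$dev/vendor")" = "0x10de" ] || continue
        printf '%s %s\n' "$(basename "$dev")" "$(cat "$dev/numa_node")"
    done
}
```

Calling list_nvidia_numa with no arguments prints one line per NVIDIA PCI function. A GPU's audio function (for example a .1 entry next to the .0 GPU entry) carries the same vendor ID, so it may be listed as well.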

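How can I persistently specify the GPU's NUMA affinity? Is this a matter of configuring the Nvidia driver, Ubuntu, or the BIOS? I know the Linux kernel is NUMA-aware, but I am finding it hard to locate resources on how to configure this.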

Update: I added a crontab entry as root, which makes the fix more persistent. However, the fix is still "superficial", because the Nvidia driver is not aware of it.

sudo crontab -e
# Add the following line
@reboot (echo 0 | tee -a "/sys/bus/pci/devices/<PCI_ID>/numa_node")
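A variant that does not need the PCI ID hard-coded could look roughly like the sketch below. It is the same workaround and makes the same assumptions: fix_nvidia_numa is a hypothetical name, NVIDIA devices are matched by the vendor ID 0x10de, and only devices that currently report -1 are touched. Like the crontab line, it has to run as root and does not change what nvidia-smi reports.

```shell
#!/bin/sh
# Hypothetical helper: set numa_node for every NVIDIA PCI device that reports -1.
# $1 (optional): NUMA node to assign, defaults to 0.
# $2 (optional): PCI devices directory, defaults to /sys/bus/pci/devices.
fix_nvidia_numa() {
    node="${1:-0}"
    root="${2:-/sys/bus/pci/devices}"
    for dev in "$root"/*; do
        # Skip entries without a readable vendor file (also covers an unmatched glob).
        [ -r "$dev/vendor" ] || continue
        # 0x10de is NVIDIA's PCI vendor ID.
        [ "$(cat "$dev/vendor")" = "0x10de" ] || continue
        # Leave devices alone if the kernel already reports a NUMA node.
        if [ "$(cat "$dev/numa_node")" = "-1" ]; then
            echo "$node" > "$dev/numa_node"
        fi
    done
}
```

Saved as a script that ends with a call to fix_nvidia_numa, this could replace the echo in the @reboot entry. On a machine with more than one NUMA node, assigning node 0 to every GPU is probably not what you want, so the node argument would need to be chosen per device.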