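I'm experiencing extremely slow memory allocation on my NVIDIA GPU in Python.
When I run a GPU computation in a fresh Python session, tensorflow/pytorch allocates memory in tiny increments for about four minutes, then suddenly allocates a large chunk of memory and performs the actual computation. All subsequent computations run immediately.
Does anyone know what is going wrong, or how to get a log of what is actually happening during memory allocation?
I tried reinstalling the CUDA libraries and the NVIDIA driver. Reinstalling the driver fixes the problem temporarily, but then memory allocation hangs again.
Python output: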
Python 3.11.3 (main, Apr 5 2023, 14:15:06) [GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import timeit
>>> timeit.timeit('import tensorflow as tf;tf.random.uniform([10])', number=1)
2023-04-17 09:08:24.062130: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-04-17 09:08:24.641429: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2023-04-17 09:12:12.879503: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 21368 MB memory: -> device: 0, name: GRID RTX6000-24Q, pci bus id: 0000:02:02.0, compute capability: 7.5
229.68861908599501
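The single timeit call above lumps the import, GPU discovery, and the first operation into one number. A minimal sketch for timing those steps separately, assuming TensorFlow is installed; `time_phases` and `tensorflow_phases` are hypothetical helper names, not part of any library, and the three-way split is only a guess at where the delay might sit:

```python
import time


def time_phases(phases):
    """Run each (name, callable) pair in order and return [(name, seconds)]."""
    results = []
    for name, fn in phases:
        start = time.perf_counter()
        fn()
        results.append((name, time.perf_counter() - start))
    return results


def tensorflow_phases():
    """Build the phases for the TensorFlow case (assumes tensorflow is installed).

    Nothing is imported until the returned callables are actually run.
    """
    state = {}

    def do_import():
        import tensorflow as tf
        state["tf"] = tf

    def list_gpus():
        # Enumerates the GPUs visible to TensorFlow.
        state["tf"].config.list_physical_devices("GPU")

    def first_op():
        # Same operation as in the timeit call above.
        state["tf"].random.uniform([10])

    return [("import", do_import), ("list GPUs", list_gpus), ("first op", first_op)]
```

Running `for name, seconds in time_phases(tensorflow_phases()): print(name, seconds)` in a fresh session should show which step accounts for most of the wait.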
NVIDIA-SMI:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.182.03   Driver Version: 470.182.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GRID RTX6000-24Q    On   | 00000000:02:02.0 Off |                  N/A |
| N/A   N/A    P8    N/A /  N/A |  23527MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    122079      C   ...Model-js4zUkog/bin/python    21743MiB |
+-----------------------------------------------------------------------------+
NVCC:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
Answer 1
I found that the slow memory allocation was caused by NVIDIA throttling my GPU because it could not verify the license.
I checked:
sudo cat /var/log/syslog | grep nvidia
and found:
Apr 18 11:35:43 srv-apu102 nvidia-gridd: Valid GRID license not found. GPU features and performance are restricted. To enable full functionality please configure licensing details.
Apr 18 11:42:32 srv-apu102 nvidia-gridd: Acquiring license. (Info: http://10.1.2.56:7070/request; NVIDIA RTX Virtual Workstation)
Apr 18 11:42:32 srv-apu102 nvidia-gridd: Calling load_byte_array(tra)
Apr 18 11:42:35 srv-apu102 nvidia-gridd: Error: Failed server communication. Server URL : http://10.1.2.56:7070/request - #012[1,7e2,2,0[74000008,7,110001f3]] Generic communications error.#012[1,7e2,2,0[75000001,7,30010255]] General data transfer failure. Couldn't connect to server
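For anyone who wants to script the same check, here is a minimal sketch that filters syslog for nvidia-gridd licensing problems. It assumes the traditional syslog line format shown in the log above; `find_license_problems` and `scan_syslog` are hypothetical helper names, and the marker strings are copied from my log, so other driver versions may word them differently:

```python
import re

# Matches a traditional syslog line: "Apr 18 11:35:43 host nvidia-gridd: message".
# An optional "[pid]" after the program name is tolerated.
GRIDD_LINE = re.compile(
    r"^(?P<timestamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+"
    r"nvidia-gridd(?:\[\d+\])?:\s+(?P<message>.*)$"
)

# Substrings taken from the log excerpt above (assumption: wording may vary by version).
PROBLEM_MARKERS = ("Valid GRID license not found", "Failed server communication")


def find_license_problems(lines):
    """Return (timestamp, message) for nvidia-gridd lines reporting a licensing problem."""
    problems = []
    for line in lines:
        match = GRIDD_LINE.match(line.strip())
        if match and any(marker in match["message"] for marker in PROBLEM_MARKERS):
            problems.append((match["timestamp"], match["message"]))
    return problems


def scan_syslog(path="/var/log/syslog"):
    """Scan a syslog file; reading it typically needs elevated permissions."""
    with open(path, errors="replace") as log:
        return find_license_problems(log)
```

An empty result from `scan_syslog()` only means that neither marker string was found, not that licensing is working.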
I hope this helps someone else, no matter how many downvotes I get from people who seem to think my question was about Ubuntu Lunar...