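I am trying to train a neural network for object detection on a server with GPUs. I have an environment named ashraf_py, and I am using Jupyter Notebook, Python 3, and Keras with the TensorFlow backend. When I start training, I get the following error, even though I have not selected a device number: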
Epoch 1/40
Exception: Invalid device ordinal value (2). Valid range is [0, 1].
while setting up XLA_GPU_JIT device number 2
You might tell me to use this code:
import tensorflow as tf
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)
I have already tried it. Here is what I get:
2019-04-15 11:20:45.292918: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2019-04-15 11:20:45.314574: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3696000000 Hz
2019-04-15 11:20:45.316533: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x55ded4474170 executing computations on platform Host. Devices:
2019-04-15 11:20:45.316600: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): <undefined>, <undefined>
2019-04-15 11:20:45.381103: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-04-15 11:20:45.385630: I tensorflow/compiler/xla/service/platform_util.cc:194] StreamExecutor cuda device (2) is of insufficient compute capability: 3.5 required, device is 3.0
2019-04-15 11:20:45.478423: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-04-15 11:20:45.483172: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-04-15 11:20:45.483940: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x55ded3ee4d60 executing computations on platform CUDA. Devices:
2019-04-15 11:20:45.483955: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): GeForce GTX 1080, Compute Capability 6.1
2019-04-15 11:20:45.483963: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (1): GeForce GTX 1060 6GB, Compute Capability 6.1
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/media/DeepData/gpu-users/ashraf/.conda/envs/ashraf_py/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1551, in __init__
super(Session, self).__init__(target, graph, config=config)
File "/media/DeepData/gpu-users/ashraf/.conda/envs/ashraf_py/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 676, in __init__
self._session = tf_session.TF_NewSessionRef(self._graph._c_graph, opts)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Invalid device ordinal value (2). Valid range is [0, 1].
while setting up XLA_GPU_JIT device number 2
The graphics card specifications are shown below. The output is from "nvidia-smi":
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.48                 Driver Version: 410.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 106...  Off  | 00000000:03:00.0 Off |                  N/A |
|  0%   42C    P8    10W / 120W |    161MiB /  6078MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 650...  Off  | 00000000:04:00.0 N/A |                  N/A |
| 30%   34C    P8    N/A /  N/A |     33MiB /  1999MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 1080    Off  | 00000000:05:00.0 Off |                  N/A |
|  0%   45C    P8    17W / 210W |    123MiB /  8119MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1935      G   /usr/lib/xorg/Xorg                            24MiB |
|    0     18504      G   /usr/lib/xorg/Xorg                            61MiB |
|    0     31427      C   ...ashraf/.conda/envs/ashraf_py/bin/python    63MiB |
|    1                    Not Supported                                       |
|    2     31427      C   ...ashraf/.conda/envs/ashraf_py/bin/python   111MiB |
+-----------------------------------------------------------------------------+
I would appreciate any help with this problem; even a simple explanation would be useful. Thank you very much.
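Answer 1
The problem has been solved!
The computer could not even run a basic GPU TensorFlow program such as matrix multiplication.
The point is that the machine I was using has 3 GPUs, and I later found out that one of the cards could not work with TensorFlow. Even specifying which GPU to use made no difference.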
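The log in the question hints at which card this was. TensorFlow's XLA layer reports that CUDA device (2) has compute capability 3.0 while 3.5 is required, and then lists only two usable devices; the later error about ordinal 2 being outside [0, 1] is consistent with that. The card is presumably the GeForce GTX 650 from the nvidia-smi listing, though the answer does not say so. Below is a small sketch for pulling such lines out of a saved log. The function name find_rejected_devices is made up for illustration, and the pattern is copied from the single log line shown in the question, so other TensorFlow versions may word it differently.

```python
import re

# Pattern copied from the platform_util.cc line in the question's log.
# Assumption: other TensorFlow versions may phrase this message differently.
REJECTED = re.compile(
    r"StreamExecutor cuda device \((\d+)\) is of insufficient compute "
    r"capability: ([\d.]+) required, device is ([\d.]+)"
)


def find_rejected_devices(log_text):
    """Return (ordinal, required, actual) for every device the log rejects."""
    return [
        (int(ordinal), float(required), float(actual))
        for ordinal, required, actual in REJECTED.findall(log_text)
    ]


# The relevant line from the question, used here as sample input.
sample = (
    "2019-04-15 11:20:45.385630: I tensorflow/compiler/xla/service/"
    "platform_util.cc:194] StreamExecutor cuda device (2) is of insufficient "
    "compute capability: 3.5 required, device is 3.0"
)

rejected = find_rejected_devices(sample)
print(rejected)
```

Note that the ordinal in the log is CUDA's own numbering, which by default need not match the index column printed by nvidia-smi.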
After I removed/disconnected the non-working GPU card, the program ran perfectly and without errors.
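When physically removing the card is not an option, a software-side alternative that is often tried is to hide the card from CUDA through environment variables. The answer reports that choosing a GPU made no difference but does not say how the choice was made, so the sketch below is an untested suggestion for this machine rather than a confirmed fix. The helper name expose_only_gpus is hypothetical, and the indices 0 and 2 assume the nvidia-smi listing from the question (0 = GTX 1060, 1 = GTX 650, 2 = GTX 1080).

```python
import os


def expose_only_gpus(indices):
    """Restrict CUDA in this process to the given GPU indices.

    Hypothetical helper; it only sets environment variables.
    """
    # Number devices by PCI bus ID so the indices match nvidia-smi;
    # CUDA's default ordering puts the fastest device first instead.
    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in indices)
    return os.environ["CUDA_VISIBLE_DEVICES"]


# Assumption: keep the GTX 1060 (0) and GTX 1080 (2), hide the GTX 650 (1).
visible = expose_only_gpus([0, 2])
print(visible)

# Only after this point would TensorFlow be imported, e.g.:
# import tensorflow as tf
```

The variables are read when CUDA is initialised, so they have to be set before TensorFlow is imported. In a Jupyter notebook that usually means restarting the kernel and running this cell first.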