Error: "Invalid device ordinal value (2). Valid range is [0, 1]. while setting up XLA_GPU_JIT device number 2"

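I am trying to train a neural network for object detection on a server with GPUs. I have a conda environment named ashraf_py and am using Jupyter Notebook, Python 3, and Keras with the TensorFlow backend. When I start training, I get the following error, even though I never selected a device number: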

Epoch 1/40
Exception: Invalid device ordinal value (2). Valid range is [0, 1].
    while setting up XLA_GPU_JIT device number 2

You may suggest that I use this code:

import tensorflow as tf

# Let TensorFlow allocate GPU memory on demand instead of reserving it all up front
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)

I have already tried it. Here is what I get:

2019-04-15 11:20:45.292918: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2019-04-15 11:20:45.314574: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3696000000 Hz
2019-04-15 11:20:45.316533: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x55ded4474170 executing computations on platform Host. Devices:
2019-04-15 11:20:45.316600: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>
2019-04-15 11:20:45.381103: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-04-15 11:20:45.385630: I tensorflow/compiler/xla/service/platform_util.cc:194] StreamExecutor cuda device (2) is of insufficient compute capability: 3.5 required, device is 3.0
2019-04-15 11:20:45.478423: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-04-15 11:20:45.483172: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-04-15 11:20:45.483940: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x55ded3ee4d60 executing computations on platform CUDA. Devices:
2019-04-15 11:20:45.483955: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): GeForce GTX 1080, Compute Capability 6.1
2019-04-15 11:20:45.483963: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (1): GeForce GTX 1060 6GB, Compute Capability 6.1
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/media/DeepData/gpu-users/ashraf/.conda/envs/ashraf_py/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1551, in __init__
    super(Session, self).__init__(target, graph, config=config)
  File "/media/DeepData/gpu-users/ashraf/.conda/envs/ashraf_py/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 676, in __init__
    self._session = tf_session.TF_NewSessionRef(self._graph._c_graph, opts)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Invalid device ordinal value (2). Valid range is [0, 1].
    while setting up XLA_GPU_JIT device number 2

The GPU specifications are shown below. This is the output of "nvidia-smi":

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.48                 Driver Version: 410.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 106...  Off  | 00000000:03:00.0 Off |                  N/A |
|  0%   42C    P8    10W / 120W |    161MiB /  6078MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 650...  Off  | 00000000:04:00.0 N/A |                  N/A |
| 30%   34C    P8    N/A /  N/A |     33MiB /  1999MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 1080    Off  | 00000000:05:00.0 Off |                  N/A |
|  0%   45C    P8    17W / 210W |    123MiB /  8119MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1935      G   /usr/lib/xorg/Xorg                            24MiB |
|    0     18504      G   /usr/lib/xorg/Xorg                            61MiB |
|    0     31427      C   ...ashraf/.conda/envs/ashraf_py/bin/python    63MiB |
|    1                    Not Supported                                       |
|    2     31427      C   ...ashraf/.conda/envs/ashraf_py/bin/python   111MiB |
+-----------------------------------------------------------------------------+

I would appreciate any help with this problem; even a brief explanation would be useful. Thank you very much.

Answer 1

The problem is solved!

The computer could not even run a basic GPU TensorFlow program such as a matrix multiplication.
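For anyone who wants to repeat that check, here is a minimal sketch of such a matrix-multiplication test. It assumes the TensorFlow 1.x API used in the question (tf.ConfigProto, tf.Session); the function name run_gpu_matmul and the choice of "/gpu:0" are my own, for illustration only.

```python
def run_gpu_matmul(device="/gpu:0"):
    """Multiply two small matrices on the given device (TensorFlow 1.x API).

    run_gpu_matmul is a hypothetical helper name, not part of TensorFlow.
    """
    # Imported lazily so that defining the helper does not require TensorFlow.
    import tensorflow as tf

    with tf.Graph().as_default():
        with tf.device(device):
            a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
            b = tf.constant([[1.0, 0.0], [0.0, 1.0]])  # identity matrix
            c = tf.matmul(a, b)

        # Log where each op is placed, and allocate GPU memory on demand.
        config = tf.ConfigProto(log_device_placement=True)
        config.gpu_options.allow_growth = True

        with tf.Session(config=config) as sess:
            # Because b is the identity, the result should equal a.
            return sess.run(c)
```

Calling run_gpu_matmul() on a healthy setup should return the first matrix unchanged. On a machine like the one in the question, I would expect session creation itself to fail, since that is where the traceback above originates.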

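The key point is that the machine I was using has three GPUs. I later found that one of the cards was not working properly. Even specifying which GPU to use made no difference.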

After removing (disconnecting) the faulty GPU card, the program ran perfectly, with no errors.
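If physically removing the card is not an option, a commonly tried software alternative is to hide it from CUDA with environment variables before TensorFlow is imported. This is only a sketch: as noted above, selecting a GPU made no difference on this machine, and the answer does not record how that selection was attempted, so it may not help in the same situation. The helper name restrict_visible_gpus is hypothetical. The indices 0 and 2 are an assumption taken from the "nvidia-smi" listing in the question, where index 1 is the GTX 650 card. CUDA's default device numbering can differ from the "nvidia-smi" order (the log lists the GTX 1080 as device 0), so the sketch also sets CUDA_DEVICE_ORDER to PCI_BUS_ID to make the two agree.

```python
import os


def restrict_visible_gpus(indices):
    """Expose only the given GPU indices to CUDA.

    Hypothetical helper. It must run before TensorFlow (or any other CUDA
    library) is imported, because the variables are read at initialization.
    """
    # Number devices by PCI bus ID so indices match the nvidia-smi listing.
    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in indices)
    return os.environ["CUDA_VISIBLE_DEVICES"]


# Assumption: keep the GTX 1060 (index 0) and GTX 1080 (index 2), hide index 1.
visible = restrict_visible_gpus([0, 2])

# Import TensorFlow only after the variables are set, for example:
# import tensorflow as tf
```

In a Jupyter notebook this has to go in the first cell, and the kernel must be restarted if TensorFlow was already imported.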
