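I have a computer with the following specifications:
- OS: Ubuntu 14.04
- GPU: NVIDIA GTX1080ti
About a year ago I installed the OS and then installed CUDA8.0 together with the NVIDIA driver on this machine. The GPU and CUDA worked fine until today, when I tried to install a newer version of CUDA.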
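For some reason, I tried to install CUDA10.0 to replace the currently installed CUDA8.0. First, I uninstalled the old driver using nvidia-uninstall. Then I uninstalled the old CUDA using /usr/local/cuda-8.0/bin/uninstall_cuda_8.0.pl. After that, I installed CUDA10.0 and the new driver using the runfile installer downloaded from this page. But the installation failed. After several failed debugging attempts I gave up, uninstalled the new driver and the new CUDA, and reinstalled CUDA8.0 using the runfile installer downloaded from this page. That installation succeeded. But I can no longer launch anything related to CUDA, including pycuda, pyopencl and tensorflow. All of these packages report that they cannot find a GPU device.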
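Update:
I have tried uninstalling all NVIDIA components with sudo apt-get --purge remove nvidia-*, nvidia-uninstall and uninstall_cuda_8.0.pl, but the problem remains. The error reports and system logs have also changed. Here are the current system logs: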
In the Python CLI, pycuda fails:
Python 2.7.6 (default, Nov 23 2017, 15:49:48)
[GCC 4.8.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import pycuda.driver as cuda
>>> import pycuda.autoinit
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/pycuda/autoinit.py", line 5, in <module>
    cuda.init()
pycuda._driver.RuntimeError: cuInit failed: no CUDA-capable device is detected
>>>
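The traceback above shows that the failure comes from cuInit in the driver library, not from pycuda itself. As a rough sketch, the same call can be made directly through ctypes to rule out the Python packages. The library name libcuda.so.1 and the helper name probe_cuda_driver are assumptions made for illustration; cuInit and cuDeviceGetCount are CUDA driver API functions.

```python
import ctypes


def probe_cuda_driver(lib_name="libcuda.so.1"):
    """Call cuInit and cuDeviceGetCount directly; return (status, detail).

    probe_cuda_driver is a hypothetical helper, not part of pycuda or CUDA.
    """
    try:
        libcuda = ctypes.CDLL(lib_name)
    except OSError as exc:
        # The user-space driver library could not be loaded at all.
        return ("no-library", str(exc))

    rc = libcuda.cuInit(0)
    if rc != 0:
        # Non-zero CUresult; 100 is CUDA_ERROR_NO_DEVICE.
        return ("init-failed", rc)

    count = ctypes.c_int(0)
    rc = libcuda.cuDeviceGetCount(ctypes.byref(count))
    if rc != 0:
        return ("count-failed", rc)
    return ("ok", count.value)


if __name__ == "__main__":
    print(probe_cuda_driver())
```

If this also reports an initialization failure, the problem is most likely below the Python layer, in the driver installation itself.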
nvidia-smi reports:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.26                 Driver Version: 375.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  ERR!                Off  | 0000:01:00.0      On |                  N/A |
| 28%   52C    P8    15W / 300W |     43MiB / 11168MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      1868    G   /usr/lib/xorg/Xorg                              40MiB |
+-----------------------------------------------------------------------------+
dmesg | grep nvidia reports:
[ 2.370841] nvidia: loading out-of-tree module taints kernel.
[ 2.370844] nvidia: module license 'NVIDIA' taints kernel.
[ 2.374116] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[ 2.380809] nvidia-nvlink: Nvlink Core is being initialized, major device number 242
[ 2.383631] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 375.26 Thu Dec 8 18:04:14 PST 2016
[ 2.385803] [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
[ 2.717844] init: nvidia-prime main process (1094) terminated with status 127
[ 7.447032] nvidia-modeset: Allocated GPU:0 (GPU-3727ccd9-f1fc-78c9-f908-5e1edf205194) @ PCI:0000:01:00.0
[ 72.737634] nvidia-uvm: Loaded the UVM driver in 8 mode, major device number 241
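After several install and uninstall rounds, one simple sanity check is whether the kernel module and nvidia-smi report the same driver version. The sketch below extracts the version string from each log and compares them. The function names are made up for illustration, and the regular expressions assume the log wording shown in this post. In the logs here both sides report 375.26, so a version mix-up between the two does not appear to be the issue.

```python
import re

# One or more dot-separated numeric groups, e.g. "375.26".
VERSION = r"(\d+(?:\.\d+)+)"


def kernel_module_version(dmesg_text):
    """Version announced by nvidia-modeset in dmesg output, or None."""
    match = re.search(
        r"Kernel Mode Setting Driver for UNIX platforms\s+" + VERSION,
        dmesg_text,
    )
    return match.group(1) if match else None


def smi_driver_version(smi_text):
    """Version from the 'Driver Version' field of nvidia-smi output, or None."""
    match = re.search(r"Driver Version\s*:\s*" + VERSION, smi_text)
    return match.group(1) if match else None


def versions_agree(dmesg_text, smi_text):
    """True/False if both versions were found, None if either is missing."""
    kernel = kernel_module_version(dmesg_text)
    smi = smi_driver_version(smi_text)
    if kernel is None or smi is None:
        return None
    return kernel == smi
```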
nvidia-smi -a reports (note that the Product Name field is Unknown Error):
==============NVSMI LOG==============
Timestamp : Thu Sep 27 10:16:41 2018
Driver Version : 375.26
Attached GPUs : 1
GPU 0000:01:00.0
Product Name : Unknown Error
Product Brand : GeForce
Display Mode : Enabled
Display Active : Enabled
Persistence Mode : Disabled
Accounting Mode : Disabled
Accounting Mode Buffer Size : 1920
Driver Model
Current : N/A
Pending : N/A
Serial Number : N/A
GPU UUID : GPU-3727ccd9-f1fc-78c9-f908-5e1edf205194
Minor Number : 0
VBIOS Version : 86.02.40.00.2E
MultiGPU Board : No
Board ID : 0x100
GPU Part Number : N/A
Inforom Version
Image Version : G001.0000.01.04
OEM Object : 1.1
ECC Object : N/A
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GPU Virtualization Mode
Virtualization mode : None
PCI
Bus : 0x01
Device : 0x00
Domain : 0x0000
Device Id : 0x1B0610DE
Bus Id : 0000:01:00.0
Sub System Id : 0x11117377
GPU Link Info
PCIe Generation
Max : 3
Current : 1
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays since reset : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Fan Speed : 0 %
Performance State : P8
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
Sync Boost : Not Active
Unknown : Not Active
FB Memory Usage
Total : 11168 MiB
Used : 43 MiB
Free : 11125 MiB
BAR1 Memory Usage
Total : 256 MiB
Used : 5 MiB
Free : 251 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 2 %
Encoder : 0 %
Decoder : 0 %
Ecc Mode
Current : N/A
Pending : N/A
ECC Errors
Volatile
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
Total : N/A
Aggregate
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
Total : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending : N/A
Temperature
GPU Current Temp : 43 C
GPU Shutdown Temp : 96 C
GPU Slowdown Temp : 93 C
Power Readings
Power Management : Supported
Power Draw : 14.68 W
Power Limit : 300.00 W
Default Power Limit : 300.00 W
Enforced Power Limit : 300.00 W
Min Power Limit : 125.00 W
Max Power Limit : 330.00 W
Clocks
Graphics : 240 MHz
SM : 240 MHz
Memory : 405 MHz
Video : 544 MHz
Applications Clocks
Graphics : N/A
Memory : N/A
Default Applications Clocks
Graphics : N/A
Memory : N/A
Max Clocks
Graphics : 1999 MHz
SM : 1999 MHz
Memory : 5505 MHz
Video : 1708 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Processes
Process ID : 1868
Type : G
Name : /usr/lib/xorg/Xorg
Used GPU Memory : 40 MiB
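In a log this long, error values are easy to miss. A minimal sketch for scanning the "key : value" lines of the nvidia-smi -a output for error markers follows. The helper name find_error_fields and the marker list are assumptions chosen for illustration, based only on the two error strings that appear in this post.

```python
def find_error_fields(smi_log, markers=("Unknown Error", "ERR!")):
    """Return (field, value) pairs whose value is one of the error markers.

    find_error_fields is a hypothetical helper; it only understands the
    simple 'key : value' line layout of `nvidia-smi -a`.
    """
    hits = []
    for line in smi_log.splitlines():
        # Split on the first " : " so values containing colons
        # (timestamps, bus IDs) stay intact.
        key, sep, value = line.partition(" : ")
        if sep and value.strip() in markers:
            hits.append((key.strip(), value.strip()))
    return hits


if __name__ == "__main__":
    sample = (
        "Driver Version : 375.26\n"
        "Product Name : Unknown Error\n"
        "Product Brand : GeForce\n"
    )
    print(find_error_fields(sample))
```

For the log above, the only field this would flag is Product Name.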
I don't know where the problem is or how to solve it. Can anyone help me?
Answer 1
Try running a CUDA program as root. I have seen this in a similar situation on a 14.04 machine. It should fix the problem until the next reboot.
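The answer does not say why running as root helps. One possible explanation, offered here as an assumption and not something the answer states, is that the /dev/nvidia* device nodes are missing or not accessible to the regular user until a root process creates them. A small sketch for checking this before and after the root run is below. The node list and the helper name check_device_nodes are illustrative guesses for a single-GPU machine.

```python
import os

# Assumed device nodes for a single-GPU system; adjust as needed.
DEFAULT_NODES = ("/dev/nvidiactl", "/dev/nvidia0", "/dev/nvidia-uvm")


def check_device_nodes(paths=DEFAULT_NODES):
    """Map each path to 'missing', 'no-access' or 'ok' for the current user."""
    report = {}
    for path in paths:
        if not os.path.exists(path):
            report[path] = "missing"
        elif os.access(path, os.R_OK | os.W_OK):
            report[path] = "ok"
        else:
            report[path] = "no-access"
    return report


if __name__ == "__main__":
    for node, state in sorted(check_device_nodes().items()):
        print("%s: %s" % (node, state))
```

Running it as the normal user before and after the root run would show whether any node changes state.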