Installing CUDA 10.0 on Ubuntu 16.04 (for DGX-1)

I am trying to install CUDA 10.0 on a DGX-1 server running Ubuntu 16.04. I followed the instructions in the "Runfile Installation" section: https://docs.nvidia.com/cuda/archive/10.0/cuda-installation-guide-linux/index.html#runfile

I chose to install the CUDA driver, the CUDA toolkit, and the CUDA samples.

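Before that, I removed the previous versions of the NVIDIA driver and CUDA with the following commands (as suggested in "How can I install CUDA on Ubuntu 16.04?"):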

sudo apt-get purge nvidia-cuda*
sudo apt-get purge nvidia-*

After step 4.2.6 (i.e. "Reboot the system to reload the graphical interface."), I checked the CUDA version as follows:

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130

However, when I run "nvidia-smi", I get the following error:

nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

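I then moved on to section 4.4 (Device Node Verification) and found that the device files "/dev/nvidia*" do not exist. I tried to create them manually, but running "modprobe" returns an error: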

sudo /sbin/modprobe nvidia
modprobe: ERROR: could not insert 'nvidia': Exec format error
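
One possible cause of an "Exec format error" from modprobe is a kernel module that was built for a different kernel than the one currently running. The following is only a sketch of how that could be checked, not a confirmed diagnosis of this case. The helper names vermagic_matches and check_nvidia_module are hypothetical, and the sample vermagic string in the comment is just an illustration of the usual format.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when a module's vermagic string
# starts with the given kernel release.
#   $1: vermagic string, e.g. "4.4.0-142-generic SMP mod_unload modversions"
#   $2: kernel release, e.g. the output of `uname -r`
vermagic_matches() {
    case "$1" in
        "$2"|"$2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical helper: compares the installed nvidia module against
# the running kernel and prints what it found.
check_nvidia_module() {
    running=$(uname -r)
    # modinfo fails if no nvidia module is installed for this kernel.
    built_for=$(modinfo -F vermagic nvidia 2>/dev/null) || {
        echo "no nvidia module found for kernel $running"
        return 2
    }
    if vermagic_matches "$built_for" "$running"; then
        echo "vermagic matches running kernel $running"
    else
        echo "mismatch: module built for '$built_for', running '$running'"
        return 1
    fi
}
```

Calling check_nvidia_module after sourcing the script, and then looking at the last lines of dmesg right after the failed modprobe, may show whether the kernel rejected the module because of a version or compiler mismatch.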

Please help me resolve this issue. Thanks!

==========================================================================
Additional details:

lspci | grep -i nvidia
06:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] (rev a1)
07:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] (rev a1)
0a:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] (rev a1)
0b:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] (rev a1)
85:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] (rev a1)
86:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] (rev a1)
89:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] (rev a1)
8a:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] (rev a1)

uname -m && cat /etc/*release
x86_64
DGX_NAME="DGX Server"
DGX_PRETTY_NAME="NVIDIA DGX Server"
DGX_SWBUILD_DATE="2018-03-20"
DGX_SWBUILD_VERSION="3.1.6"
DGX_COMMIT_ID="1b0f58ecbf989820ce745a9e4836e1de5eea6cfd"
DGX_SERIAL_NUMBER=QTFCOU8280021
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.04
DISTRIB_CODENAME=xenial
DISTRIB_DESCRIPTION="Ubuntu 16.04.6 LTS"
NAME="Ubuntu"
VERSION="16.04.6 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.6 LTS"
VERSION_ID="16.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"

gcc --version
gcc (GCC) 5.4.0
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

uname -r
4.4.0-142-generic

cat /proc/version
Linux version 4.4.0-142-generic (buildd@lgw01-amd64-033) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.10) ) #168-Ubuntu SMP Wed Jan 16 21:00:45 UTC 2019

dpkg -l | grep nvidia
ii  dgx-peer-mem-loader                             1.1-10                                        amd64        Ensure nvidia is loaded before nv_peer_mem
