I set up a Data Science Virtual Machine for Linux (Ubuntu) on Azure and wanted to verify the GPU installation by following these TensorFlow directions. The first command shows that a Tesla M60 GPU is available:
$ lspci | grep -i nvidia
db4d:00:00.0 VGA compatible controller: NVIDIA Corporation GM204GL [Tesla M60] (rev a1)
The second command fails with a cryptic message:
$ sudo docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi
docker: Error response from daemon: OCI runtime create failed: container_linux.go:348: starting container process caused "process_linux.go:402: container init caused \"process_linux.go:385: running prestart hook 1 caused \\\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig.real --device=all --compute --utility --require=cuda>=10.0 brand=tesla,driver>=384,driver<385 --pid=31149 /data/docker/overlay2/16e2b65fa0831681029432e3936005fa2796afd6d5a50c297d6bc0693e57a0b0/merged]\\\\nnvidia-container-cli: requirement error: unsatisfied condition: driver < 385\\\\n\\\"\"": unknown.
How do I set up the machine so that it can run Nvidia Docker images?
Answer 1
According to this NVIDIA GitHub issue, this part of the error message:
--require=cuda>=10.0 brand=tesla,driver>=384,driver<385
indicates a driver problem: the image requires driver>=384 and driver<385, but the VM's driver (396.44, as nvidia-smi reports below) fails the `driver<385` condition.
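The requirement is a conjunction of version constraints, and the installed driver satisfies the first but not the second. A minimal sketch (plain Python; `satisfies` is a toy helper written for illustration, not the real nvidia-container-cli logic) makes the failing condition explicit:

```python
# Illustration only (not nvidia-container-cli's actual code): evaluate the
# version constraints from the error message against the installed driver.
def satisfies(version, constraint):
    """Toy checker for a single constraint such as '>=384' or '<385'."""
    op_len = 2 if constraint[:2] in (">=", "<=", "==") else 1
    op, bound = constraint[:op_len], float(constraint[op_len:])
    return {">=": version >= bound, "<=": version <= bound,
            "==": version == bound,
            ">": version > bound, "<": version < bound}[op]

driver = 396.44  # version reported by nvidia-smi on this VM
checks = {c: satisfies(driver, c) for c in (">=384", "<385")}
print(checks)  # {'>=384': True, '<385': False} -- hence the error
```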
Solution with Docker, but a different image
The simplest solution is to use a different Azure image: both the NVIDIA GPU Cloud Image and the NVIDIA GPU Cloud Image for Deep Learning and HPC can run that Docker image.
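On one of those images, the original check should then work as-is. Pinning the CUDA image tag is also a good idea, so that a newer `nvidia/cuda:latest` does not reintroduce a driver requirement the host cannot meet (the tag below is an assumption; pick one matching the CUDA version your driver supports):

```shell
# Tag pinned as an example only; choose one compatible with the installed driver.
sudo docker run --runtime=nvidia --rm nvidia/cuda:9.2-base nvidia-smi
```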
Solution with your image, but without Docker
Alternatively, you can keep using the Data Science Virtual Machine for Linux (Ubuntu) image, just without containerization. For example, Conda can set up a suitable environment (where the leading `yes |` answers the prompt that asks whether to install the packages):
yes | conda create -n TF python=2.7 scipy==1.0.0 tensorflow-gpu==1.8 Keras==2.1.3 pandas==0.22.0 numpy==1.14.0 matplotlib scikit-learn
export PATH=$PATH:/data/anaconda/envs/TF/bin
export PATH=$PATH:/data/anaconda/envs/py35/bin
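A side note on the two exports above: they append to PATH, so any `python` that already appears earlier in PATH still shadows the env's interpreter. A small sketch (hypothetical directories, illustration only) of how the lookup order works:

```shell
# Hypothetical PATH value mirroring the exports above (illustration only).
demo_path="/usr/bin:/data/anaconda/envs/TF/bin"
echo "$demo_path" | cut -d: -f1            # /usr/bin wins the lookup
demo_path="/data/anaconda/envs/TF/bin:$demo_path"
first=$(echo "$demo_path" | cut -d: -f1)   # prepending puts the env first
echo "$first"
```

In practice, `source activate TF` performs this prepend for you; the exports in the answer work as long as no other `python` shadows the env's one.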
These commands fetch the official models from TensorFlow:
git clone https://github.com/tensorflow/models.git
export PYTHONPATH="$PYTHONPATH:./models"
A first call to nvidia-smi shows no processes running on the GPU:
$ nvidia-smi
Mon Jan 21 16:26:02 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.44 Driver Version: 396.44 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla M60 On | 0000DB4D:00:00.0 Off | Off |
| N/A 39C P8 14W / 150W | 0MiB / 8129MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Once the official MNIST model has been running in the background for a while, you will see a process using the GPU:
$ python models/official/mnist/mnist.py &
[1] 25967
$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.44 Driver Version: 396.44 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla M60 On | 0000DB4D:00:00.0 Off | Off |
| N/A 37C P0 77W / 150W | 7851MiB / 8129MiB | 93% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 26077 C python 7840MiB |
+-----------------------------------------------------------------------------+