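I did a fresh install of Ubuntu 18.04.3 and am trying to install tensorflow-rocm (the AMD GPU build), version 1.14.0.
By default, pip3 install tensorflow-rocm installs v2.0, but the codebase I am using was written against 1.14, so running the same code on v2.0 produces errors, mainly because of how packages were moved around.
So I got the source code for tensorflow-rocm v1.14.0, but when I try to build it I get an error, and I don't know why. I checked whether ROCm is installed on my system, and going by its official website, it is.
The error I get is as follows: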
Starting local Bazel server and connecting to it...
ERROR: Skipping '//tensorflow/tools/pip_package:build_pip_package': error loading package 'tensorflow/tools/pip_package': Encountered error while reading extension file 'rocm/build_defs.bzl': no such package '@local_config_rocm//rocm': Traceback (most recent call last):
File "/home/heyitsabi/tensorflow-upstream/third_party/gpus/rocm_configure.bzl", line 861
_create_local_rocm_repository(repository_ctx)
File "/home/heyitsabi/tensorflow-upstream/third_party/gpus/rocm_configure.bzl", line 682, in _create_local_rocm_repository
make_copy_dir_rule(repository_ctx, name = "rccl-inclu...", <2 more arguments>)
File "/home/heyitsabi/tensorflow-upstream/third_party/gpus/cuda_configure.bzl", line 923, in make_copy_dir_rule
_read_dir(repository_ctx, src_dir)
File "/home/heyitsabi/tensorflow-upstream/third_party/gpus/cuda_configure.bzl", line 956, in _read_dir
_execute(repository_ctx, ["find", src_dir, ..."], ...)
File "/home/heyitsabi/tensorflow-upstream/third_party/gpus/cuda_configure.bzl", line 887, in _execute
auto_configure_fail("\n".join([error_msg.strip() if ... ""]))
File "/home/heyitsabi/tensorflow-upstream/third_party/gpus/cuda_configure.bzl", line 324, in auto_configure_fail
fail(("\n%sCuda Configuration Error:%...)))
Cuda Configuration Error: Repository command failed
find: ‘/opt/rocm/rccl/include’: No such file or directory
WARNING: Target pattern parsing failed.
ERROR: error loading package 'tensorflow/tools/pip_package': Encountered error while reading extension file 'rocm/build_defs.bzl': no such package '@local_config_rocm//rocm': Traceback (most recent call last):
File "/home/heyitsabi/tensorflow-upstream/third_party/gpus/rocm_configure.bzl", line 861
_create_local_rocm_repository(repository_ctx)
File "/home/heyitsabi/tensorflow-upstream/third_party/gpus/rocm_configure.bzl", line 682, in _create_local_rocm_repository
make_copy_dir_rule(repository_ctx, name = "rccl-inclu...", <2 more arguments>)
File "/home/heyitsabi/tensorflow-upstream/third_party/gpus/cuda_configure.bzl", line 923, in make_copy_dir_rule
_read_dir(repository_ctx, src_dir)
File "/home/heyitsabi/tensorflow-upstream/third_party/gpus/cuda_configure.bzl", line 956, in _read_dir
_execute(repository_ctx, ["find", src_dir, ..."], ...)
File "/home/heyitsabi/tensorflow-upstream/third_party/gpus/cuda_configure.bzl", line 887, in _execute
auto_configure_fail("\n".join([error_msg.strip() if ... ""]))
File "/home/heyitsabi/tensorflow-upstream/third_party/gpus/cuda_configure.bzl", line 324, in auto_configure_fail
fail(("\n%sCuda Configuration Error:%...)))
Cuda Configuration Error: Repository command failed
find: ‘/opt/rocm/rccl/include’: No such file or directory
INFO: Elapsed time: 2.470s
INFO: 0 processes.
FAILED: Build did NOT complete successfully (0 packages loaded)
currently loading: tensorflow/tools/pip_package
Answer 1
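So, after several hours of searching for an answer, I finally found that the missing piece was the RCCL library (if that is what it is called), which is what the missing /opt/rocm/rccl/include directory in the error points to. If it had been listed as a requirement in the ROCm installation guide, I would have known about it. Sadly, I either missed it completely or it isn't there.
Clone it from git,
then run
sudo ./install.sh -i
Now my tensorflow package is building. If any other errors come up, they are most likely caused by something other than the problem above, which is why I am sharing this answer.
By the way, a plain ./install.sh -i ended with an error saying it could not generate the required files because it did not have access permissions, so I had to use sudo.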
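Before re-running the build, it may help to confirm that the directory named in the error now exists. The following is only a minimal sketch: the function name rccl_headers_present and the use of a ROCM_PATH environment variable are assumptions made for illustration, and the default prefix /opt/rocm is taken from the path in the error message above.

```shell
#!/usr/bin/env bash
# Sketch: check for the RCCL include directory that the ROCm configure step
# tried to read ("find: '/opt/rocm/rccl/include': No such file or directory").

# rccl_headers_present is a hypothetical helper name used only in this sketch.
# $1: ROCm install prefix; falls back to /opt/rocm, the path seen in the error.
rccl_headers_present() {
    local rocm_root="${1:-/opt/rocm}"
    [ -d "${rocm_root}/rccl/include" ]
}

# ROCM_PATH is assumed here as an optional override; unset means /opt/rocm.
if rccl_headers_present "${ROCM_PATH:-/opt/rocm}"; then
    echo "RCCL include directory found."
else
    echo "RCCL include directory missing; install RCCL before running the build."
fi
```

If the check reports the directory as missing even after running sudo ./install.sh -i, RCCL was probably installed under a different prefix than the one the TensorFlow configure step searches.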