Ubuntu 18.04.3: Need help with ROCm TensorFlow: build error

I have a fresh install of Ubuntu 18.04.3 and am trying to install tensorflow-rocm (the build for AMD GPUs), version 1.14.0.

By default, pip3 install tensorflow-rocm installs v2.0, but the code I am using was written for 1.14, so running it on v2.0 produces errors, mainly because of how packages were moved around.

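So I found the source code for tensorflow-rocm v1.14.0, but building it fails with an error and I don't know why. I checked whether ROCm is installed on my system, and according to the official website's instructions, it is.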

The error I get is:

Starting local Bazel server and connecting to it...
ERROR: Skipping '//tensorflow/tools/pip_package:build_pip_package': error loading package 'tensorflow/tools/pip_package': Encountered error while reading extension file 'rocm/build_defs.bzl': no such package '@local_config_rocm//rocm': Traceback (most recent call last):
    File "/home/heyitsabi/tensorflow-upstream/third_party/gpus/rocm_configure.bzl", line 861
        _create_local_rocm_repository(repository_ctx)
    File "/home/heyitsabi/tensorflow-upstream/third_party/gpus/rocm_configure.bzl", line 682, in _create_local_rocm_repository
        make_copy_dir_rule(repository_ctx, name = "rccl-inclu...", <2 more arguments>)
    File "/home/heyitsabi/tensorflow-upstream/third_party/gpus/cuda_configure.bzl", line 923, in make_copy_dir_rule
        _read_dir(repository_ctx, src_dir)
    File "/home/heyitsabi/tensorflow-upstream/third_party/gpus/cuda_configure.bzl", line 956, in _read_dir
        _execute(repository_ctx, ["find", src_dir, ..."], ...)
    File "/home/heyitsabi/tensorflow-upstream/third_party/gpus/cuda_configure.bzl", line 887, in _execute
        auto_configure_fail("\n".join([error_msg.strip() if ... ""]))
    File "/home/heyitsabi/tensorflow-upstream/third_party/gpus/cuda_configure.bzl", line 324, in auto_configure_fail
        fail(("\n%sCuda Configuration Error:%...)))

Cuda Configuration Error: Repository command failed
find: ‘/opt/rocm/rccl/include’: No such file or directory

WARNING: Target pattern parsing failed.
ERROR: error loading package 'tensorflow/tools/pip_package': Encountered error while reading extension file 'rocm/build_defs.bzl': no such package '@local_config_rocm//rocm': Traceback (most recent call last):
    File "/home/heyitsabi/tensorflow-upstream/third_party/gpus/rocm_configure.bzl", line 861
        _create_local_rocm_repository(repository_ctx)
    File "/home/heyitsabi/tensorflow-upstream/third_party/gpus/rocm_configure.bzl", line 682, in _create_local_rocm_repository
        make_copy_dir_rule(repository_ctx, name = "rccl-inclu...", <2 more arguments>)
    File "/home/heyitsabi/tensorflow-upstream/third_party/gpus/cuda_configure.bzl", line 923, in make_copy_dir_rule
        _read_dir(repository_ctx, src_dir)
    File "/home/heyitsabi/tensorflow-upstream/third_party/gpus/cuda_configure.bzl", line 956, in _read_dir
        _execute(repository_ctx, ["find", src_dir, ..."], ...)
    File "/home/heyitsabi/tensorflow-upstream/third_party/gpus/cuda_configure.bzl", line 887, in _execute
        auto_configure_fail("\n".join([error_msg.strip() if ... ""]))
    File "/home/heyitsabi/tensorflow-upstream/third_party/gpus/cuda_configure.bzl", line 324, in auto_configure_fail
        fail(("\n%sCuda Configuration Error:%...)))

Cuda Configuration Error: Repository command failed
find: ‘/opt/rocm/rccl/include’: No such file or directory

INFO: Elapsed time: 2.470s
INFO: 0 processes.
FAILED: Build did NOT complete successfully (0 packages loaded)
    currently loading: tensorflow/tools/pip_package

Source for tensorflow-rocm 1.14.0

Answer 1

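After several hours of searching for an answer, I finally found the RCCL library (if that is what it is called). The build fails because /opt/rocm/rccl/include does not exist, which means RCCL is not installed. If it had been listed as a requirement in the ROCm installation guide, I would have known; sadly, either I missed it completely or it isn't there.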
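The build log names one missing directory, /opt/rocm/rccl/include. A minimal sketch for checking it before running Bazel again; the check_rccl_include helper and the ROCM_PATH override are names made up for this example, not part of ROCm or TensorFlow:

```shell
# Hypothetical helper: report whether the RCCL include directory exists.
# $1 (optional) is the ROCm prefix; ROCM_PATH is an assumed override.
# /opt/rocm is the prefix that appears in the error message.
check_rccl_include() {
    rocm_root="${1:-${ROCM_PATH:-/opt/rocm}}"
    if [ -d "$rocm_root/rccl/include" ]; then
        echo "found: $rocm_root/rccl/include"
    else
        echo "missing: $rocm_root/rccl/include"
        return 1
    fi
}
```

Calling check_rccl_include with no arguments checks the default /opt/rocm prefix; a "missing" result matches the find error shown in the question.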

Clone RCCL from git,

then install it from inside the cloned directory with

sudo ./install.sh -i

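My tensorflow package is now building. If any other errors come up, they are most likely caused by something other than the issue above, which is why I am sharing this answer.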

By the way, a plain ./install.sh -i ended with an error saying it could not generate the required files because it lacked access permissions, so I had to use sudo.
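The steps above could be wrapped in one function, roughly as follows. The repository URL is left as an argument because the original link did not survive here; install_rccl and the rccl checkout directory name are hypothetical. Only ./install.sh -i and the need for sudo come from the answer itself.

```shell
# Hypothetical wrapper around the steps in this answer.
# $1: URL of the RCCL git repository (not given here; supply your own).
install_rccl() {
    repo_url="$1"
    if [ -z "$repo_url" ]; then
        echo "usage: install_rccl <rccl-git-url>" >&2
        return 2
    fi
    # Clone into ./rccl; stop if the clone fails.
    git clone "$repo_url" rccl || return 1
    # Run in a subshell so the caller's working directory is unchanged.
    # sudo is used because a plain ./install.sh -i failed with a permission error.
    ( cd rccl && sudo ./install.sh -i )
}
```

Nothing runs until the function is called with a URL, so the block is safe to paste into a shell first and invoke afterwards.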

RCCL package
