tensorflow / tensorflow

An Open Source Machine Learning Framework for Everyone

Home Page: https://tensorflow.org


tensorflow for Nvidia TX1

jmtatsch opened this issue · comments

Hello,

@maxcuda recently got tensorflow running on the TK1, as documented in the blog post http://cudamusing.blogspot.de/2015/11/building-tensorflow-for-jetson-tk1.html, but has since been unable to build it repeatably. I am now trying to get tensorflow running on a TX1 Tegra platform and need some support.

Much of the trouble seems to come from Eigen's variadic templates and C++11 initializer lists, both of which should work according to http://devblogs.nvidia.com/parallelforall/cplusplus-11-in-cuda-variadic-templates/.
In theory -std=c++11 should be set according to the crosstool. Nevertheless, nvcc happily crashes on all of them. This smells as if the "-std=c++11" flag is not properly set.
How can I verify/enforce this?
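One thing I can try to verify it: feed nvcc a tiny probe file that refuses to compile unless C++11 mode and variadic templates are both active. A minimal sketch (file name and contents are mine, purely illustrative), compiled with nvcc -std=c++11 variadic_check.cu:

// variadic_check.cu
#if __cplusplus < 201103L
#error "-std=c++11 is not in effect"
#endif

#include <cstdio>

// A trivial variadic template; if nvcc accepts this translation unit,
// C++11 mode and basic variadic-template support are both working.
template <typename... Args>
int count_args(const Args&... args) {
  return sizeof...(args);
}

int main() {
  std::printf("args: %d\n", count_args(1, 2.0, 'x'));
  return 0;
}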

Also, in tensorflow.bzl, variadic templates in Eigen are said to be disabled:
"We have to disable variadic templates in Eigen for NVCC even though std=c++11 are enabled"
Is that still necessary?

Here is my build workflow:

git clone --recurse-submodules git@github.com:jmtatsch/tensorflow.git
cd tensorflow
grep -Rl "lib64" | xargs sed -i 's/lib64/lib/g' # no lib64 for tx1 yet
./configure
bazel build -c opt --local_resources 2048,0.5,1.0 --verbose_failures --config=cuda //tensorflow/cc:tutorials_example_trainer

Are you using JetPack 2?

No, JetPack does not support running directly on the L4T platform.


I meant: have you flashed the board with JetPack 2 to get CUDA 7 support?

Ah, yes, I have CUDA 7 support and used JetPack 2. To be more precise, the target is not actually a Jetson TX1 but a repurposed Nvidia Shield TV flashed with L4T 23.1 for Jetson.

I think there is a TX1 that I could use to take a look. I'll see what I can do.

In theory, can TensorFlow run usefully on the TK1? Or is the 2G memory too small for, say, face verification?

@robagar It all depends on how large your network is and whether you intend to train the model on TK1 or just run inference. Two GB of memory is plenty to run inference on almost any model.

I have worked around an issue that prevented nvcc from compiling the Eigen codebase on Tegra X1 (https://bitbucket.org/eigen/eigen/commits/d0950ac79c0404047379eb5a927a176dbb9d12a5).
However, so far I haven't succeeded in setting up bazel on the Tegra X1, so I haven't been able to start working on the other issues reported in http://cudamusing.blogspot.de/2015/11/building-tensorflow-for-jetson-tk1.html

That's good news ;) What's the problem with bazel? maxcuda's instructions for building bazel worked quite well for me...

For building bazel I had to use a special Java build which can cope with the 32-bit rootfs on a 64-bit machine:

wget http://www.java.net/download/jdk8u76/archive/b02/binaries/jdk-8u76-ea-bin-b02-linux-arm-vfp-hflt-04_jan_2016.tar.gz
sudo tar -zxvf jdk-8u76-ea-bin-b02-linux-arm-vfp-hflt-04_jan_2016.tar.gz -C /usr/lib/jvm
sudo update-alternatives --install "/usr/bin/java" "java" "/usr/lib/jvm/jdk1.8.0_76/bin/java" 1
sudo update-alternatives --config java

There seems to be one Eigen issue I can't get around:

bazel build -c opt --local_resources 2048,0.5,1.0 --verbose_failures --config=cuda //tensorflow/cc:tutorials_example_trainer
WARNING: Sandboxed execution is not supported on your system and thus hermeticity of actions cannot be guaranteed. See http://bazel.io/docs/bazel-user-manual.html#sandboxing for more information. You can turn off this warning via --ignore_unsupported_sandboxing.
INFO: Found 1 target...
INFO: From Compiling tensorflow/core/kernels/cross_op_gpu.cu.cc:
At end of source: warning: routine is both "inline" and "noinline"

external/eigen_archive/eigen-eigen-c5e90d9e764e/unsupported/Eigen/CXX11/src/Tensor/TensorEvaluator.h(125): warning: routine is both "inline" and "noinline"

At end of source: warning: routine is both "inline" and "noinline"

external/eigen_archive/eigen-eigen-c5e90d9e764e/unsupported/Eigen/CXX11/src/Tensor/TensorEvaluator.h(125): warning: routine is both "inline" and "noinline"

./tensorflow/core/lib/strings/strcat.h(195): internal error: assertion failed at: "/dvs/p4/build/sw/rel/gpu_drv/r346/r346_00/drivers/compiler/edg/EDG_4.9/src/decl_inits.c", line 3251


1 catastrophic error detected in the compilation of "/tmp/tmpxft_0000682d_00000000-8_cross_op_gpu.cu.cpp4.ii".
Compilation aborted.
Aborted
ERROR: /opt/tensorflow/tensorflow/core/BUILD:331:1: output 'tensorflow/core/_objs/gpu_kernels/tensorflow/core/kernels/cross_op_gpu.cu.o' was not created.
ERROR: /opt/tensorflow/tensorflow/core/BUILD:331:1: not all outputs were created.
Target //tensorflow/cc:tutorials_example_trainer failed to build
INFO: Elapsed time: 2271.358s, Critical Path: 2260.25s

Can you have a look at TensorEvaluator.h please?

I still haven't been able to install bazel. That said, the assertion you're facing seems to be triggered by the variadic template at line 195 of ./tensorflow/core/lib/strings/strcat.h. I would just comment out this code and see how it goes.

When you say maxcuda has "been unable to repeatedly build it" since then, does that mean that tensorflow is no longer working on the TK1? Because I just ordered the TK1 with the express purpose of being able to run tensorflow :-/

Yes, I have been unable to recompile the latest versions. The wheel I built around Thanksgiving should still work but it is quite an old version.

Commenting out the variadic template at line 195 helps a little, but at line 234 there is another template that seems to be required. Any hints on how to rewrite it in an nvcc-friendly manner?

@benoitsteiner
Any suggestions on how this could be rewritten in an nvcc-compatible manner?

// Support 5 or more arguments
template <typename... AV>
inline void StrAppend(string *dest, const AlphaNum &a, const AlphaNum &b,
                      const AlphaNum &c, const AlphaNum &d, const AlphaNum &e,
                      const AV &... args) {
  internal::AppendPieces(dest,
                         {a.Piece(), b.Piece(), c.Piece(), d.Piece(), e.Piece(),
                          static_cast<const AlphaNum &>(args).Piece()...});
}

Hi folks, I'm also working on building everything from scratch on the TX1. There have been lots of discussions here and on the Nvidia developer forums, but so far I haven't seen any well-summarized instructions besides the TK1 ones. Can we start another repo or script file so people can work on this more efficiently?

Imho we first have to solve the fundamental issue of the variadic templates not working with nvcc. Either the developers would have to do without those templates, which is backwards and probably not going to happen, or Nvidia has to step up and make nvcc more compatible. In theory nvcc should already be able to deal with your own variadic templates, but external headers, e.g. the STL's, won't "just work" because of the need to annotate all functions called on the device with __host__ __device__. Maybe someone knows a good way to get around this issue...
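To illustrate the annotation problem with a minimal sketch (my own, not code from TensorFlow or Eigen): nvcc rejects calls from device code into plain host functions, and plain host functions are exactly what the STL headers provide.

#include <algorithm>

__device__ int device_min(int a, int b) {
  // return std::min(a, b);  // error: calling a __host__ function from
                             // a __device__ function is not allowed
  return a < b ? a : b;      // fine: no host call involved
}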

@jmtatsch At the moment, the version of CUDA that ships with the Tegra X1 has problems with variadic templates. Nvidia is aware of this and working on a fix. I updated Eigen a few weeks ago to disable the use of variadic templates when compiling on Tegra X1, and that seems to fix the bulk of the problem. However, StrCat and StrAppend still rely on variadic templates. Until Nvidia releases a fix, the best solution is to comment out the variadic versions of StrCat and StrAppend and create non-variadic versions with up to 11 arguments (since that's what TensorFlow currently needs); one such overload is sketched below.
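A sketch of the six-argument case (illustrative only, reusing the string/AlphaNum/internal::AppendPieces names from strcat.h); the same pattern repeats for each arity TensorFlow actually calls:

// Non-variadic replacement for the six-argument StrAppend.
inline void StrAppend(string *dest, const AlphaNum &a, const AlphaNum &b,
                      const AlphaNum &c, const AlphaNum &d, const AlphaNum &e,
                      const AlphaNum &f) {
  internal::AppendPieces(dest, {a.Piece(), b.Piece(), c.Piece(), d.Piece(),
                                e.Piece(), f.Piece()});
}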
There are a couple of ways to avoid the STL issues. A brittle solution is to only compile optimized kernels: the compiler then inlines the STL code, at which point the lack of __host__ __device__ annotations doesn't matter since there is no function call to resolve. A better solution is to replace all the STL functionality with custom code. We've started to do this in Eigen by reimplementing most of the STL functions we need in the Eigen::numext namespace. This is tedious but much more reliable than relying on inlining to bypass the problem.
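The numext approach looks roughly like the sketch below (an illustration of the pattern, not Eigen's actual source): reimplement the few STL functions you need with explicit device annotations, so no host-only code is ever called from a kernel.

// Annotate with __host__ __device__ under nvcc, expand to nothing otherwise.
#ifdef __CUDACC__
#define DEVICE_FUNC_SKETCH __host__ __device__
#else
#define DEVICE_FUNC_SKETCH
#endif

namespace numext_sketch {
// Drop-in replacement for std::min that is callable from device code.
template <typename T>
DEVICE_FUNC_SKETCH inline const T& mini(const T& a, const T& b) {
  return b < a ? b : a;
}
}  // namespace numext_sketch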

I have a build of TF 0.8 but it requires a new 7.0 compiler that is not yet available to the general public.
I am building a wheel on a Jetson TK1, I will make it available after some testing.
I will update the instructions on how to build from source on cudamusing.

Good work @maxcuda! Will it build on the TX1 too?

Yes, it will build on the TX1 too. I fixed a problem with the new memory allocator to take into account the 32-bit OS. Some basic tests are passing, but the label_image test is giving wrong results, so there may be some other places with 32-bit issues.

@benoitsteiner, with the new compiler your change to Eigen is not required anymore (and it forces one to edit a bunch of files). Could you please remove the check and re-enable variadic templates?

@maxcuda Where can I download the new cuda compiler? I'd like to make sure that I don't introduce new problems when I enable variadic templates again.

@maxcuda is the new 7.0 compiler you were referencing part of Jetpack 2.2 that was just released?

Yes, you can get it with:
wget http://developer.download.nvidia.com/embedded/L4T/r24_Release_v1.0/CUDA/cuda-repo-l4t-7-0-local_7.0-76_armhf.deb

The good news is that I was able to build v0.8, but some of the results are incorrect. I will update the blog with the changes. With v0.9 I had a problem with the cudnn.cc file; it looks like it cannot handle cuDNN v2.

Thanks so much. Looking forward to your post so I can get tensorflow running on the TX1

I updated my build instructions on cudamusing and also posted a wheel file.

Has anyone tested this on the Jetson TX1? I can't seem to get bazel to build on aarch64.

@syed-ahmed I tested it on the TX1. This is my configuration.

  • Cuda Toolkit 7.0, JetPack 2.2(32bit)
  • Bazel 0.2.1
  • jdk-8u76-ea-bin-b02-linux-arm-vfp-hflt-04_jan_2016.tar.gz
  • ./configure : compute capability 5.3
  • bazel option : --local_resources 2048,2.0,1.0

@syed-ahmed I got it to build on an aarch64 TX1. I mostly followed the instructions for the TK1 at cudamusing.blogspot.de. The only additional things I did were

  • Added aarch64 to the ARM enum in /bazel/src/main/java/com/google/devtools/build/lib/util/CPU.java by changing line 28 to "ARM("arm", ImmutableSet.of("arm", "armv7l", "aarch64"))," without quotes
  • Added aarch64 as a valid ARM machine type in /bazel/scripts/bootstrap/buildenv.sh by changing line 35 to "if [ "${MACHINE_TYPE}" = 'arm' -o "${MACHINE_TYPE}" = 'armv7l' -o "${MACHINE_TYPE}" = 'aarch64' ]; then" without quotes

Or, if you prefer, here is the bazel executable for aarch64 I ended up with: https://drive.google.com/file/d/0B8Gc_oVaYC7CWEhOMHJhc0hLY0U/view?usp=sharing

Maybe make a PR against bazel?


@tylerfox Thank you! I'll try your suggestions. In the meantime, any thoughts on this: bazelbuild/bazel#1264 and @wtfuzz's change for cc_configure.bzl? I was getting a toolchain error, so I'm wondering if you encountered it.

Did you also build with the latest bazel release or with 0.1.4? And how about the tensorflow version - r0.8?

@syed-ahmed yes, changing buildenv.sh should fix that issue. It's also worth noting that I used bazel 0.1.4 per the instructions on cudamusing. I should probably test on the current version of bazel as well, but for now I know 0.1.4 works.

I am trying to build the tensorflow r0.9 release. I got bazel 0.2.1 installed following @tylerfox's suggestions, but I'm getting the following error when trying to build tensorflow. Any thoughts? Appreciate all the help.

>>>>> # @farmhash_archive//:configure [action 'Executing genrule @farmhash_archive//:configure [for host]']
(cd /home/ubuntu/.cache/bazel/_bazel_ubuntu/ad1e09741bb4109fbc70ef8216b59ee2/tensorflow && \
  exec env - \
    PATH=/usr/local/cuda-7.0/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/home/ubuntu/bazel/output/ \
  /bin/bash -c 'source external/bazel_tools/tools/genrule/genrule-setup.sh; pushd external/farmhash_archive/farmhash-34c13ddfab0e35422f4c3979f360635a8c050260; workdir=$(mktemp -d -t tmp.XXXXXXXXXX); cp -a * $workdir; pushd $workdir; ./configure; popd; popd; cp $workdir/config.h bazel-out/host/genfiles/external/farmhash_archive/farmhash-34c13ddfab0e35422f4c3979f360635a8c050260; rm -rf $workdir;')
ERROR: /home/ubuntu/.cache/bazel/_bazel_ubuntu/ad1e09741bb4109fbc70ef8216b59ee2/external/farmhash_archive/BUILD:5:1: Executing genrule @farmhash_archive//:configure failed: bash failed: error executing command 
  (cd /home/ubuntu/.cache/bazel/_bazel_ubuntu/ad1e09741bb4109fbc70ef8216b59ee2/tensorflow && \
  exec env - \
    PATH=/usr/local/cuda-7.0/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/home/ubuntu/bazel/output/ \
  /bin/bash -c 'source external/bazel_tools/tools/genrule/genrule-setup.sh; pushd external/farmhash_archive/farmhash-34c13ddfab0e35422f4c3979f360635a8c050260; workdir=$(mktemp -d -t tmp.XXXXXXXXXX); cp -a * $workdir; pushd $workdir; ./configure; popd; popd; cp $workdir/config.h bazel-out/host/genfiles/external/farmhash_archive/farmhash-34c13ddfab0e35422f4c3979f360635a8c050260; rm -rf $workdir;'): com.google.devtools.build.lib.shell.BadExitStatusException: Process exited with status 1.
/home/ubuntu/.cache/bazel/_bazel_ubuntu/ad1e09741bb4109fbc70ef8216b59ee2/tensorflow/external/farmhash_archive/farmhash-34c13ddfab0e35422f4c3979f360635a8c050260 /home/ubuntu/.cache/bazel/_bazel_ubuntu/ad1e09741bb4109fbc70ef8216b59ee2/tensorflow
/tmp/tmp.ZKGtjQ4mLO /home/ubuntu/.cache/bazel/_bazel_ubuntu/ad1e09741bb4109fbc70ef8216b59ee2/tensorflow/external/farmhash_archive/farmhash-34c13ddfab0e35422f4c3979f360635a8c050260 /home/ubuntu/.cache/bazel/_bazel_ubuntu/ad1e09741bb4109fbc70ef8216b59ee2/tensorflow
checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
checking for a thread-safe mkdir -p... /bin/mkdir -p
checking for gawk... no
checking for mawk... mawk
checking whether make sets $(MAKE)... yes
checking whether make supports nested variables... yes
checking build system type... /tmp/tmp.ZKGtjQ4mLO/missing: Unknown `--is-lightweight' option
Try `/tmp/tmp.ZKGtjQ4mLO/missing --help' for more information
configure: WARNING: 'missing' script is too old or missing
./config.guess: unable to guess system type

This script, last modified 2010-08-21, has failed to recognize
the operating system you are using. It is advised that you
download the most up to date version of the config scripts from

  http://git.savannah.gnu.org/gitweb/?p=config.git;a=blob_plain;f=config.guess;hb=HEAD
and
  http://git.savannah.gnu.org/gitweb/?p=config.git;a=blob_plain;f=config.sub;hb=HEAD

If the version you run (./config.guess) is already up to date, please
send the following data and any information you think might be
pertinent to <config-patches@gnu.org> in order to provide the needed
information to handle your system.

config.guess timestamp = 2010-08-21

uname -m = aarch64
uname -r = 3.10.96-tegra
uname -s = Linux
uname -v = #1 SMP PREEMPT Tue May 17 16:29:05 PDT 2016

/usr/bin/uname -p = 
/bin/uname -X     = 

hostinfo               = 
/bin/universe          = 
/usr/bin/arch -k       = 
/bin/arch              = 
/usr/bin/oslevel       = 
/usr/convex/getsysinfo = 

UNAME_MACHINE = aarch64
UNAME_RELEASE = 3.10.96-tegra
UNAME_SYSTEM  = Linux
UNAME_VERSION = #1 SMP PREEMPT Tue May 17 16:29:05 PDT 2016
configure: error: cannot guess build type; you must specify one

Does anyone know what farmhash is being used for in tensorflow r0.9? My motivation for installing tensorflow 0.9 on the Jetson TX1 is solely to use some of the fp16 ops. Hence, if farmhash is not doing anything important, maybe I could remove the farmhash-related code and build without it. Here is the farmhash commit.

Temporary sources used in the build process can be found in ~/.cache/bazel. cd to this directory and search for config.guess: find ./ -name "config.guess".
You might get several files, but the paths should give you a clue which config.guess is the one from farmhash. In my case it is ./_bazel_ubuntu/ad1e09741bb4109fbc70ef8216b59ee2/external/farmhash_archive/farmhash-34c13ddfab0e35422f4c3979f360635a8c050260/config.guess
In this file replace the line
UNAME_MACHINE=`(uname -m) 2>/dev/null` || UNAME_MACHINE=unknown
with
UNAME_MACHINE=armhf

On my machine (Nvidia Shield TV flashed to L4T 23.1) farmhash built successfully after this change.

I successfully built tensorflow on a TX1 with L4T 24.1 (64-bit), with the following patch. But running the example failed with the following kernel message:
tutorials_examp[31026]: unhandled level 1 translation fault (11) at 0xffffffffffffe8, esr 0x92000005

Maybe farmhash.BUILD with --build=arm-linux-gnu is wrong? But I failed to compile it with --build=aarch64-linux-gnu. I'm still trying to figure out what causes the runtime failure.
tx1_patch.zip

@benoitsteiner has re-enabling variadic templates been verified to work?

@shingchuang have you found the root cause of the segmentation fault issue? I have the same problem on an aarch64 platform.

I tried to re-enable variadic templates last night after upgrading the CUDA compiler using http://developer.download.nvidia.com/embedded/L4T/r24_Release_v1.0/CUDA/cuda-repo-l4t-7-0-local_7.0-76_armhf.deb. This new compiler appears to fix some of the issues, but I still get some crashes.

I noticed that Nvidia released an even more recent version of the compiler. @maxcuda, is there a debian package that I can use to install the latest version of the CUDA SDK?

Re-install / re-flash using JetPack 2.3 because the latest release also updated to Ubuntu 16.04 aarch64 in addition to CUDA 8 and L4T R24.2. The underlying CUDA version is tied to the L4T BSP in JetPack.

Hi all. I'm trying to build TensorFlow for the Google Pixel C in order to use its TX1 GPU. Do you build it on your machine (e.g. a Mac) or on the device itself (e.g. the Pixel C)? Does anyone have the already-generated files for the TX1, or can anyone point me in the right direction? Thanks.

Hi all - I haven't gotten TensorFlow r0.11 working yet, but I do have a working path to an r0.9 TensorFlow install on the TX1 with JetPack 2.3. I have tested basic MLP/LSTM/conv nets and it seems to work, though it OOMs out pretty easily on bigger convs.

I wrote down all my steps and patches below in case it's helpful to anyone. Really appreciated all the above commentary; it was critical to tracking down the right path.

http://stackoverflow.com/questions/39783919/tensorflow-on-nvidia-tx1/


@dwightcrow, I tried your solution and it works on the TX1, thank you. Version 0.11.0rc0 can also be built with bazel 0.3.2.

That's fantastic. Bazel 0.3.2 builds fairly easily on TX1?

Wondering if there's a concise summary of everything in this issue? It would definitely make it easier for others trying to get TF working on a TX1.

Following up on the request for a summary to build tensorflow on a Jetson TX1. Any help is appreciated.

The problem is that there are too many moving pieces. Each set of instructions may fail when Bazel/Protobuf/Eigen/TF are updated.

@dwightcrow Hi Dwight, at some point in the instructions you say:
"Need an edit to recognize aarch64 as ARM"

Can you please expand: edit what? Also, can we update the answer to build the latest version?

I agree with @sunils27 and @maxcuda that we need a more stable set of instructions for specific components.

Thank you very much for the effort and time to support the community.

Furthermore, if there is a stable set of build instructions, it becomes accessible to more people who can help with its upkeep when the packages mentioned by @maxcuda are updated.

I've re-enabled support for variadic templates on Tegra X1, provided that one uses JetPack 2.3 (in previous versions nvcc crashes when compiling some of the variadic templates). I haven't yet tried to compile TensorFlow itself, but this should reduce the number of code changes necessary to work around the lack of IndexList on Tegra.

While a stable set of instructions may remain elusive, one effective way of documenting a working set is to create your own fork of each of the repos and push any changes you need to make as commits on one branch per version of TF you're targeting. Then in a write-up you can refer to specific branches/commits that are known to work. You could even go a step further by creating a meta repo which has references to each of those commits; git submodules (as much as I dislike them) are one way, another is using simple scripts to automate what your write-up describes.

In other words: have a personal github fork of bazel, tensorflow, etc. and a branch called something like "topic/tf_v0.10" on each fork. Then, optionally, a new repo altogether which unifies them, and a community of folks such as we have on this thread could collaborate to push updates to it as we try different things.

Right, is anyone able to advise where those changes need to go in the bazel part of the instructions on StackOverflow? Any help is greatly appreciated. While this doesn't solve the bigger problems with getting tensorflow to work on the SATV, it does offer me (and others) the chance to get it going in the current format.

Build tensorflow r0.11 on Nvidia TX1 failed

Error message:

ERROR: .../tensorflow/core/kernels/BUILD:1096:1: C++ compilation of rule '//tensorflow/core/kernels:svd_op' failed: crosstool_wrapper_driver_is_not_gcc failed: error executing command
 ...
com.google.devtools.build.lib.shell.BadExitStatusException: Process exited with status 4.
gcc: internal compiler error: Killed (program cc1plus)
Please submit a full bug report,
with preprocessed source if appropriate.
See <file:///usr/share/doc/gcc-5.4/README.Bugs> for instructions.
Target //tensorflow/cc:tutorials_example_trainer failed to build

My build steps and environment:

Environment

  • Hardware: Nvidia TX1
  • OS: JetPack 2.3 (Ubuntu 16.04)
  • cuDNN:5.1
  • CUDA: 8

Install Java

$ sudo add-apt-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java8-installer

Install some deps

$ sudo apt-get install git zip unzip autoconf automake libtool curl zlib1g-dev maven
$ sudo apt-get install python-numpy swig python-dev python-wheel

Build protobuf

# For grpc-java build
$ git clone https://github.com/google/protobuf.git
$ cd protobuf
$ git checkout master
$ ./autogen.sh
$ git checkout v3.0.0-beta-3
$ ./autogen.sh
$ LDFLAGS=-static ./configure --prefix=$(pwd)/../
$ sed -i -e 's/LDFLAGS = -static/LDFLAGS = -all-static/' ./src/Makefile
$ make -j 4
$ make install


# For bazel build
$ git checkout v3.0.0-beta-2
$ ./autogen.sh
$ LDFLAGS=-static ./configure --prefix=$(pwd)/../
$ sed -i -e 's/LDFLAGS = -static/LDFLAGS = -all-static/' ./src/Makefile
$ make -j 4
$ cd ..

Build grpc-java compiler

$ git clone https://github.com/neo-titans/odroid.git
$ git clone https://github.com/grpc/grpc-java.git
$ cd grpc-java/
$ git checkout v0.15.0
$ patch -p0 < ../odroid/build_tensorflow/grpc-java.v0.15.0.patch
$ CXXFLAGS="-I$(pwd)/../include" LDFLAGS="-L$(pwd)/../lib" ./gradlew java_pluginExecutable -Pprotoc=$(pwd)/../bin/protoc
$ cd ..

Build bazel

$ git clone https://github.com/bazelbuild/bazel.git
$ cd bazel
$ git checkout 0.3.2
$ cp ../protobuf/src/protoc third_party/protobuf/protoc-linux-arm32.exe
$ cp ../grpc-java/compiler/build/exe/java_plugin/protoc-gen-grpc-java third_party/grpc/protoc-gen-grpc-java-0.15.0-linux-arm32.exe

Modify some files for build on aarch64

diff --git a/compile.sh b/compile.sh
index 53fc412..11035d9 100755
--- a/compile.sh
+++ b/compile.sh
@@ -27,7 +27,7 @@ cd "$(dirname "$0")"
 # Set the default verbose mode in buildenv.sh so that we do not display command
 # output unless there is a failure.  We do this conditionally to offer the user
 # a chance of overriding this in case they want to do so.
-: ${VERBOSE:=no}
+: ${VERBOSE:=yes}

 source scripts/bootstrap/buildenv.sh

diff --git a/scripts/bootstrap/compile.sh b/scripts/bootstrap/compile.sh
index 77372f0..657b254 100755
--- a/scripts/bootstrap/compile.sh
+++ b/scripts/bootstrap/compile.sh
@@ -48,6 +48,7 @@ linux)
   else
     if [ "${MACHINE_IS_ARM}" = 'yes' ]; then
       PROTOC=${PROTOC:-third_party/protobuf/protoc-linux-arm32.exe}
+      GRPC_JAVA_PLUGIN=${GRPC_JAVA_PLUGIN:-third_party/grpc/protoc-gen-grpc-java-0.15.0-linux-arm32.exe}
     else
       PROTOC=${PROTOC:-third_party/protobuf/protoc-linux-x86_32.exe}
       GRPC_JAVA_PLUGIN=${GRPC_JAVA_PLUGIN:-third_party/grpc/protoc-gen-grpc-java-0.15.0-linux-x86_32.exe}
@@ -150,7 +151,7 @@ function java_compilation() {

   run "${JAVAC}" -classpath "${classpath}" -sourcepath "${sourcepath}" \
       -d "${output}/classes" -source "$JAVA_VERSION" -target "$JAVA_VERSION" \
-      -encoding UTF-8 "@${paramfile}"
+      -encoding UTF-8 "@${paramfile}" -J-Xmx500M

   log "Extracting helper classes for $name..."
   for f in ${library_jars} ; do
diff --git a/src/main/java/com/google/devtools/build/lib/util/CPU.java b/src/main/java/com/google/devtools/build/lib/util/CPU.java
index 41af4b1..4d80610 100644
--- a/src/main/java/com/google/devtools/build/lib/util/CPU.java
+++ b/src/main/java/com/google/devtools/build/lib/util/CPU.java
@@ -26,7 +26,7 @@ public enum CPU {
   X86_32("x86_32", ImmutableSet.of("i386", "i486", "i586", "i686", "i786", "x86")),
   X86_64("x86_64", ImmutableSet.of("amd64", "x86_64", "x64")),
   PPC("ppc", ImmutableSet.of("ppc", "ppc64", "ppc64le")),
-  ARM("arm", ImmutableSet.of("arm", "armv7l")),
+  ARM("arm", ImmutableSet.of("arm", "armv7l", "aarch64")),
   UNKNOWN("unknown", ImmutableSet.<String>of());

   private final String canonicalName;
diff --git a/third_party/grpc/BUILD b/third_party/grpc/BUILD
index 2ba07e3..c7925ff 100644
--- a/third_party/grpc/BUILD
+++ b/third_party/grpc/BUILD
@@ -29,7 +29,7 @@ filegroup(
         "//third_party:darwin": ["protoc-gen-grpc-java-0.15.0-osx-x86_64.exe"],
         "//third_party:k8": ["protoc-gen-grpc-java-0.15.0-linux-x86_64.exe"],
         "//third_party:piii": ["protoc-gen-grpc-java-0.15.0-linux-x86_32.exe"],
-        "//third_party:arm": ["protoc-gen-grpc-java-0.15.0-linux-x86_32.exe"],
+        "//third_party:arm": ["protoc-gen-grpc-java-0.15.0-linux-arm32.exe"],
         "//third_party:freebsd": ["protoc-gen-grpc-java-0.15.0-linux-x86_32.exe"],
     }),
 )
diff --git a/third_party/protobuf/BUILD b/third_party/protobuf/BUILD
index 203fe51..4c2a316 100644
--- a/third_party/protobuf/BUILD
+++ b/third_party/protobuf/BUILD
@@ -28,6 +28,7 @@ filegroup(
         "//third_party:darwin": ["protoc-osx-x86_32.exe"],
         "//third_party:k8": ["protoc-linux-x86_64.exe"],
         "//third_party:piii": ["protoc-linux-x86_32.exe"],
+        "//third_party:arm": ["protoc-linux-arm32.exe"],
         "//third_party:freebsd": ["protoc-linux-x86_32.exe"],
     }),
 )
diff --git a/tools/cpp/cc_configure.bzl b/tools/cpp/cc_configure.bzl
index aeb0715..688835d 100644
--- a/tools/cpp/cc_configure.bzl
+++ b/tools/cpp/cc_configure.bzl
@@ -150,7 +150,12 @@ def _get_cpu_value(repository_ctx):
     return "x64_windows"
   # Use uname to figure out whether we are on x86_32 or x86_64
   result = repository_ctx.execute(["uname", "-m"])
-  return "k8" if result.stdout.strip() in ["amd64", "x86_64", "x64"] else "piii"
+  machine = result.stdout.strip()
+  if machine in ["arm", "armv7l", "aarch64"]:
+   return "arm"
+  elif machine in ["amd64", "x86_64", "x64"]:
+   return "k8"
+  return "piii"


 _INC_DIR_MARKER_BEGIN = "#include <...>"

compile

$ ./compile.sh 
$ cd ..

Build Tensorflow

$ git clone https://github.com/tensorflow/tensorflow.git
$ cd tensorflow
$ git checkout v0.11.0rc2

According to StackOverflow's tensorflow-on-nvidia-tx1 answer, modify the following:

diff --git a/tensorflow/core/kernels/BUILD b/tensorflow/core/kernels/BUILD
index 2e04827..9d81923 100644
--- a/tensorflow/core/kernels/BUILD
+++ b/tensorflow/core/kernels/BUILD
@@ -1184,7 +1184,7 @@ tf_kernel_libraries(
         "segment_reduction_ops",
         "scan_ops",
         "sequence_ops",
-        "sparse_matmul_op",
+        #DC "sparse_matmul_op",
     ],
     deps = [
         ":bounds_check",
diff --git a/tensorflow/core/kernels/cwise_op_gpu_select.cu.cc b/tensorflow/core/kernels/cwise_op_gpu_select.cu.cc
index 02058a8..880a0c3 100644
--- a/tensorflow/core/kernels/cwise_op_gpu_select.cu.cc
+++ b/tensorflow/core/kernels/cwise_op_gpu_select.cu.cc
@@ -43,8 +43,14 @@ struct BatchSelectFunctor<GPUDevice, T> {
     const int all_but_batch = then_flat_outer_dims.dimension(1);

 #if !defined(EIGEN_HAS_INDEX_LIST)
-    Eigen::array<int, 2> broadcast_dims{{ 1, all_but_batch }};
-    Eigen::Tensor<int, 2>::Dimensions reshape_dims{{ batch, 1 }};
+    // Eigen::array<int, 2> broadcast_dims{{ 1, all_but_batch }};
+    Eigen::array<int, 2> broadcast_dims;
+   broadcast_dims[0] = 1;
+    broadcast_dims[1] = all_but_batch;
+    // Eigen::Tensor<int, 2>::Dimensions reshape_dims{{ batch, 1 }};
+    Eigen::Tensor<int, 2>::Dimensions reshape_dims;
+   reshape_dims[0] = batch;
+   reshape_dims[1] = 1;
 #else
     Eigen::IndexList<Eigen::type2index<1>, int> broadcast_dims;
     broadcast_dims.set(1, all_but_batch);
diff --git a/tensorflow/core/kernels/sparse_tensor_dense_matmul_op_gpu.cu.cc b/tensorflow/core/kernels/sparse_tensor_dense_matmul_op_gpu.cu.cc
index a177696..28d2f59 100644
--- a/tensorflow/core/kernels/sparse_tensor_dense_matmul_op_gpu.cu.cc
+++ b/tensorflow/core/kernels/sparse_tensor_dense_matmul_op_gpu.cu.cc
@@ -104,9 +104,17 @@ struct SparseTensorDenseMatMulFunctor<GPUDevice, T, ADJ_A, ADJ_B> {
     int n = (ADJ_B) ? b.dimension(0) : b.dimension(1);

 #if !defined(EIGEN_HAS_INDEX_LIST)
-    Eigen::Tensor<int, 2>::Dimensions matrix_1_by_nnz{{ 1, nnz }};
-    Eigen::array<int, 2> n_by_1{{ n, 1 }};
-    Eigen::array<int, 1> reduce_on_rows{{ 0 }};
+    // Eigen::Tensor<int, 2>::Dimensions matrix_1_by_nnz{{ 1, nnz }};
+    Eigen::Tensor<int, 2>::Dimensions matrix_1_by_nnz;
+   matrix_1_by_nnz[0] = 1;
+   matrix_1_by_nnz[1] = nnz;
+    // Eigen::array<int, 2> n_by_1{{ n, 1 }};
+    Eigen::array<int, 2> n_by_1;
+   n_by_1[0] = n;
+   n_by_1[1] = 1;
+    // Eigen::array<int, 1> reduce_on_rows{{ 0 }};
+    Eigen::array<int, 1> reduce_on_rows;
+   reduce_on_rows[0]= 0;
 #else
     Eigen::IndexList<Eigen::type2index<1>, int> matrix_1_by_nnz;
     matrix_1_by_nnz.set(1, nnz);
diff --git a/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc b/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc
index 52256a7..1d027b9 100644
--- a/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc
+++ b/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc
@@ -888,6 +888,9 @@ CudaContext* CUDAExecutor::cuda_context() { return context_; }
 // For anything more complicated/prod-focused than this, you'll likely want to
 // turn to gsys' topology modeling.
 static int TryToReadNumaNode(const string &pci_bus_id, int device_ordinal) {
+// DC - make this clever later. ARM has no NUMA node, just return 0
+LOG(INFO) << "ARM has no NUMA node, hardcoding to return zero";
+return 0;
 #if defined(__APPLE__)
   LOG(INFO) << "OS X does not support NUMA - returning NUMA node zero";
   return 0;
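
For anyone wondering what the EIGEN_HAS_INDEX_LIST patches above work around: with index lists, Eigen encodes compile-time-constant dimensions in the type itself, while the fallback builds the shapes at runtime, and it is the brace-initialized runtime form that crashes nvcc on Tegra. A hedged sketch of the two forms (my own, assuming Eigen's unsupported Tensor module is on the include path):

#include <unsupported/Eigen/CXX11/Tensor>

void shapes_example(int all_but_batch) {
#if defined(EIGEN_HAS_INDEX_LIST)
  // Index 0 is the compile-time constant 1; only index 1 is set at runtime.
  Eigen::IndexList<Eigen::type2index<1>, int> broadcast_dims;
  broadcast_dims.set(1, all_but_batch);
#else
  // Plain runtime array, assigned element by element to avoid the
  // initializer-list constructor that nvcc mishandles on Tegra.
  Eigen::array<int, 2> broadcast_dims;
  broadcast_dims[0] = 1;
  broadcast_dims[1] = all_but_batch;
#endif
  (void)broadcast_dims;  // silence unused-variable warnings
}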

build

$ ./configure
$ bazel build -c opt --jobs 2 --local_resources 1024,4.0,1.0 --config=cuda //tensorflow/tools/pip_package:build_pip_package


@elirex I'm pretty sure you're still running out of memory even with the --local_resources flag. Try adding some swap space.

@tylerfox I tried that, but it doesn't work.

After enabling the swap space, I'd still opt for more memory and less CPU when building. Try something like bazel build -c opt --local_resources 3072,0.5,1.0 --verbose_failures --config=cuda //tensorflow/tools/pip_package:build_pip_package (as I understand it, the three --local_resources values are available RAM in MB, number of CPU cores, and I/O capability).

Note this will take several hours to build with these settings, but it's the only way I've found to work. Hope that helps.

@elirex, hi, you mentioned "Modify some files for build on aarch64", but I don't know which files need to be modified and how.
There is a similar description in tensorflow-on-nvidia-tx1: "Need an edit to recognize aarch64 as ARM".
Thanks!

@ShawnXuan - these are files in the cloned bazel repo. The change proposed on StackOverflow, for example, would be made to CPU.java as shown in the diff. You can see which additional files elirex changed by looking at their diff. Hope that helps.

@elirex Did you manage to compile ?

@piotrchmiel Yes, I successfully completed the compilation. I added 8 GB of swap space and ran bazel build -c opt --local_resources 3072,4.0,1.0 --verbose_failures --config=cuda //tensorflow/tools/pip_package:build_pip_package

While compiling, I used free -h and top to watch memory usage. TensorFlow needs about 8 GB of memory to compile.

Thank you 👍 I will try to repeat your steps :-)

Question:

For those that compiled TensorFlow 0.9 on the Jetson TX1, which options did you use during the TensorFlow ./configure step?

Error 1:

I received "Error: unexpected EOF from Bazel server" after following the steps from this StackOverflow guide on a fresh install of JetPack 2.3.

Two bazel issue responders (1, 2) suggested people use the --jobs 4 or --jobs 20 option when receiving this error, in case the error was due to a lack of memory.

I ran bazel again, this time with --jobs 4; however, I received a new error ("Error 2", below).

The remainder of the error said Contents of '/home/ubuntu/.cache/bazel/_bazel_ubuntu/(xxxx)/server/jvm.out': with no further output.

Error 2:

ERROR: /home/ubuntu/tensorflow/tensorflow/core/kernels/BUILD:309:1: C++ compilation of rule '//tensorflow/core/kernels:mirror_pad_op' failed: crosstool_wrapper_driver_is_not_gcc failed: error executing command third_party/gpus/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc -U_FORTIFY_SOURCE '-D_FORTIFY_SOURCE=1' -fstack-protector -fPIE -Wall -Wunused-but-set-parameter -Wno-free-nonheap-object ... (remaining 105 argument(s) skipped): com.google.devtools.build.lib.shell.BadExitStatusException: Process exited with status 4. gcc: internal compiler error: Killed (program cc1plus)

I didn't use bazel clean --expunge before the second attempt. Maybe that caused the error.

Plan:

  • Run bazel clean --expunge
  • Rerun bazel to create the cache folder
  • Re-add config.guess and config.sub to the cache folder
  • Create 8GB of swap space
  • Try bazel build -c opt --local_resources 3072,4.0,1.0 --verbose_failures --config=cuda //tensorflow/tools/pip_package:build_pip_package because @elirex had success with it.

It worked

Following this StackOverflow guide, but with an 8 GB swap file and using the following command, successfully built TensorFlow 0.9 on the Jetson TX1 from a fresh install of JetPack 2.3:

bazel build -c opt --local_resources 3072,4.0,1.0 --verbose_failures --config=cuda //tensorflow/tools/pip_package:build_pip_package

I used the default settings for TensorFlow's ./configure script except to enable GPU support.

My build took at least 6 hours. It'll be faster if you use an SSD instead of a USB drive.

Thanks to Dwight Crow, @elirex, @tylerfox, everyone that helped them, and everyone in this thread for spending time on this problem.

Creating a swap file

# Create a swapfile for Ubuntu at the current directory location
fallocate -l *G swapfile
# List out the file
ls -lh swapfile
# Change permissions so that only root can use it
chmod 600 swapfile
# List out the file
ls -lh swapfile
# Set up the Linux swap area
mkswap swapfile
# Now start using the swapfile
sudo swapon swapfile
# Show that it's now being used
swapon -s

Adapted from JetsonHacks' gist.

I used this USB drive to store my swap file.

The most memory I saw my system use was 7.7 GB (3.8 GB on Mem and 3.9 GB on Swap). The most swap memory I saw used at once was 4.4 GB. I used free -h to view memory usage.

Creating the pip package and installing

Adapted from the TensorFlow docs:

$ bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg

# The name of the .whl file will depend on your platform.
$ pip install /tmp/tensorflow_pkg/tensorflow-0.9.0-py2-none-any.whl

I used bazel build -c opt --local_resources 1024,4.0,1.0 --jobs 4 --verbose_failures --config=cuda //tensorflow/tools/pip_package:build_pip_package, without allocating swap, to build tensorflow r0.9 on the TX1 with JetPack 2.3, and it passed compilation.

Has anyone been able to build TF r0.11 on the TX1 yet?

Thanks for all the information here - I got tensorflow r0.11.0 installed with JetPack 2.3.1 on the TX1. Following @elirex's steps, make sure to use the exact versions of protobuf, grpc and bazel. I built tensorflow r0.11.0 instead of v0.11.0rc2. When compiling, follow @MatthewKleinsmith's step to add a swap file. You need a big swap: I tried 6 GB but failed in the middle with an out-of-memory error; trying again with a 10 GB swap file worked. It took me about 5 hours to compile with the swap file allocated on a USB drive.

Is tensorflow working correctly on the TX1, i.e. able to run inference and get good results? When I installed tensorflow on a TK1 it ran just fine, however the convolutional layers were producing bad results. I could train fully connected models on MNIST just fine, but when I tried to use conv layers it stopped converging. Is this problem present in the TX1 build?

I continually get this when running ./compile.sh for Bazel:
Building Bazel from scratch
gRPC Java plugin not found in

If I pull 0.2.3 I don't get the error, only with 0.3.x

@zxwind How is TF 0.11 performance working for you on the TX1?

FYI, I've got a branch off r1.0 with some hacks to build the r1.0 release on TX1 with Jetpack 2.3.1.

In addition to the previously mentioned issues, there is a change in Eigen after the revision used on the TF r0.11 branch that causes the CUDA compiler to crash with an internal error. I changed workspace.bzl on the r1.0 branch to point to the older Eigen revision. In order for that to build, I had to remove the EXPM1 op that was added after r0.11. It's all rather ugly, but it got me up and running.

Interesting to note: with the r1.0.0a build I'm able to run inference on a Resnet50-based network at 128x96 resolution that was running out of memory on r0.11. For anyone curious about benchmark numbers, I was getting approx 15 fps with single-frame batches.

Here is a link to a tag on my clone of TF with binary wheels for anyone interested. The wheels will likely only work on JetPack 2.3.1 (L4T 24.2.1). No guarantees there aren't some serious issues, but I've verified results on the networks I'm using right now.
https://github.com/rwightman/tensorflow/releases/tag/v1.0.0-alpha-tegra-ugly_hack

Closing since @rwightman's / @MatthewKleinsmith's solution seems to work, though it's not quite a seamless out-of-the-box experience. Feel free to reopen.

@rwightman May I humbly ask you to provide another wheel for the r1.0 stable version?

@rwightman How were you able to build tensorflow without gRPC? Thanks!

Edit: never mind, I saw your repo: https://github.com/jetsonhacks/installTensorFlowTX1/

Thanks for setting that up.

@sunsided Here's the Python 3.5.2 version for TF 1.0.1 that @dkopljar and I managed to build: https://drive.google.com/open?id=0B2jw9AHXtUJ_OFJDV19TWTEyaWc

Hello all, I was able to install TensorFlow v1.0.1 on the new Jetson TX2. I had to follow similar process as mentioned above in this thread (protobuf, grpc, swapfile etc). For bazel, I downloaded bazel-0.4.5-dist.zip and applied @dtrebbien's change. Here is the pip wheel of my installation if it helps anyone. It's for Python 2.7: https://drive.google.com/file/d/0Bxl-G9VJ61mBYmZPY0hLSlFaUDg/view?usp=sharing
And here is the step-by-step procedure: https://syed-ahmed.gitbooks.io/nvidia-jetson-tx2-recipes/content/first-question.html

Hello all, I was able to install TensorFlow v1.0.1 on the Tegra X1 using the build by @Barty777.
Is there a build available for TensorFlow v1.2?

@Barty777 you wouldn't happen to have 3.6 wheels, would you? 🙏

@gvoysey Unfortunately no. :(

Here is the wheel file for TensorFlow 1.2, Nvidia TX1 and Python 2.7: https://drive.google.com/file/d/0B-Ljdh8jFZRbTnVNdGtGMHA2Ymc/view?usp=sharing

I've been able to build a tensorflow wheel for Python 3.6 for the TX1, but I cannot successfully build TensorFlow with GPU support. See https://stackoverflow.com/questions/45825708/error-building-tensorflow-gpu-1-1-0-on-nvidia-jetson-tx1-aarch64 for details.

Sorry for the late comment - can anyone please help me with setting up TensorFlow on the Nvidia TK1?