Docker builds failing on Jetson NX
javadan opened this issue
Hi, I'm not sure if this is meant to work on the NX too, or if you've only tested on Nanos?
I've cloned the repo to the NX, and tried building with
make tensorflow_opencv
and then
make cudnn_tensorflow_opencv
The first gave:
Step 24/42 : RUN curl -s -Lo /usr/local/bin/bazel https://github.com/bazelbuild/bazelisk/releases/download/v${LATEST_BAZELISK}/bazelisk-linux-amd64 && chmod +x /usr/local/bin/bazel && mkdir -p /usr/local/src/tensorflow && cd /usr/local/src && wget -q --no-check-certificate -c https://github.com/tensorflow/tensorflow/archive/v${CTO_TENSORFLOW_VERSION}.tar.gz -O - | tar --strip-components=1 -xz -C /usr/local/src/tensorflow && cd /usr/local/src/tensorflow && fgrep _TF_MAX_BAZEL configure.py | grep '=' | perl -ne '$lb="'${LATEST_BAZEL}'";$brv=$1 if (m%\=\s+.([\d\.]+).$+%); sub numit{@g=split(m%\.%,$_[0]);return(1000000*$g[0]+1000*$g[1]+$g[2]);}; if (&numit($brv) > &numit($lb)) { print "$lb" } else {print "$brv"};' > .bazelversion && bazel clean && chmod +x /tmp/tf_build.sh && time /tmp/tf_build.sh ${CTO_TF_CUDNN} ${CTO_TF_OPT} && time ./bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg && time pip3 install /tmp/tensorflow_pkg/tensorflow-*.whl && rm -rf /usr/local/src/tensorflow /tmp/tensorflow_pkg /tmp/bazel_check.pl /tmp/tf_build.sh /tmp/hsperfdata_root /root/.cache/bazel /root/.cache/pip /root/.cache/bazelisk
---> Running in 3059bbf7dc96
/bin/sh: 1: bazel: Exec format error
The command '/bin/sh -c curl -s -Lo /usr/local/bin/bazel https://github.com/bazelbuild/bazelisk/releases/download/v${LATEST_BAZELISK}/bazelisk-linux-amd64 && chmod +x /usr/local/bin/bazel && mkdir -p /usr/local/src/tensorflow && cd /usr/local/src && wget -q --no-check-certificate -c https://github.com/tensorflow/tensorflow/archive/v${CTO_TENSORFLOW_VERSION}.tar.gz -O - | tar --strip-components=1 -xz -C /usr/local/src/tensorflow && cd /usr/local/src/tensorflow && fgrep _TF_MAX_BAZEL configure.py | grep '=' | perl -ne '$lb="'${LATEST_BAZEL}'";$brv=$1 if (m%\=\s+.([\d\.]+).$+%); sub numit{@g=split(m%\.%,$_[0]);return(1000000*$g[0]+1000*$g[1]+$g[2]);}; if (&numit($brv) > &numit($lb)) { print "$lb" } else {print "$brv"};' > .bazelversion && bazel clean && chmod +x /tmp/tf_build.sh && time /tmp/tf_build.sh ${CTO_TF_CUDNN} ${CTO_TF_OPT} && time ./bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg && time pip3 install /tmp/tensorflow_pkg/tensorflow-*.whl && rm -rf /usr/local/src/tensorflow /tmp/tensorflow_pkg /tmp/bazel_check.pl /tmp/tf_build.sh /tmp/hsperfdata_root /root/.cache/bazel /root/.cache/pip /root/.cache/bazelisk' returned a non-zero code: 2
Makefile:202: recipe for target 'actual_build' failed
make[3]: *** [actual_build] Error 2
make[3]: Leaving directory '/home/chicken/datamachines/cuda_tensorflow_opencv'
Makefile:145: recipe for target 'build_prep' failed
make[2]: *** [build_prep] Error 2
make[2]: Leaving directory '/home/chicken/datamachines/cuda_tensorflow_opencv'
Makefile:138: recipe for target 'tensorflow_opencv-1.15.5_3.4.14' failed
make[1]: *** [tensorflow_opencv-1.15.5_3.4.14] Error 2
make[1]: Leaving directory '/home/chicken/datamachines/cuda_tensorflow_opencv'
Makefile:132: recipe for target 'tensorflow_opencv' failed
make: *** [tensorflow_opencv] Error 2
And the second gave:
Step 4/43 : RUN apt-get update -y --fix-missing && apt-get install -y --no-install-recommends apt-utils locales wget ca-certificates && apt-get clean
---> Running in 52221b94691a
standard_init_linux.go:211: exec user process caused "exec format error"
The command '/bin/sh -c apt-get update -y --fix-missing && apt-get install -y --no-install-recommends apt-utils locales wget ca-certificates && apt-get clean' returned a non-zero code: 1
Makefile:202: recipe for target 'actual_build' failed
make[3]: *** [actual_build] Error 1
make[3]: Leaving directory '/home/chicken/datamachines/cuda_tensorflow_opencv'
Makefile:145: recipe for target 'build_prep' failed
make[2]: *** [build_prep] Error 2
make[2]: Leaving directory '/home/chicken/datamachines/cuda_tensorflow_opencv'
Makefile:141: recipe for target 'cudnn_tensorflow_opencv-9.2_1.15.5_3.4.14' failed
make[1]: *** [cudnn_tensorflow_opencv-9.2_1.15.5_3.4.14] Error 2
make[1]: Leaving directory '/home/chicken/datamachines/cuda_tensorflow_opencv'
Makefile:135: recipe for target 'cudnn_tensorflow_opencv' failed
make: *** [cudnn_tensorflow_opencv] Error 2
Did I miss a step?
First, yes, we have only tested on JetsonNano; I do not have an NX for testing the build.
That said, looking at the make result, it appears that you are running the amd64 version of the code, using the Makefile in datamachines/cuda_tensorflow_opencv.
If you look in the directory tree, you will see a JetsonNano directory that contains the appropriate Makefile.
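As an illustration of the point above, here is a minimal sketch for checking which Makefile matches the host. Jetson boards report aarch64 from uname -m, while the top-level Makefile targets amd64, which is consistent with the "Exec format error" seen when the amd64 bazelisk binary is executed. The function name pick_build_dir is hypothetical, and the mapping assumes only the directory layout described in this thread.

```shell
#!/bin/sh
# Map a machine architecture string to the directory whose Makefile to use.
# Assumption: "." is the repository root (amd64 build) and "JetsonNano" is
# the subdirectory mentioned in this thread. pick_build_dir is a hypothetical helper.
pick_build_dir() {
    case "$1" in
        aarch64|arm64) echo "JetsonNano" ;;
        x86_64|amd64)  echo "." ;;
        *)             echo "unsupported" ;;
    esac
}

# Report what the current host looks like and where to run make from.
arch="$(uname -m)"
echo "Detected architecture: ${arch}"
echo "Suggested build directory: $(pick_build_dir "${arch}")"
```

On an NX this should suggest the JetsonNano directory, so make would be run from there rather than from the repository root.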
According to https://developer.nvidia.com/embedded/jetpack, the JetPack version supports the NX as well, so you should be able to build it without a problem.
You might also be able to directly use the DockerHub ready version to test (linked in the JetsonNano's README.md).
Hoping this helps
Great, thank you for the useful project.
I have managed to get it working now with the Docker Hub solution, and I'll carry on with that method.
I couldn't build it myself, however. Maybe the PyTorch installation details have changed. Anyway, I'm sorted for now, thanks.
$/datamachines/cuda_tensorflow_opencv/JetsonNano$ make jetsonnano-cuda_tensorflow_opencv-10.2_2.3_4.5.1
Step 25/41 : RUN pip3 install -U jupyter
---> Using cache
---> feabcd9dbb0b
Step 26/41 : RUN cd /tmp && wget -q --no-check-certificate https://nvidia.box.com/shared/static/cs3xn3td6sfgtene6jdvsxlr366m2dhq.whl -O torch-1.7.0-cp36-cp36m-linux_aarch64.whl && pip3 install torch-1.7.0-cp36-cp36m-linux_aarch64.whl && rm -rf /root/.cache/pip torch-1.7.0-cp36-cp36m-linux_aarch64.whl
---> Using cache
---> e2827fb36265
Step 27/41 : RUN mkdir -p /usr/local/src/torchvision && wget -q --no-check-certificate https://github.com/pytorch/vision/archive/v0.8.2.tar.gz -O - | tar --strip-components=1 -xz -C /usr/local/src/torchvision && cd /usr/local/src/torchvision && python3 setup.py install && rm -rf /root/.cache/pip /usr/local/src/torchvision
---> Running in 48b59dbde63c
Traceback (most recent call last):
File "setup.py", line 12, in <module>
import torch
File "/usr/local/lib/python3.6/dist-packages/torch/__init__.py", line 189, in <module>
_load_global_deps()
File "/usr/local/lib/python3.6/dist-packages/torch/__init__.py", line 142, in _load_global_deps
ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL)
File "/usr/lib/python3.6/ctypes/__init__.py", line 348, in __init__
self._handle = _dlopen(self._name, mode)
OSError: libcurand.so.10: cannot open shared object file: No such file or directory
The command '/bin/sh -c mkdir -p /usr/local/src/torchvision && wget -q --no-check-certificate https://github.com/pytorch/vision/archive/v0.8.2.tar.gz -O - | tar --strip-components=1 -xz -C /usr/local/src/torchvision && cd /usr/local/src/torchvision && python3 setup.py install && rm -rf /root/.cache/pip /usr/local/src/torchvision' returned a non-zero code: 1
Makefile:95: recipe for target 'actual_build' failed
make[2]: *** [actual_build] Error 1
make[2]: Leaving directory '/home/chicken/datamachines/cuda_tensorflow_opencv/JetsonNano'
Makefile:68: recipe for target 'build_prep' failed
make[1]: *** [build_prep] Error 2
make[1]: Leaving directory '/home/chicken/datamachines/cuda_tensorflow_opencv/JetsonNano'
Makefile:65: recipe for target 'jetsonnano-cuda_tensorflow_opencv-10.2_2.3_4.5.1' failed
make: *** [jetsonnano-cuda_tensorflow_opencv-10.2_2.3_4.5.1] Error 2
Glad it is working for you.
I see in the log "libcurand.so.10: cannot open shared object file: No such file or directory", which is strange.
When I get a chance, I will investigate further.
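For anyone investigating the same error, a minimal diagnostic sketch follows. It checks whether a shared library is known to the dynamic linker cache, and could be run inside the build container or on the host. The helper name check_lib is hypothetical. One possible cause, not confirmed in this thread, is that on Jetson the CUDA libraries are provided by the host through the NVIDIA container runtime, which may not be active during docker build.

```shell
#!/bin/sh
# Report whether a shared library name appears in the dynamic linker cache.
# check_lib is a hypothetical helper; it prints "found" or "missing".
# If ldconfig is unavailable, the library is reported as "missing".
check_lib() {
    if ldconfig -p 2>/dev/null | grep -qF "$1"; then
        echo "found"
    else
        echo "missing"
    fi
}

# The library named in the traceback above.
echo "libcurand.so.10: $(check_lib libcurand.so.10)"
```

If this prints "missing" inside the container but "found" on the host, the library is not being made visible to the build step, which would match the OSError raised when torch is imported.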
Closing this issue, given you are able to do what you need :)