Docker builds failing on Jetson NX

javadan opened this issue 3 years ago

Daniel commented 3 years ago

Hi, I'm not sure if this is meant to work on the NX too, or if you've only tested on Nanos?

I've cloned the repo to the NX, and tried building with

make tensorflow_opencv
and then
make cudnn_tensorflow_opencv

The first gave:

Step 24/42 : RUN curl -s -Lo /usr/local/bin/bazel https://github.com/bazelbuild/bazelisk/releases/download/v${LATEST_BAZELISK}/bazelisk-linux-amd64   && chmod +x /usr/local/bin/bazel   && mkdir -p /usr/local/src/tensorflow   && cd /usr/local/src   && wget -q --no-check-certificate -c https://github.com/tensorflow/tensorflow/archive/v${CTO_TENSORFLOW_VERSION}.tar.gz -O - | tar --strip-components=1 -xz -C /usr/local/src/tensorflow   && cd /usr/local/src/tensorflow   && fgrep _TF_MAX_BAZEL configure.py | grep '=' | perl -ne '$lb="'${LATEST_BAZEL}'";$brv=$1 if (m%\=\s+.([\d\.]+).$+%); sub numit{@g=split(m%\.%,$_[0]);return(1000000*$g[0]+1000*$g[1]+$g[2]);}; if (&numit($brv) > &numit($lb)) { print "$lb" } else {print "$brv"};' > .bazelversion   && bazel clean   && chmod +x /tmp/tf_build.sh   && time /tmp/tf_build.sh ${CTO_TF_CUDNN} ${CTO_TF_OPT}   && time ./bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg   && time pip3 install /tmp/tensorflow_pkg/tensorflow-*.whl   && rm -rf /usr/local/src/tensorflow /tmp/tensorflow_pkg /tmp/bazel_check.pl /tmp/tf_build.sh /tmp/hsperfdata_root /root/.cache/bazel /root/.cache/pip /root/.cache/bazelisk
 ---> Running in 3059bbf7dc96
/bin/sh: 1: bazel: Exec format error
The command '/bin/sh -c curl -s -Lo /usr/local/bin/bazel https://github.com/bazelbuild/bazelisk/releases/download/v${LATEST_BAZELISK}/bazelisk-linux-amd64   && chmod +x /usr/local/bin/bazel   && mkdir -p /usr/local/src/tensorflow   && cd /usr/local/src   && wget -q --no-check-certificate -c https://github.com/tensorflow/tensorflow/archive/v${CTO_TENSORFLOW_VERSION}.tar.gz -O - | tar --strip-components=1 -xz -C /usr/local/src/tensorflow   && cd /usr/local/src/tensorflow   && fgrep _TF_MAX_BAZEL configure.py | grep '=' | perl -ne '$lb="'${LATEST_BAZEL}'";$brv=$1 if (m%\=\s+.([\d\.]+).$+%); sub numit{@g=split(m%\.%,$_[0]);return(1000000*$g[0]+1000*$g[1]+$g[2]);}; if (&numit($brv) > &numit($lb)) { print "$lb" } else {print "$brv"};' > .bazelversion   && bazel clean   && chmod +x /tmp/tf_build.sh   && time /tmp/tf_build.sh ${CTO_TF_CUDNN} ${CTO_TF_OPT}   && time ./bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg   && time pip3 install /tmp/tensorflow_pkg/tensorflow-*.whl   && rm -rf /usr/local/src/tensorflow /tmp/tensorflow_pkg /tmp/bazel_check.pl /tmp/tf_build.sh /tmp/hsperfdata_root /root/.cache/bazel /root/.cache/pip /root/.cache/bazelisk' returned a non-zero code: 2
Makefile:202: recipe for target 'actual_build' failed
make[3]: *** [actual_build] Error 2
make[3]: Leaving directory '/home/chicken/datamachines/cuda_tensorflow_opencv'
Makefile:145: recipe for target 'build_prep' failed
make[2]: *** [build_prep] Error 2
make[2]: Leaving directory '/home/chicken/datamachines/cuda_tensorflow_opencv'
Makefile:138: recipe for target 'tensorflow_opencv-1.15.5_3.4.14' failed
make[1]: *** [tensorflow_opencv-1.15.5_3.4.14] Error 2
make[1]: Leaving directory '/home/chicken/datamachines/cuda_tensorflow_opencv'
Makefile:132: recipe for target 'tensorflow_opencv' failed
make: *** [tensorflow_opencv] Error 2

And the second gave:

Step 4/43 : RUN apt-get update -y --fix-missing  && apt-get install -y --no-install-recommends     apt-utils     locales     wget     ca-certificates   && apt-get clean
 ---> Running in 52221b94691a
standard_init_linux.go:211: exec user process caused "exec format error"
The command '/bin/sh -c apt-get update -y --fix-missing  && apt-get install -y --no-install-recommends     apt-utils     locales     wget     ca-certificates   && apt-get clean' returned a non-zero code: 1
Makefile:202: recipe for target 'actual_build' failed
make[3]: *** [actual_build] Error 1
make[3]: Leaving directory '/home/chicken/datamachines/cuda_tensorflow_opencv'
Makefile:145: recipe for target 'build_prep' failed
make[2]: *** [build_prep] Error 2
make[2]: Leaving directory '/home/chicken/datamachines/cuda_tensorflow_opencv'
Makefile:141: recipe for target 'cudnn_tensorflow_opencv-9.2_1.15.5_3.4.14' failed
make[1]: *** [cudnn_tensorflow_opencv-9.2_1.15.5_3.4.14] Error 2
make[1]: Leaving directory '/home/chicken/datamachines/cuda_tensorflow_opencv'
Makefile:135: recipe for target 'cudnn_tensorflow_opencv' failed
make: *** [cudnn_tensorflow_opencv] Error 2

Did I miss a step?

Martial Michel · Answer 1 · Tue May 11 2021 07:36:18 GMT+0800 (China Standard Time)

First, yes, we have only tested on the Jetson Nano; I do not have an NX to test the build on.

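That said, looking at the make output, it appears that you are building the amd64 version of the code by running the Makefile at the top level of datamachines/cuda_tensorflow_opencv. The "Exec format error" lines are consistent with that: the build is trying to run amd64 binaries (for example bazelisk-linux-amd64) on the Jetson's ARM CPU.
If you look in the directory tree, you will see a JetsonNano directory that contains the appropriate Makefile.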
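
One way to confirm which Makefile applies is to check the CPU architecture of the machine running the build. The following is a minimal sketch in Python, assuming the Jetson reports its architecture as `aarch64` (or `arm64`); the helper name `makefile_dir_for` is hypothetical and not part of the repository.

```python
import platform


def makefile_dir_for(machine: str) -> str:
    """Return the repo subdirectory whose Makefile matches the given CPU architecture.

    Hypothetical helper: "JetsonNano" for 64-bit ARM, "." (the top-level,
    amd64 Makefile) for anything else.
    """
    arm64_names = {"aarch64", "arm64"}
    return "JetsonNano" if machine.lower() in arm64_names else "."


if __name__ == "__main__":
    # platform.machine() returns the hardware name, e.g. "x86_64" or "aarch64".
    machine = platform.machine()
    print(f"{machine}: run make from {makefile_dir_for(machine)!r}")
```

On a 64-bit ARM board this should point at the JetsonNano directory rather than the top-level Makefile.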

Looking at https://developer.nvidia.com/embedded/jetpack you should be able to build it without a problem, as the JetPack version supports the NX as well.
You might also be able to use the ready-made image from Docker Hub directly to test (linked in the JetsonNano README.md).

Hoping this helps

Daniel · Answer 2 · Thu May 13 2021 06:47:06 GMT+0800 (China Standard Time)

Great, thank you for the useful project.

I have managed to get it working now using the Docker Hub image, and I'll carry on with that method.

I couldn't build it myself, however; maybe the PyTorch installation details have changed. Anyway, I'm sorted for now, thanks.

$/datamachines/cuda_tensorflow_opencv/JetsonNano$ make jetsonnano-cuda_tensorflow_opencv-10.2_2.3_4.5.1

Step 25/41 : RUN pip3 install -U jupyter
 ---> Using cache
 ---> feabcd9dbb0b
Step 26/41 : RUN cd /tmp   && wget -q --no-check-certificate https://nvidia.box.com/shared/static/cs3xn3td6sfgtene6jdvsxlr366m2dhq.whl -O torch-1.7.0-cp36-cp36m-linux_aarch64.whl   && pip3 install torch-1.7.0-cp36-cp36m-linux_aarch64.whl   && rm -rf /root/.cache/pip torch-1.7.0-cp36-cp36m-linux_aarch64.whl
 ---> Using cache
 ---> e2827fb36265
Step 27/41 : RUN mkdir -p /usr/local/src/torchvision   && wget -q --no-check-certificate https://github.com/pytorch/vision/archive/v0.8.2.tar.gz -O - | tar --strip-components=1 -xz -C /usr/local/src/torchvision   && cd /usr/local/src/torchvision   && python3 setup.py install    && rm -rf /root/.cache/pip /usr/local/src/torchvision
 ---> Running in 48b59dbde63c
Traceback (most recent call last):
  File "setup.py", line 12, in <module>
    import torch
  File "/usr/local/lib/python3.6/dist-packages/torch/__init__.py", line 189, in <module>
    _load_global_deps()
  File "/usr/local/lib/python3.6/dist-packages/torch/__init__.py", line 142, in _load_global_deps
    ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL)
  File "/usr/lib/python3.6/ctypes/__init__.py", line 348, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: libcurand.so.10: cannot open shared object file: No such file or directory
The command '/bin/sh -c mkdir -p /usr/local/src/torchvision   && wget -q --no-check-certificate https://github.com/pytorch/vision/archive/v0.8.2.tar.gz -O - | tar --strip-components=1 -xz -C /usr/local/src/torchvision   && cd /usr/local/src/torchvision   && python3 setup.py install    && rm -rf /root/.cache/pip /usr/local/src/torchvision' returned a non-zero code: 1
Makefile:95: recipe for target 'actual_build' failed
make[2]: *** [actual_build] Error 1
make[2]: Leaving directory '/home/chicken/datamachines/cuda_tensorflow_opencv/JetsonNano'
Makefile:68: recipe for target 'build_prep' failed
make[1]: *** [build_prep] Error 2
make[1]: Leaving directory '/home/chicken/datamachines/cuda_tensorflow_opencv/JetsonNano'
Makefile:65: recipe for target 'jetsonnano-cuda_tensorflow_opencv-10.2_2.3_4.5.1' failed
make: *** [jetsonnano-cuda_tensorflow_opencv-10.2_2.3_4.5.1] Error 2

Martial Michel · Answer 3 · Thu May 13 2021 07:04:24 GMT+0800 (China Standard Time)

Glad it is working for you.

I see the error "libcurand.so.10: cannot open shared object file: No such file or directory" in the log, which is strange.
When I get a chance, I will investigate further.
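
For anyone who wants to narrow this down, the missing library can be probed with the same `ctypes.CDLL` call that fails in the traceback. This is a rough sketch, assuming it is run in the same environment as the failing build step; `can_load` is a hypothetical helper, not something from the repository or from PyTorch.

```python
import ctypes


def can_load(lib_name: str) -> bool:
    """Return True if the dynamic loader can find and open the shared library."""
    try:
        ctypes.CDLL(lib_name)
        return True
    except OSError:
        # Raised when the library is absent or cannot be opened,
        # as with libcurand.so.10 in the traceback above.
        return False


if __name__ == "__main__":
    # Library name taken from the error message in the build log.
    print("libcurand.so.10 loadable:", can_load("libcurand.so.10"))
```

If the result differs between the build step and a running container, that would suggest the CUDA libraries are not visible at build time, though nothing in this thread confirms that.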

Closing this Issue given you are able to do what you need :)