cknowledge.org/ai: Crowdsourcing benchmarking and optimisation of AI

[PUBLIC] Benchmarking Caffe and TensorRT on NVIDIA Jetson TX1

NB: The Caffe experimental results are released with approval from General Motors. The TensorRT 1.0 EA experimental results are released with approval from NVIDIA.

The Jupyter notebook (view on github.com; view on nbviewer.jupyter.org) in this Collective Knowledge repository analyses the performance (execution time and memory consumption) of Caffe and TensorRT:

on dividiti's Jetson TX1 board (official page, Phoronix review):
- CPU:
  - ARM® Cortex®-A57 architecture ("big");
  - 4 cores;
  - Max clock 1743 MHz;
- GPU:
  - Maxwell™ architecture;
  - 256 CUDA cores;
  - Max clock 998 MHz;
- RAM:
  - LPDDR4;
  - 4 GB (shared between the CPU and the GPU);
  - Max bandwidth 25.6 GB/s;
- Linux for Tegra 24.2.1;
- JetPack 2.3.1;
- CUDA Toolkit 8.0.33.

$ uname -a
Linux tegra-ubuntu 3.10.96-tegra #1 SMP PREEMPT Wed Nov 9 19:42:57 PST 2016 aarch64 aarch64 aarch64 GNU/Linux
$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.04
DISTRIB_CODENAME=xenial
DISTRIB_DESCRIPTION="Ubuntu 16.04.1 LTS"
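
For orientation, the board figures listed above can be turned into rough theoretical peaks. The sketch below is a back-of-envelope estimate, not a measurement from the notebook. It assumes that each CUDA core retires one fused multiply-add per cycle (counted as 2 flops) and that fp16 arithmetic runs at twice the fp32 rate; neither assumption is stated in this README, and the function and constant names are illustrative only.

```python
# Figures taken from the board specification above.
GPU_CUDA_CORES = 256
GPU_MAX_CLOCK_MHZ = 998
RAM_MAX_BANDWIDTH_GB_S = 25.6

# Assumptions (not stated in the README):
FLOPS_PER_CORE_PER_CYCLE = 2  # one fused multiply-add counted as 2 flops
FP16_RATE_MULTIPLIER = 2      # fp16 assumed to run at twice the fp32 rate


def peak_gflops(cores, clock_mhz, flops_per_cycle):
    """Theoretical peak in GFLOP/s for the given core count and clock."""
    return cores * clock_mhz * flops_per_cycle / 1000.0


fp32_peak = peak_gflops(GPU_CUDA_CORES, GPU_MAX_CLOCK_MHZ, FLOPS_PER_CORE_PER_CYCLE)
fp16_peak = fp32_peak * FP16_RATE_MULTIPLIER

# Arithmetic intensity (flops per byte) above which a kernel would be
# compute-bound rather than bandwidth-bound under these peak figures.
fp32_balance = fp32_peak / RAM_MAX_BANDWIDTH_GB_S

print(f"fp32 peak: {fp32_peak:.1f} GFLOP/s")
print(f"fp16 peak: {fp16_peak:.1f} GFLOP/s")
print(f"fp32 balance point: {fp32_balance:.1f} flops/byte")
```

Real workloads reach only a fraction of these peaks, so the numbers are best read as upper bounds when comparing the fp16 and fp32 configurations.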

using 6 Caffe libraries, listed as "[tag] Branch (revision hash, date): math libraries":
- [cpu] Master (24d2f67, 28/Nov/2016): with OpenBLAS 0.2.19;
- [nvidia-cuda] NVIDIA 0.15 (1024d34, 17/Nov/2016): with cuBLAS (part of CUDA Toolkit 8.0.33);
- [nvidia-cudnn] NVIDIA 0.15 (1024d34, 17/Nov/2016): with cuDNN 5.1;
- [nvidia-fp16-cuda] NVIDIA experimental/fp16 (fca1cf4, 11/Jul/2016): with cuBLAS (part of CUDA Toolkit 8.0.33);
- [nvidia-fp16-cudnn] NVIDIA experimental/fp16 (fca1cf4, 11/Jul/2016): with cuDNN 5.1;
- [libdnn-cuda] OpenCL (b735c2d, 23/Nov/2016): with libDNN and cuBLAS (part of CUDA Toolkit 8.0.33) for fully connected layers.
NB: libDNN is not yet tuned for the TX1; it uses parameters that are optimal for the GTX 1080.
using 2 configurations of the NVIDIA TensorRT 1.0.0 EA engine:
- [tensorrt-fp16] NVIDIA TensorRT 1.0.0 EA with fp16 enabled;
- [tensorrt-fp32] NVIDIA TensorRT 1.0.0 EA with fp16 disabled.
NB: This EA ("early access") version is used in accordance with its special licensing terms: the results are released with explicit written approval from NVIDIA. The results may not be representative of the GA ("general availability") version.
using 4 CNN models:
- GoogLeNet;
- AlexNet;
- SqueezeNet 1.0;
- SqueezeNet 1.1;
with the batch size varying from 2 to 16 in steps of 2.
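
The experimental space described above can be enumerated with a short sketch. It assumes that every engine is paired with every model and every batch size; the README does not say whether all combinations were run or completed successfully. The dictionary layout and the name `experiment_grid` are illustrative and are not part of the Collective Knowledge API.

```python
from itertools import product

# Tags as listed in this README.
CAFFE_LIBS = [
    "cpu",
    "nvidia-cuda",
    "nvidia-cudnn",
    "nvidia-fp16-cuda",
    "nvidia-fp16-cudnn",
    "libdnn-cuda",
]
TENSORRT_CONFIGS = ["tensorrt-fp16", "tensorrt-fp32"]
MODELS = ["GoogLeNet", "AlexNet", "SqueezeNet 1.0", "SqueezeNet 1.1"]

# Batch sizes 2, 4, ..., 16 (range() excludes the stop value, hence 17).
BATCH_SIZES = list(range(2, 17, 2))


def experiment_grid():
    """Return one record per (engine, model, batch size) combination.

    Assumption: the full Cartesian product is of interest.
    """
    engines = CAFFE_LIBS + TENSORRT_CONFIGS
    return [
        {"engine": engine, "model": model, "batch_size": batch_size}
        for engine, model, batch_size in product(engines, MODELS, BATCH_SIZES)
    ]


grid = experiment_grid()
print(len(grid))  # 8 engines x 4 models x 8 batch sizes
```

Under that assumption, the grid holds 8 × 4 × 8 = 256 points, each of which yields an execution time and a memory consumption measurement.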

dividiti / ck-caffe-nvidia-tx1
