# CNN-Inference-Engine-Quick-View

A quick view of high-performance convolutional neural network (CNN) inference engines on mobile devices.

## Runtime-speed Comparisons
| Framework | Main Platform | Model Compatibility | Detection Support | Speed Benchmarks |
| --- | --- | --- | --- | --- |
| Intel-Caffe | CPU (Intel optimized) | Caffe | Y | Link |
| NCNN | CPU (ARM optimized) | Caffe / PyTorch / MXNet / ONNX | Y | Link / unofficial Link |
| FeatherCNN | CPU (ARM optimized) | Caffe | N | Link / unofficial Link |
| FeatherCNNEx | CPU (ARM optimized) | Caffe | N | Link |
| Tengine | CPU (ARM A72 optimized) | Caffe / MXNet | Y | Link |
| TensorFlow Lite | CPU (Android optimized) | Caffe2 / TensorFlow / ONNX | Y | Link |
| TensorRT | GPU (Volta optimized) | Caffe / TensorFlow / ONNX | Y | Link |
| TVM | CPU (ARM optimized) / Mali GPU / FPGA | ONNX | Y | - |
| SNPE | CPU (Qualcomm optimized) / GPU / DSP | Caffe / Caffe2 / TensorFlow / ONNX | Y | Link |
| MACE | CPU (ARM optimized) / Mali GPU / DSP | Caffe / TensorFlow / ONNX | Y | Link |
| Easy-MACE | CPU (ARM optimized) / CPU (x86 optimized) | Caffe / TensorFlow / ONNX | Y | - |
| In-Prestissimo | CPU (ARM optimized) | Caffe | N | Link |
| Paddle-Mobile | CPU (ARM optimized) / Mali GPU / FPGA | Paddle / Caffe / ONNX | Y | - |
| Anakin | CPU (ARM optimized) / GPU / CPU (x86 optimized) | Caffe / Fluid | Y | Link |
| Pocket-Tensor | CPU (ARM/x86 optimized) | Keras | N | Link |
| ZQCNN | CPU | Caffe / MXNet | Y | Link |
| ARM-NEON-to-x86-SSE | CPU (Intel optimized) | Intrinsics-Level | - | - |
| Simd | CPU (all platforms optimized) | Intrinsics-Level | - | - |
| clDNN | Intel® Processor Graphics / Iris™ Pro Graphics | Caffe / TensorFlow / MXNet / ONNX | Y | Link |
Fixed-point / quantized inference:

| Framework | Main Platform | Model Compatibility | Detection Support | Speed Benchmarks |
| --- | --- | --- | --- | --- |
| Intel-Caffe | CPU (Intel Skylake) | Caffe | Y | Link |
| NCNN | CPU (ARM) | Caffe / PyTorch / MXNet / ONNX | Y | Link |
| TensorFlow Lite | CPU (Android) | Caffe2 / TensorFlow / ONNX | Y | Link |
| TensorRT | GPU (Volta) | Caffe / TensorFlow / ONNX | Y | Link |
| Gemmlowp | CPU (ARM / x86) | GEMM Library | - | - |
| SNPE | DSP (Quantized DLC) | Caffe / Caffe2 / TensorFlow / ONNX | Y | Link |
| MACE | CPU (ARM optimized) / Mali GPU / DSP | Caffe / TensorFlow / ONNX | Y | Link |
| In-Prestissimo | CPU (ARM optimized) | Caffe | N | Link |
| Paddle-Mobile | CPU (ARM optimized) / Mali GPU / FPGA | Paddle / Caffe / ONNX | Y | - |
| Anakin | CPU (ARM optimized) / GPU / CPU (x86 optimized) | Caffe / Fluid | Y | Link |
Bit-serial GEMM:

| Framework | Main Platform | Model Compatibility | Detection Support | Speed Benchmarks |
| --- | --- | --- | --- | --- |
| Gemmbitserial | CPU (ARM / x86) | GEMM Library | - | Link |
## MobileNet-v1 Speed Benchmarks on RK3399

Rockchip RK3399 (Cortex-A72 1.8 GHz × 2 + Cortex-A53 1.5 GHz × 4):

| Framework (ms) | 1 Thread | 2 Threads | 3 Threads | 4 Threads |
| --- | --- | --- | --- | --- |
| Caffe+OpenBLAS\* | 250.57 | 204.40 | 248.65 | 230.20 |
| FeatherCNN | 205.76 | 135.17 | 183.34 | 194.67 |
| NCNN\*\* | 150.95 | 90.79 | 232.31 | 231.64 |
| NCNN-Opt | 122.22 | 67.47 | - | - |
| Tengine | 122.10 | 65.42 | - | - |
| Tengine-Opt | 115.29 | 63.94 | - | - |

\* Optimized for Cortex-A53 instead of Cortex-A72.
\*\* powersave=0

For the 1-thread runs the task is pinned to a single Cortex-A72 core; the 2-thread runs use both A72 cores.
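This kind of core pinning can be reproduced on Linux via CPU affinity. Below is a minimal sketch using Python's `os.sched_setaffinity`; the assumption that the two A72 cores enumerate as CPUs 4 and 5 is a common RK3399 numbering, not a guarantee — check `/proc/cpuinfo` on your board.

```python
import os

# On many RK3399 boards the big Cortex-A72 cores enumerate as CPUs 4 and 5
# (assumption -- verify against /proc/cpuinfo on your device).
A72_CORES = {4, 5}

def pin_to(cores):
    """Restrict this process (and threads it spawns later) to `cores`,
    falling back to whatever CPUs are actually available."""
    avail = os.sched_getaffinity(0)          # CPUs we are allowed to use
    target = (set(cores) & avail) or avail   # fall back if cores are absent
    os.sched_setaffinity(0, target)
    return os.sched_getaffinity(0)
```

For example, call `pin_to({4})` before a single-thread benchmark run and `pin_to(A72_CORES)` before a 2-thread run, so the scheduler cannot migrate the workload onto the slower A53 cores mid-measurement.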
## ResNet-18 Speed Benchmarks on RK3399

| Framework (ms) | 1 Thread | 2 Threads | 8 Threads |
| --- | --- | --- | --- |
| NCNN\* | 340.33 | 211.78 | - |
| NCNN-Opt | 332.20 | 206.62 | 196.97 |
| Tengine | 402.57 | 226.02 | - |

\* Conv-BN-Scale layers fused.
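The Conv-BN-Scale fusion noted above folds the batch-norm statistics and scale parameters into the convolution weights offline, so no extra per-layer work remains at inference time. A minimal pure-Python sketch of the folding arithmetic (the function name and channel-major layout are illustrative, not taken from any of the engines listed):

```python
import math

def fuse_conv_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold per-channel BatchNorm (+ Scale) parameters into conv weights.

    w:     per-output-channel filters, each a flat list of floats
    b:     per-channel conv bias
    gamma, beta, mean, var: per-channel BN scale, shift, and statistics
    """
    w_fused, b_fused = [], []
    for c, filt in enumerate(w):
        s = gamma[c] / math.sqrt(var[c] + eps)          # per-channel multiplier
        w_fused.append([x * s for x in filt])           # scale the filter
        b_fused.append((b[c] - mean[c]) * s + beta[c])  # shift the bias
    return w_fused, b_fused
```

The fused convolution produces the same outputs as Conv → BN → Scale, since BN at inference is just a per-channel affine transform that commutes into the preceding linear layer.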