Luca96 / dlib-for-android

Compile and embed Dlib in your Android projects with ease.

Performance optimisation

lemberh opened this issue · comments

Do you support the NEON instruction set optimizations from this thread: davisking/dlib#276?

Hi @lemberh, unfortunately I support only the basic architectures.

Did you try to use those optimizations?
I have changed the cmake command to:

    ${AndroidCmake} -DBUILD_SHARED_LIBS=1 \
        -DANDROID_NDK=${NDK} \
        -DCMAKE_SYSTEM_NAME=Android \
        -DCMAKE_TOOLCHAIN_FILE=${TOOLCHAIN} \
        -DCMAKE_BUILD_TYPE=Release \
        -DCMAKE_CXX_FLAGS="-std=c++11 -frtti -fexceptions -march=armv7-a -mfpu=neon" \
        -DANDROID_ARM_NEON=TRUE \
        -DCMAKE_C_FLAGS=-O3 \
        -DANDROID_ABI=${abi} \
        -DANDROID_PLATFORM=${MIN_SDK} \
        -DANDROID_TOOLCHAIN=clang \
        -DANDROID_STL=c++_shared \
        -DANDROID_CPP_FEATURES="rtti exceptions" \
        -DCMAKE_PREFIX_PATH=../../ \
        ../../

But it doesn't seem to have any impact on performance.
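
A quick sanity check (generic, not something from this thread): clang predefines __ARM_NEON when it actually compiles for a NEON-capable target, so a small native log call can confirm whether the flags took effect. Remember to link against liblog (-llog).

    // Logs whether NEON was enabled in this native build: clang defines
    // __ARM_NEON (or the older __ARM_NEON__) when targeting NEON.
    #include <android/log.h>

    void log_neon_status()
    {
    #if defined(__ARM_NEON) || defined(__ARM_NEON__)
        __android_log_print(ANDROID_LOG_INFO, "dlib", "NEON is enabled");
    #else
        __android_log_print(ANDROID_LOG_INFO, "dlib", "NEON is NOT enabled");
    #endif
    }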

Have you tried adding the ABI "armeabi-v7a with NEON" to the script?

You should edit line 17 of setup.sh (I guess you're using Linux) so that it looks like this:

ABI=('armeabi-v7a' 'armeabi-v7a with NEON' 'arm64-v8a' 'x86' 'x86_64')

Let me know if it works now.

For more information you can read this.

I have tried this.
Unfortunately, I don't see any difference.
The Android CMake documentation says:

armeabi-v7a with NEON | Same as -DANDROID_ABI=armeabi-v7a -DANDROID_ARM_NEON=ON.

https://developer.android.com/ndk/guides/cmake#android_abi

I'm using face recognition based on this example: http://dlib.net/dnn_face_recognition_ex.cpp.html, slightly modified to skip the face-detection step.
On arm64-v8a devices it takes around 700 ms to compute a face descriptor, but on armeabi-v7a devices it takes from 2 up to 5 seconds.
I'm wondering if that can be improved with NEON instructions.
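
For reference, a minimal sketch of that setup, assuming the detection step is simply dropped and the network is fed an already-aligned 150x150 face chip (anet_type is copied from dnn_face_recognition_ex.cpp; the file names here are illustrative):

    // dnn_face_recognition_ex.cpp with the face-detection step removed:
    // the input is assumed to be an already-aligned 150x150 RGB face chip.
    #include <dlib/dnn.h>
    #include <dlib/image_io.h>

    using namespace dlib;

    // ---- network definition, copied from dnn_face_recognition_ex.cpp ----
    template <template <int,template<typename>class,int,typename> class block, int N, template<typename>class BN, typename SUBNET>
    using residual = add_prev1<block<N,BN,1,tag1<SUBNET>>>;

    template <template <int,template<typename>class,int,typename> class block, int N, template<typename>class BN, typename SUBNET>
    using residual_down = add_prev2<avg_pool<2,2,2,2,skip1<tag2<block<N,BN,2,tag1<SUBNET>>>>>>;

    template <int N, template <typename> class BN, int stride, typename SUBNET>
    using block = BN<con<N,3,3,1,1,relu<BN<con<N,3,3,stride,stride,SUBNET>>>>>;

    template <int N, typename SUBNET> using ares      = relu<residual<block,N,affine,SUBNET>>;
    template <int N, typename SUBNET> using ares_down = relu<residual_down<block,N,affine,SUBNET>>;

    template <typename SUBNET> using alevel0 = ares_down<256,SUBNET>;
    template <typename SUBNET> using alevel1 = ares<256,ares<256,ares_down<256,SUBNET>>>;
    template <typename SUBNET> using alevel2 = ares<128,ares<128,ares_down<128,SUBNET>>>;
    template <typename SUBNET> using alevel3 = ares<64,ares<64,ares<64,ares_down<64,SUBNET>>>>;
    template <typename SUBNET> using alevel4 = ares<32,ares<32,ares<32,SUBNET>>>;

    using anet_type = loss_metric<fc_no_bias<128,avg_pool_everything<
                                alevel0<alevel1<alevel2<alevel3<alevel4<
                                max_pool<3,3,2,2,relu<affine<con<32,7,7,2,2,
                                input_rgb_image_sized<150>
                                >>>>>>>>>>>>;

    int main()
    {
        anet_type net;
        deserialize("dlib_face_recognition_resnet_model_v1.dat") >> net;

        matrix<rgb_pixel> chip;
        load_image(chip, "face_chip_150x150.jpg");  // pre-cropped, no detector pass

        matrix<float,0,1> descriptor = net(chip);   // the 128D face vector
        return 0;
    }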

To gain some performance you can try to:

  • Process grayscale images instead of RGB (see the sketch after this list).
  • Downscale the input images, as well as the network input.
  • Reduce the neural network size, i.e. fewer layers and fewer filters.
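
A minimal sketch of the first two suggestions using dlib's image utilities. Note that the stock face-recognition net expects 150x150 RGB chips, so feeding it grayscale or smaller inputs would also mean changing the input layer (e.g. to input<matrix<unsigned char>>) and retraining:

    // Grayscale conversion and downscaling with dlib's image utilities.
    #include <dlib/matrix.h>
    #include <dlib/image_transforms.h>

    using namespace dlib;

    matrix<unsigned char> preprocess(const matrix<rgb_pixel>& img)
    {
        // 1) Grayscale: one channel to process instead of three.
        matrix<unsigned char> gray;
        assign_image(gray, img);

        // 2) Downscale: halving each side cuts the pixel count to a quarter.
        matrix<unsigned char> smaller(img.nr()/2, img.nc()/2);
        resize_image(gray, smaller);   // bilinear interpolation by default
        return smaller;
    }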

Thanks for the suggestions, I will try them!

Building dlib linked against OpenBLAS improves performance greatly. Instructions can be found here: davisking/dlib#1238 (comment)

For example, on a Redmi 7A, face descriptor calculation took ~3.5 s with the default prebuilt dlib .so files; after rebuilding dlib with OpenBLAS it takes ~350 ms, about 10 times faster.
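
For what it's worth, a small timing harness makes before/after comparisons like this easy to reproduce; a sketch, reusing anet_type and the face chip from the earlier sketch:

    // Times one forward pass of the face-descriptor network (anet_type and
    // the 150x150 face chip as in the earlier sketch).
    #include <chrono>

    long long descriptor_millis(anet_type& net, const matrix<rgb_pixel>& chip)
    {
        const auto t0 = std::chrono::steady_clock::now();
        matrix<float,0,1> descriptor = net(chip);
        const auto t1 = std::chrono::steady_clock::now();
        return std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
    }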