tensorflow / tflite-micro

Infrastructure to enable deployment of ML models to low-power resource-constrained embedded targets (including microcontrollers and digital signal processors).

CMSIS-NN results in no improvement in FCNs

Black3rror opened this issue · comments

commented

I have a NUCLEO-L4R5ZI board (with an Arm Cortex-M4), and I've run some experiments with TFLM built with and without CMSIS-NN (i.e., with or without `OPTIMIZED_KERNEL_DIR=cmsis_nn`). I've tested multiple fully connected networks and multiple CNNs.
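For context, the application code is identical in both builds; a minimal sketch of the op registration (assuming the standard TFLM `MicroMutableOpResolver` API) shows that nothing in the application selects the kernels:

```cpp
#include "tensorflow/lite/micro/micro_mutable_op_resolver.h"

// Sketch: op registration is the same for both builds. Whether
// AddFullyConnected()/AddConv2D() resolve to the reference or the
// CMSIS-NN implementations is decided purely at build time by
// OPTIMIZED_KERNEL_DIR=cmsis_nn; nothing changes in this code.
void RegisterOps(tflite::MicroMutableOpResolver<2>& resolver) {
  resolver.AddFullyConnected();  // reference or CMSIS-NN, chosen at build time
  resolver.AddConv2D();
}
```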

The only scenario in which CMSIS-NN improved the network's execution time was:

  • A CNN with integer quantization (I've used full_int, full_int_only (meaning the inputs/outputs are also quantized), 16x8, and 16x8_int_only). This gave me a 3 to 4 times speedup.

Scenarios in which CMSIS-NN did not help (all other situations):

  • An FC network with any quantization
  • A CNN with basic or dynamic quantization

I would like to ask why CMSIS-NN only helps in those specific scenarios.
To my understanding, CMSIS-NN uses SIMD instructions to speed up the computation. It should therefore be just as possible to do this for an FC layer as for a CNN (doing it for FC is at least as relevant). So why does CMSIS-NN only help CNNs? The same concept should also apply to fp32, so why does quantization matter in this context?
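As context for the SIMD point: the Cortex-M4 DSP extension provides packed integer multiply-accumulate instructions but no packed float equivalent, which is roughly why the optimized paths are integer-only. A minimal sketch using the CMSIS-Core `__SMLAD` intrinsic (assuming a Cortex-M4 target with the DSP extension, where the intrinsic is available via the CMSIS headers):

```cpp
#include "cmsis_compiler.h"  // CMSIS-Core; provides __SMLAD on DSP-capable cores

// __SMLAD multiplies two pairs of signed 16-bit halfwords and adds both
// products to an accumulator in a single instruction -- the kind of
// primitive CMSIS-NN's int8/int16 kernels are built on. There is no
// comparable packed-SIMD instruction for float32 on this core.
static inline int32_t dual_mac_s16(uint32_t a_packed, uint32_t b_packed,
                                   int32_t acc) {
  // acc += a.lo * b.lo + a.hi * b.hi
  return (int32_t)__SMLAD(a_packed, b_packed, (uint32_t)acc);
}
```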

Hi @Black3rror
CMSIS-NN only supports int8 and int16 activations with int8 weights (plus int4 packed weights to some extent). There is currently no float support.
You can see in the code that, in the case of float, it falls back to the reference kernels.
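For illustration, the dispatch in the CMSIS-NN fully connected kernel looks roughly like this (a simplified sketch, not the exact tflite-micro source; the per-type helpers are hypothetical stand-ins):

```cpp
#include "tensorflow/lite/c/common.h"
#include "tensorflow/lite/micro/kernels/kernel_util.h"

// Hypothetical helpers standing in for the real per-type implementations.
TfLiteStatus EvalFloat(TfLiteContext* context, TfLiteNode* node);
TfLiteStatus EvalQuantizedInt8(TfLiteContext* context, TfLiteNode* node);

TfLiteStatus Eval(TfLiteContext* context, TfLiteNode* node) {
  const TfLiteEvalTensor* input =
      tflite::micro::GetEvalInput(context, node, /*index=*/0);
  switch (input->type) {
    case kTfLiteInt8:
      // Optimized path: ends up in CMSIS-NN's arm_fully_connected_s8().
      return EvalQuantizedInt8(context, node);
    case kTfLiteFloat32:
      // No CMSIS-NN float kernel exists, so float models run the portable
      // reference code -- the same speed as a build without CMSIS-NN.
      return EvalFloat(context, node);
    default:
      return kTfLiteError;
  }
}
```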

commented

Hi @mansnils, and thanks a lot for the answer.
Still, the question remains: why does it only help CNNs and not FCs?

What do you mean by FCs? Is it an int8 or int16 quantized fully connected network? Then it should help.
Feel free to upload the model here if that is possible.

commented

@mansnils Yes, it's an int8 quantized model.

Let me share some results from my experiments. I've tested different fully connected models, each converted to tflite with a simple conversion (basic, meaning no quantization is done and the parameters and activations stay in float32) and with full integer quantization (q_full_int_only, meaning everything is in int8, even the inputs and outputs). The results are shown in this table:

| Model | TFLM basic with CMSIS-NN | TFLM q_full_int_only with CMSIS-NN | TFLM basic without CMSIS-NN |
| --- | --- | --- | --- |
| FC_min | 0.085417 ms (10250 ticks) | 0.080567 ms (9668 ticks) | 0.085450 ms (10254 ticks) |
| FC_xs | 1.760100 ms (211212 ticks) | 2.013417 ms (241610 ticks) | 1.755567 ms (210668 ticks) |
| FC_s | 6.743333 ms (809200 ticks) | 7.287267 ms (874472 ticks) | 6.744583 ms (809350 ticks) |
| FC_m | 31.818100 ms (3818172 ticks) | 31.981216 ms (3837746 ticks) | 31.977200 ms (3837264 ticks) |
| FC_l | 126.057663 ms (15126920 ticks) | 124.876732 ms (14985208 ticks) | 126.689980 ms (15202798 ticks) |
| FC_boston_s | 4.839800 ms (580776 ticks) | 5.898883 ms (707866 ticks) | 4.834784 ms (580174 ticks) |
| FC_boston_m | 16.922884 ms (2030746 ticks) | 18.926167 ms (2271140 ticks) | 16.913982 ms (2029678 ticks) |

You can see that CMSIS-NN is not helping, even with integer quantization.

To make things more concrete and center this discussion on an example, I've made this colab notebook, in which you can check the model creation and conversion for yourself and use the generated tflite models in a project to measure their execution time (a sketch of a cycle-count measurement loop follows the table below). The results are as follows:

  • Board: NUCLEO-L4R5ZI (ARM Cortex-M4)
  • IDE: STM32CubeIDE
  • TFLite files: tflite_models.zip
| Model | TFLM basic with CMSIS-NN | TFLM q_full_int_only with CMSIS-NN | TFLM basic without CMSIS-NN | TFLM q_full_int_only without CMSIS-NN |
| --- | --- | --- | --- | --- |
| [784, 100, 50, 1] | 31.905800 ms (3828696 ticks) | 32.330032 ms (3879604 ticks) | 31.948383 ms (3833806 ticks) | 36.846432 ms (4421572 ticks) |
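For reference, here is a minimal sketch of how such tick counts can be obtained on a Cortex-M4 using the DWT cycle counter (assuming the CMSIS device header for this board and an already-initialized TFLM interpreter; the actual project may measure differently):

```cpp
#include "stm32l4xx.h"  // assumption: CMSIS device header for the NUCLEO-L4R5ZI
#include "tensorflow/lite/micro/micro_interpreter.h"

// Enable the DWT cycle counter once at startup.
void CycleCounterInit(void) {
  CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;  // enable the trace block
  DWT->CYCCNT = 0;
  DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;             // start counting cycles
}

// Measure one inference in cycles ("ticks" in the tables above).
uint32_t MeasureInvoke(tflite::MicroInterpreter& interpreter) {
  uint32_t start = DWT->CYCCNT;
  interpreter.Invoke();
  // At the 120 MHz implied by the tables, ms = ticks / 120000.0.
  return DWT->CYCCNT - start;
}
```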

As you can see, CMSIS-NN is not helping much (whereas with a convolutional model, CMSIS-NN can reduce the execution time by a factor of 3 or 4).

Please check and let me know your thoughts.
Thanks in advance.

Thanks for the detailed example. From that I generated tflite_full_int_only.tflite and ran it with the reference and the CMSIS-NN kernels on an MPS2 board with an Arm(R) Cortex(R)-M4 processor, comparing the cycle counts. The CMSIS-NN build uses 73% fewer cycles, so no surprises I would say.
Could you put an error pragma here and make sure it is hit for the CMSIS-NN build?
https://github.com/ARM-software/CMSIS-NN/blob/4b46c85b7a43c2e8a313aa90ed65bf58213c1f15/Source/NNSupportFunctions/arm_nn_vec_mat_mult_t_s8.c#L540
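For example (a temporary sanity check, to be removed afterwards; the idea is that the build should fail with this message only when the optimized path is actually being compiled):

```cpp
// Temporarily add at the linked line inside arm_nn_vec_mat_mult_t_s8.c.
// If the CMSIS-NN build fails with this message, the optimized path is
// compiled in; if it builds cleanly, only the fallback C code is used.
#error "CMSIS-NN optimized vec_mat_mult path is compiled"
```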
Just note that it does not make sense to compare TFLM basic with and without CMSIS-NN, since it will run the reference kernels in both cases, assuming basic is float.

Also, please check that your compiler options are in line with this: https://github.com/ARM-software/CMSIS-NN/blob/main/README.md#compiler-options. If not, please try enabling -Ofast.