changdae20 / qResNet

Quantize ONNX format deep learning models using custom arbitrary-precision floating-point numbers to optimize performance

qResNet

qResNet, which stands for quantized ResNet, facilitates the quantization of the ResNet-18 model using arbitrary-precision floating-point numbers.

Example

In the example below, the ResNet-18 model was quantized using a 15-bit floating-point representation (1 bit for the sign, 10 bits for the mantissa, and 4 bits for the exponent). We used a 128x128 tiger image for inference, and the results were as follows:

root@1723a04c0176:~/qresnet# ./bin/main 
[util.hpp L24] image.rows : 128, image.cols : 128
Current Precision : total 15-bit
=== Predicted Result ===
Top 1 : tiger, Panthera tigris with logit 41.00000000000000000000
Top 2 : tiger cat with logit 34.37500000000000000000
Top 3 : jaguar, panther, Panthera onca, Felis onca with logit 29.09375000000000000000
Top 4 : leopard, Panthera pardus with logit 26.48437500000000000000
Top 5 : snow leopard, ounce, Panthera uncia with logit 25.06250000000000000000
Elapsed time : 4362ms
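
To make the 1/4/10 bit layout above concrete, here is a minimal sketch of rounding a value to such a format. It is not the project's implementation: qResNet depends on GMP and MPFR, while this sketch uses plain double arithmetic only for illustration. The function name `quantize`, the IEEE-style exponent bias, the flush-to-zero behavior for tiny values, and the saturation on overflow are all assumptions.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper (not part of qResNet): round x to the nearest value
// representable with 1 sign bit, exp_bits exponent bits, and mant_bits stored
// mantissa bits (plus an implicit leading 1).
// Assumptions: IEEE-style exponent bias, no subnormals (tiny values flush to
// zero), and saturation to the largest finite value instead of overflow.
double quantize(double x, int mant_bits, int exp_bits) {
    assert(mant_bits > 0 && mant_bits < 52 && exp_bits > 1 && exp_bits < 11);
    if (x == 0.0 || !std::isfinite(x)) return x;

    const int bias = (1 << (exp_bits - 1)) - 1;  // 7 for a 4-bit exponent
    const int e_max = bias;                      // largest normal exponent
    const int e_min = 1 - bias;                  // smallest normal exponent

    // Split x into m * 2^e with 0.5 <= |m| < 1, then keep mant_bits + 1
    // significant bits of m (round to nearest under the default rounding mode).
    int e = 0;
    const double m = std::frexp(x, &e);
    const double scale = std::ldexp(1.0, mant_bits + 1);
    const double q = std::ldexp(std::nearbyint(m * scale) / scale, e);

    // Clamp to the assumed representable range.
    const double max_val = std::ldexp(2.0 - std::ldexp(1.0, -mant_bits), e_max);
    const double min_val = std::ldexp(1.0, e_min);
    if (std::fabs(q) > max_val) return std::copysign(max_val, x);
    if (std::fabs(q) < min_val) return std::copysign(0.0, x);
    return q;
}
```

Under these assumptions, `quantize(26.484375, 10, 4)` returns its input unchanged, because 26.484375 needs exactly 11 significant bits (the implicit leading 1 plus 10 mantissa bits), whereas `quantize(0.1, 10, 4)` is rounded to 0.0999755859375.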

Dependencies

The build environment is based on the nvcr.io/nvidia/pytorch:23.06-py3 Docker image, chosen because it already includes g++-11, protoc, and CMake. GMP and MPFR are not covered by that list and were built manually.

How to Build

Ensure all dependencies are installed before proceeding with the build.

mkdir -p build
cd build
cmake ..
cmake --build . -j
cd ..

Operator Support Matrix

| Operator | Supported | Details |
| --- | --- | --- |
| Add | Y | Implemented using a threadpool |
| BatchNorm2D | Y | Implemented using a threadpool |
| Convolution2D | Y | Implemented using a threadpool |
| ConvTranspose2D | Y | Implemented using a threadpool |
| Flatten | Y | |
| GAP (Global Average Pooling) | Y | Implemented using a threadpool |
| Gemm (General Matrix Multiplication) | Y | Implemented using a threadpool |
| Identity | Y | |
| InstanceNorm2D | Y | Implemented using a threadpool |
| Leaky ReLU | Y | Implemented using a threadpool |
| MaxPool2D | Y | Implemented using a threadpool |
| ReLU | Y | Implemented using a threadpool |
| Tanh | Y | Implemented using a threadpool |
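
Most operators in the table are described as implemented using a threadpool. The project's own pool is not shown here, so the following is only a minimal sketch of the general pattern: a fixed set of worker threads pulls tasks from a queue, and an element-wise operator (ReLU in this case) splits its input into chunks and submits one task per chunk. The names `ThreadPool`, `submit`, and `relu` are hypothetical and do not come from the qResNet source, and the sketch works on `float` rather than on arbitrary-precision values.

```cpp
#include <algorithm>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical fixed-size thread pool (illustration only).
class ThreadPool {
public:
    explicit ThreadPool(std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            workers_.emplace_back([this] { run(); });
    }
    ~ThreadPool() {
        { std::lock_guard<std::mutex> lk(m_); stop_ = true; }
        cv_.notify_all();
        for (auto& t : workers_) t.join();
    }
    // Queue a task and return a future that becomes ready when it finishes.
    std::future<void> submit(std::function<void()> task) {
        auto p = std::make_shared<std::packaged_task<void()>>(std::move(task));
        std::future<void> f = p->get_future();
        { std::lock_guard<std::mutex> lk(m_); tasks_.push([p] { (*p)(); }); }
        cv_.notify_one();
        return f;
    }
private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lk(m_);
                cv_.wait(lk, [this] { return stop_ || !tasks_.empty(); });
                if (stop_ && tasks_.empty()) return;  // drain queue, then exit
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();  // run outside the lock so other workers can proceed
        }
    }
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> tasks_;
    bool stop_ = false;
    std::vector<std::thread> workers_;  // declared last: started after the rest
};

// In-place ReLU: split x into at most `chunks` contiguous ranges, one task each.
void relu(std::vector<float>& x, ThreadPool& pool, std::size_t chunks) {
    assert(chunks > 0);
    const std::size_t step = (x.size() + chunks - 1) / chunks;
    std::vector<std::future<void>> done;
    for (std::size_t b = 0; b < x.size(); b += step) {
        const std::size_t e = std::min(x.size(), b + step);
        done.push_back(pool.submit([&x, b, e] {
            for (std::size_t i = b; i < e; ++i) x[i] = std::max(x[i], 0.0f);
        }));
    }
    for (auto& f : done) f.get();  // wait until every chunk is processed
}
```

A caller would create one pool, for example `ThreadPool pool(4);`, and reuse it across operators with calls such as `relu(activations, pool, 4);`. Reusing the workers avoids paying thread start-up cost on every layer, which is the usual reason for choosing a pool over spawning threads per call.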

License: MIT License

