- Atefeh Sohrabizadeh, Jie Wang, Jason Cong. End-to-End Optimization of Deep Learning Applications. In FPGA, 2020.
This repo contains the code for building FlexCNN, an accelerator for running CNNs on FPGAs, described here. As mentioned in the paper, you can further integrate FlexCNN into TensorFlow and offload the CNN computation of your application to the FPGA.
In this repo, we use OpenPose to demonstrate our flow.
- Hardware and Operating System
- Requirements and Dependencies
- Project File Tree
- Run the Project
- Build Your Own Hardware
- Citation
For development, the OS should be Ubuntu 18.04 LTS.
For testing and deployment, in addition to the OS requirement above, the server/PC should also be equipped with the Xilinx Virtex UltraScale+ FPGA VCU1525 Development Kit.
The Xilinx Runtime v2018.3 should be installed.
If you want to compile the library from source, the Xilinx Deployment Shell v2018.3, Xilinx Development Shell and SDAccel Design Environment v2018.3 should also be installed. You can find them through this link.
You should have Python 3.6 installed. This library uses CMake (version >= 3.0) as the build system and Google Test as the test framework. It also depends on TensorFlow.
The project file structure is shown below,
.
+-- auto_compile # Generating hardware configurations and the instructions for running it
+-- data # data needed for testing the HLS kernel
+-- HLS_Codes # HLS codes of FlexCNN
+-- libsacc # library for integrating FlexCNN to TensorFlow
+-- SDx_Project # SDAccel Project for creating FPGA binary
+-- tf_DSA # TensorFlow codes for OpenPose application and our integrator
- In ./tf_DSA/tf_pose/sacc_utils.py change the following lines (give your project path)
self.sa_dsa_path = '/path/to/libsacc/'
self.custom_lib_path = '/path/to/libsacc/build/lib/libsacc.so'
with open("/path/to/libsacc/inc/sacc_params.h", 'r') as fobj:
- Follow the instructions in ./libsacc/README.md to install the library.
- Setting environments. Please note that you should modify env.sh with the path to your Xilinx environments.
cd tf_DSA
source env.sh
- Installing packages
python3 -m venv venv
source venv/bin/activate
sudo apt-get install build-essential libcap-dev
pip3 install -r requirements.txt
pip3 install sacc
pip3 install yappi
sudo apt install swig
cd tf_pose/pafprocess/
swig -python -c++ pafprocess.i && python3 setup.py build_ext --inplace
- Run the project
./test.sh [path/to/your/video/file]
Your video should pop up, showing the poses of the people in it.
Setting environments. Please note that you should modify env.sh
with the path to your Xilinx environments.
source env.sh
We will need to generate the instructions required by the FPGA kernel and pre-process the input data and weights of the network.
We first need to extract the information needed from protobuf file generated by TensorFlow from your CNN model.
cd $PRJ_PATH/auto_compile/protobuf_translation
python3 extract_info.py -p ./protobuf.pbtxt -i ./input.json -n 'image' -o 'Openpose/concat_stage7:0'
Here is the description of the arguments to the command above:
-p : The location where the protobuf text file is stored (do not pass the binary format)
-m : The name of the output model file. This file contains the information needed by the hardware; pass it to the DSE in the next step.
-i : The name of the json file containing format of the image
-g : Only specify it if you have used a name for your graph in the protobuf text file. Otherwise leave it blank.
-n : The name of the first input tensor of your graph
-o : The name of the last tensor in your graph
You should pass your own protobuf file to the above command. Modify the input.json with the shape of your input tensor. Pass the name of the first tensor using argument (-n) and the last tensor using argument (-o).
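For illustration, an input.json describing the input tensor shape might look like the sketch below. The key names and values here are hypothetical — check the sample input.json shipped in `auto_compile/protobuf_translation` for the actual schema your version of the script expects.

```json
{
  "height": 368,
  "width": 656,
  "channels": 3
}
```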
After this command is finished, it will generate a file named network.model
that has all the information we need for the next steps.
Now, switch to design space exploration folder to find out the optimal hardware configurations and the best tiling factors.
The folder `dse` contains the scripts that perform the design space exploration. The script `dse_p.py` can run in either single-threaded or multi-threaded mode; the multi-threaded mode is preferred since it is faster. To perform the design space exploration, run the command below:
cd ../dse
python3 dse_p.py -m ../protobuf_translation/network.model -i ./input.json -b ./vu9p.json
Here is the description of the arguments to the command above:
-m : The generated file from protobuf_translation
-i : The name of the json file containing format of the image
-b : The name of the json file containing the number of resources of the target FPGA board
--parallel : (True/False) Set to False to disable the multi-threaded version of this code (enabled by default)
--systolic : (True/False) Set to False to disable the search for the optimal systolic array shapes
-dt : The dynamic tiling level (0: disabled, 1: only the number of channels is dynamic, 2: all dimensions are dynamic)
The multi-threaded mode is the default. If you want to run in single-threaded mode, set the parallel argument to False. You can choose the degree of dynamic tiling with the dynamic tiling argument (-dt). If you set it to 0, all the tiling factors will be uniform. Choose 1 to make only the tiling factors for the input/output channels dynamic. Setting it to 2, the default, makes the height and width tiling dynamic as well.
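To make the tiling discussion concrete, here is a minimal sketch (not code from the repo) of the usual padding step behind uniform tiling: a layer dimension is rounded up to the next multiple of its tiling factor, which is why the hardware sizes in the generated instructions can exceed the logical layer sizes.

```python
def pad_to_tile(dim: int, tile: int) -> int:
    """Round a layer dimension up to the next multiple of the tiling factor."""
    return ((dim + tile - 1) // tile) * tile

# e.g. 57 output channels with a channel tiling factor of 16
# occupy 64 channels in hardware
print(pad_to_tile(57, 16))  # -> 64
```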
The optimal design parameters will be written to `opt_params.json`. They are also added to the network model and stored in `network_out.model`.
Switch to the instruction generator folder and run the following command to parse the model and generate the necessary files.
cd ../inst_gen
python inst_parse.py -t ./tile.json -m ../dse/network_out.model -i ./input.json
Here is the description of the arguments to the command above:
-t : The name of the json file containing the maximum tiling factors and the systolic array size
-m : The generated file from DSE
-i : The name of the json file containing format of the image
-o : The name of the output tensors
There will be four files generated:
- `network.insts`: contains the instructions that configure the FPGA accelerator to perform the computation tasks.
- `params.h`: contains all the parameters required by the HLS kernel. Copy it to the `./HLS_Codes/` and `SDx_project/src/` folders.
- `weight_offset.dat`: helps the host program load the weights.
- `bias_offset.dat`: helps the host program load the bias.
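As a rough illustration of how a host program might consume the offset files — assuming, purely for this sketch, that the file holds one integer offset per line — they could be loaded like this:

```python
def load_offsets(path: str) -> list[int]:
    """Read one integer offset per line; blank lines are skipped.
    (The actual file format used by the FlexCNN host may differ.)"""
    with open(path) as f:
        return [int(line) for line in f if line.strip()]
```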
The `network.insts` file contains one instruction for each of the layers. The instructions are filled as follows:
Inst0: in_num_hw | out_num_hw | in_h_hw | in_w_hw | out_h_hw | out_w_hw
Inst1: in_num | out_num | in_h | in_w | out_h | out_w
Inst2: cin_offset | weight_offset | bias_offset | cout_offset | filter_s1, filter_s2 | stride
Inst3: layer_en: conv_1st_en, depth_conv_en, conv_en, relu_en, relu6_en, pool_en, up_sample_en, bias_en, inter_load_en, inter_write_en, batch_norm_en_conv, load_prev_cin, batch_norm_en_depth | prev_cin_offset | in_num_t, out_num_t | in_h_t | in_w_t | nxt_layer_batch
Inst4: task_num1 | task_num2 | local_accum_num | local_reg_num | row_il_factor | col_il_factor
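Assuming — hypothetically, for this sketch — that the fields within each instruction line are whitespace-separated integers, a parser for the Inst1 fields could look like:

```python
# Field order follows the Inst1 layout documented above;
# the whitespace separator is an assumption of this sketch.
INST1_FIELDS = ["in_num", "out_num", "in_h", "in_w", "out_h", "out_w"]

def parse_inst1(line: str) -> dict:
    """Map the values of an Inst1 line to their field names."""
    values = [int(tok) for tok in line.split()]
    return dict(zip(INST1_FIELDS, values))

print(parse_inst1("3 64 384 384 384 384"))
```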
You can either use systolic arrays as the core computation unit of the convolutional layers or use a naive implementation of the `conv` module. To get higher performance, it is recommended to use the systolic array version. However, if you want to check the functionality of your code faster, you may uncomment the `kernel` and `conv_core` functions in `HLS_Codes/kernel.cpp` and skip the systolic array generation.
Follow the below instructions to add the systolic arrays:
- Switch to the HLS Codes directory.
cd $PRJ_PATH/HLS_Codes
- In the `auto_compile/inst_gen` folder, change the `tile.json` file to the systolic array size you want.
- In the `HLS_Codes` folder, change the SIMD_LANE in `pose.h` to the SIMD factor you want.
- In the `HLS_Codes/systolic_array_kernel` folder, change the following values in `cnn_features.json` to the configs you want. If you have followed the DSE process of the last section, you can look these values up in `opt_params.json`:

SA_ROWS, SA_COLS, SIMD_FACTOR
You should also change the values for FC_SIMD_FACTOR, ROW_IL_FACTOR, COL_IL_FACTOR.
FC_SIMD_FACTOR = SIMD_FACTOR
ROW_IL_FACTOR = OUT_NUM_T / SA_ROWS
COL_IL_FACTOR = OUT_IMG_W_T / SA_COLS
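The three relations above can be checked with a short script. The tile sizes below (OUT_NUM_T = 32, OUT_IMG_W_T = 96, an 8x8 systolic array, SIMD factor 8) are made-up example values, not defaults from the repo:

```python
# Hypothetical example configuration
SIMD_FACTOR = 8
SA_ROWS, SA_COLS = 8, 8
OUT_NUM_T, OUT_IMG_W_T = 32, 96

# The relations from the text above
FC_SIMD_FACTOR = SIMD_FACTOR              # 8
ROW_IL_FACTOR = OUT_NUM_T // SA_ROWS      # 32 / 8 = 4
COL_IL_FACTOR = OUT_IMG_W_T // SA_COLS    # 96 / 8 = 12

# The tile sizes must divide evenly by the array dimensions
assert OUT_NUM_T % SA_ROWS == 0 and OUT_IMG_W_T % SA_COLS == 0
print(FC_SIMD_FACTOR, ROW_IL_FACTOR, COL_IL_FACTOR)  # -> 8 4 12
```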
- Use the following command to generate the HLS kernel and prepare all the necessary files.
./design_prepare.sh
- Now, you can run the HLS C simulation to verify the design.
vivado_hls -f hls_script.tcl
It will take several minutes to finish the C simulation.
So far, you have generated the HLS kernel files for the FPGA accelerator. Next, you have to build the bitstream of the FPGA kernel. You need to combine all kernel files into one single file for SDx project.
- Prepare the SDx kernel
To start with, switch to the SDx project directory.
cd $PRJ_PATH/SDx_project
- Run the following script to generate the SDx kernel.
./sdx_kernel_create.sh
Now, you should be able to see all the necessary kernel files in the `src` directory.
- Build the bitstream
Generate the bitstream under the `System` directory.
cd System
make all
It will take several hours to generate the bitstream. You can change the target frequency in the makefile by editing `--kernel_frequency [200]`.
You will find the host program `pose_prj.exe` and the bitstream `binary_container_1.xclbin` under the same directory.
- For running the kernel, use the command:
./pose_prj.exe binary_container_1.xclbin
Now that you have the bitstream, you can follow the instructions here to integrate your accelerator to TensorFlow.
If you find any of the ideas/codes useful for your research, please cite our paper:
@inproceedings{sohrabizadeh2020end,
title={End-to-End Optimization of Deep Learning Applications},
author={Sohrabizadeh, Atefeh and Wang, Jie and Cong, Jason},
booktitle={The 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays},
pages={133--139},
year={2020}
}