- Atefeh Sohrabizadeh, Jie Wang, Jason Cong. End-to-End Optimization of Deep Learning Applications. In FPGA, 2020.
This repo contains the code for building FlexCNN, an accelerator for running CNNs on FPGAs, described here. As mentioned in the paper, you can further integrate FlexCNN into TensorFlow and offload the CNN computation of your application to the FPGA.
In this repo, we use OpenPose to demonstrate our flow.
- Hardware and Operating System
- Requirements and Dependencies
- Project File Tree
- Run the Project
- Build Your Own Hardware
- Citation
For development, the OS should be Ubuntu 18.04 LTS.
For testing and deployment, in addition to the OS requirement above, the server/PC should also be equipped with the Xilinx Virtex UltraScale+ FPGA VCU1525 Development Kit.
The Xilinx Runtime v2018.3 should be installed.
If you want to compile the library from source, the Xilinx Deployment Shell v2018.3, Xilinx Development Shell and SDAccel Design Environment v2018.3 should also be installed. You can find them through this link.
You should have Python 3.6 installed. This library uses CMake (version >= 3.0) as the build system and Google Test as the test framework. It also depends on TensorFlow.
The project file structure is shown below,
.
+-- auto_compile # Generating hardware configurations and the instructions for running it
+-- data # data needed for testing the HLS kernel
+-- HLS_Codes # HLS codes of FlexCNN
+-- libsacc # library for integrating FlexCNN to TensorFlow
+-- SDx_Project # SDAccel Project for creating FPGA binary
+-- tf_DSA # TensorFlow codes for OpenPose application and our integrator
- In ./tf_DSA/tf_pose/sacc_utils.py change the following lines (give your project path)
self.sa_dsa_path = '/path/to/libsacc/'
self.custom_lib_path = '/path/to/libsacc/build/lib/libsacc.so'
with open("/path/to/libsacc/inc/sacc_params.h", 'r') as fobj:
- Follow the instructions in ./libsacc/README.md to install the library.
- Setting environments. Please note that you should modify env.sh with the path to your Xilinx environments.
cd tf_DSA
source env.sh
- Installing packages
python3 -m venv venv
source venv/bin/activate
sudo apt-get install build-essential libcap-dev
pip3 install -r requirements.txt
pip3 install sacc
pip3 install yappi
sudo apt install swig
cd tf_pose/pafprocess/
swig -python -c++ pafprocess.i && python3 setup.py build_ext --inplace
- Run the project
./test.sh [path/to/your/video/file]
Your video should pop up, showing the poses of the people in it.
Setting environments. Please note that you should modify env.sh
with the path to your Xilinx environments.
source env.sh
We will need to generate the instructions required by the FPGA kernel and pre-process the input data and weights of the network.
We first need to extract the information needed from protobuf file generated by TensorFlow from your CNN model.
cd $PRJ_PATH/auto_compile/protobuf_translation
python3 extract_info.py -p ./protobuf.pbtxt -i ./input.json -n 'image' -o 'Openpose/concat_stage7:0'
Here is the description of the arguments to the command above:
-p : The location where the protobuf text file is stored (do not pass the binary format)
-m : The name of the output model file. This file contains the information needed by the hardware; pass it to the DSE in the next step.
-i : The name of the json file containing format of the image
-g : Only specify it if you have used a name for your graph in the protobuf text file. Otherwise leave it blank.
-n : The name of the first input tensor of your graph
-o : The name of the last tensor in your graph
You should pass your own protobuf file to the above command. Modify the input.json with the shape of your input tensor. Pass the name of the first tensor using argument (-n) and the last tensor using argument (-o).
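For illustration, an input.json describing the input tensor shape might look like the sketch below. The key names and values here are hypothetical — check the sample input.json shipped in `auto_compile/protobuf_translation` for the actual schema your version of the script expects.

```json
{
  "height": 368,
  "width": 656,
  "channels": 3
}
```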
After this command is finished, it will generate a file named network.model
that has all the information we need for the next steps.
Now, switch to design space exploration folder to find out the optimal hardware configurations and the best tiling factors.
The folder `dse` contains the scripts that perform the design space exploration. The script `dse_p.py` can run in either single-threaded or multi-threaded mode; the multi-threaded mode is preferred since it is faster. To perform the design space exploration, run the command below:
cd ../dse
python3 dse_p.py -m ../protobuf_translation/network.model -i ./input.json -b ./vu9p.json
Here is the description of the arguments to the command above:
-m : The generated file from protobuf_translation
-i : The name of the json file containing format of the image
-b : The name of the json file containing the number of resources of the target FPGA board
--parallel : (True/False) Set to False to disable the multi-threaded version of this code (enabled by default)
--systolic : (True/False) Set to False to disable the search for the optimal systolic array shapes
-dt : The dynamic tiling level (0: disabled, 1: only the number of channels is dynamic, 2: all dimensions are dynamic)
The multi-threaded mode is the default. If you want to run in single-threaded mode, set the parallel argument to False. You can choose the degree of dynamic tiling with the dynamic tiling argument (-dt). If you set it to 0, all the tiling factors will be uniform. Choose 1 to make only the tiling factors for the input/output channels dynamic. Setting it to 2, the default, makes the height and width tiling dynamic as well.
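To make the tiling discussion concrete, here is a minimal sketch (not code from the repo) of the usual padding step behind uniform tiling: a layer dimension is rounded up to the next multiple of its tiling factor, which is why the hardware sizes in the generated instructions can exceed the logical layer sizes.

```python
def pad_to_tile(dim: int, tile: int) -> int:
    """Round a layer dimension up to the next multiple of the tiling factor."""
    return ((dim + tile - 1) // tile) * tile

# e.g. 57 output channels with a channel tiling factor of 16
# occupy 64 channels in hardware
print(pad_to_tile(57, 16))  # -> 64
```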
The optimal design parameters will be written to `opt_params.json`. They are also added to the network model and stored in `network_out.model`.
Switch to the instruction generator folder and run the following command to parse the model and generate the necessary files.
cd ../inst_gen
python inst_parse.py -t ./tile.json -m ../dse/network_out.model -i ./input.json
Here is the description of the arguments to the command above:
-t : The name of the json file containing the maximum tiling factors and the systolic array size
-m : The generated file from DSE
-i : The name of the json file containing format of the image
-o : The name of the output tensors
There will be four files generated:
- `network.insts`: contains the instructions that configure the FPGA accelerator to perform the computation tasks.
- `params.h`: contains all the parameters required by the HLS kernel. Copy it to the `./HLS_Codes/` and `SDx_project/src/` folders.
- `weight_offset.dat`: helps the host program load the weights.
- `bias_offset.dat`: helps the host program load the bias.
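As a rough illustration of how a host program might consume the offset files — assuming, purely for this sketch, that the file holds one integer offset per line — they could be loaded like this:

```python
def load_offsets(path: str) -> list[int]:
    """Read one integer offset per line; blank lines are skipped.
    (The actual file format used by the FlexCNN host may differ.)"""
    with open(path) as f:
        return [int(line) for line in f if line.strip()]
```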
The `network.insts` file contains one instruction for each of the layers. The instructions are filled as follows:
Inst0: in_num_hw | out_num_hw | in_h_hw | in_w_hw | out_h_hw | out_w_hw
Inst1: in_num | out_num | in_h | in_w | out_h | out_w
Inst2: cin_offset | weight_offset | bias_offset | cout_offset | filter_s1, filter_s2 | stride
Inst3: layer_en: conv_1st_en, depth_conv_en, conv_en, relu_en, relu6_en, pool_en, up_sample_en, bias_en, inter_load_en, inter_write_en, batch_norm_en_conv, load_prev_cin, batch_norm_en_depth | prev_cin_offset | in_num_t, out_num_t | in_h_t | in_w_t | nxt_layer_batch
Inst4: task_num1 | task_num2 | local_accum_num | local_reg_num | row_il_factor | col_il_factor
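Assuming — hypothetically, for this sketch — that the fields within each instruction line are whitespace-separated integers, a parser for the Inst1 fields could look like:

```python
# Field order follows the Inst1 layout documented above;
# the whitespace separator is an assumption of this sketch.
INST1_FIELDS = ["in_num", "out_num", "in_h", "in_w", "out_h", "out_w"]

def parse_inst1(line: str) -> dict:
    """Map the values of an Inst1 line to their field names."""
    values = [int(tok) for tok in line.split()]
    return dict(zip(INST1_FIELDS, values))

print(parse_inst1("3 64 384 384 384 384"))
```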
You can either use systolic arrays as the core computation unit of the convolutional layers or use a naive implementation of the `conv` module. To get higher performance, it is recommended to use the systolic array version. However, if you want to check the functionality of your code faster, you may uncomment the `kernel` and `conv_core` functions in `HLS_Codes/kernel.cpp` and skip the systolic array generation.
Follow the below instructions to add the systolic arrays:
- Switch to the HLS Codes directory.
cd $PRJ_PATH/HLS_Codes
- In the `auto_compile/inst_gen` folder, change the `tile.json` file to the systolic array size you want.
- In the `HLS_Codes` folder, change the SIMD_LANE in `pose.h` to the SIMD factor you want.
- In the `HLS_Codes/systolic_array_kernel` folder, change the following values in `cnn_features.json` to the configs you want. If you have followed the DSE process of the last section, you can look these values up in `opt_params.json`:

SA_ROWS, SA_COLS, SIMD_FACTOR
You should also change the values for FC_SIMD_FACTOR, ROW_IL_FACTOR, COL_IL_FACTOR.
FC_SIMD_FACTOR = SIMD_FACTOR
ROW_IL_FACTOR = OUT_NUM_T / SA_ROWS
COL_IL_FACTOR = OUT_IMG_W_T / SA_COLS
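The three relations above can be checked with a short script. The tile sizes below (OUT_NUM_T = 32, OUT_IMG_W_T = 96, an 8x8 systolic array, SIMD factor 8) are made-up example values, not defaults from the repo:

```python
# Hypothetical example configuration
SIMD_FACTOR = 8
SA_ROWS, SA_COLS = 8, 8
OUT_NUM_T, OUT_IMG_W_T = 32, 96

# The relations from the text above
FC_SIMD_FACTOR = SIMD_FACTOR              # 8
ROW_IL_FACTOR = OUT_NUM_T // SA_ROWS      # 32 / 8 = 4
COL_IL_FACTOR = OUT_IMG_W_T // SA_COLS    # 96 / 8 = 12

# The tile sizes must divide evenly by the array dimensions
assert OUT_NUM_T % SA_ROWS == 0 and OUT_IMG_W_T % SA_COLS == 0
print(FC_SIMD_FACTOR, ROW_IL_FACTOR, COL_IL_FACTOR)  # -> 8 4 12
```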
- Use the following command to generate the HLS kernel and prepare all the necessary files.
./design_prepare.sh
- Now, you can run the HLS C simulation to verify the design.
vivado_hls -f hls_script.tcl
It will take several minutes to finish the C simulation.
So far, you have generated the HLS kernel files for the FPGA accelerator. Next, you have to build the bitstream of the FPGA kernel. You need to combine all kernel files into one single file for SDx project.
- Prepare the SDx kernel
To start with, switch to the SDx project directory.
cd $PRJ_PATH/SDx_project
- Run the following script to generate the SDx kernel.
./sdx_kernel_create.sh
Now, you should be able to see all the necessary kernel files in the `src` directory.
- Build the bitstream
Generate the bitstream under the `System` directory.
cd System
make all
It will take several hours to generate the bitstream. You can change the target frequency in the makefile by editing `--kernel_frequency [200]`.
You will find the host program `pose_prj.exe` and the bitstream `binary_container_1.xclbin` under the same directory.
- For running the kernel, use the command:
./pose_prj.exe binary_container_1.xclbin
Now that you have the bitstream, you can follow the instructions here to integrate your accelerator to TensorFlow.
If you find any of the ideas/codes useful for your research, please cite our paper:
@inproceedings{sohrabizadeh2020end,
title={End-to-End Optimization of Deep Learning Applications},
author={Sohrabizadeh, Atefeh and Wang, Jie and Cong, Jason},
booktitle={The 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays},
pages={133--139},
year={2020}
}