atrifex / CNN-Acceleration

Implementing CNN code in CUDA and OpenCL to evaluate its performance on NVIDIA GPUs, AMD GPUs, and an FPGA platform.

Code Acceleration

The goal of this project is twofold. First, accelerate the forward propagation step of the Convolutional Neural Network (CNN) algorithm using CUDA and OpenCL. Second, evaluate its performance on NVIDIA GPUs, AMD GPUs, and an FPGA platform.
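
For reference, the computation being accelerated is a direct convolution over each image in the batch. The following is a minimal, naive CUDA kernel sketching that forward step; the NCHW layout, the stride-1/no-padding assumption, and all names and parameters are illustrative and are not taken from this repository's kernels.

// Illustrative sketch only -- not the repository's implementation.
// Assumes NCHW layout, stride 1, no padding; C input channels, M output
// feature maps, and KxK filters. One thread computes one output pixel.
__global__ void conv_forward(const float *in, const float *weights, float *out,
                             int C, int H, int W, int M, int K) {
    const int H_out = H - K + 1;
    const int W_out = W - K + 1;

    int n   = blockIdx.z;                               // image in the batch
    int m   = blockIdx.y;                               // output feature map
    int idx = blockIdx.x * blockDim.x + threadIdx.x;    // flattened output pixel
    if (idx >= H_out * W_out) return;
    int h = idx / W_out, w = idx % W_out;

    float acc = 0.0f;
    for (int c = 0; c < C; ++c)                         // input channels
        for (int p = 0; p < K; ++p)                     // filter rows
            for (int q = 0; q < K; ++q)                 // filter columns
                acc += in[((n * C + c) * H + (h + p)) * W + (w + q)] *
                       weights[((m * C + c) * K + p) * K + q];
    out[((n * M + m) * H_out + h) * W_out + w] = acc;
}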

CNN and MNIST

Provided is a model that has been trained on 60,000 examples (the training set images), and the provided test data consists of 10,000 batched queries (the test set images). The expected accuracy of the CNN on the provided test dataset is ~87%.

The dataset and model are from the MNIST database.

System Requirements And Building the Project

The project requires a C++ compiler, CUDA 8 Toolkit, and OpenCL 1.2 or higher.

The CUDA 8 Toolkit can be downloaded from the CUDA Download page. Instructions on how to install the CUDA Toolkit are available on the Quick Start page. Installation guides and the list of supported host compilers for Windows, Linux, and macOS are also found in the CUDA Toolkit documentation. Aside from a C++ compiler and the CUDA 8 Toolkit, CMake 3.1 or later is required to generate build scripts for your target IDE and compiler.
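
To confirm that the toolchain is set up, you can check the installed versions of nvcc and cmake (standard commands, not specific to this project):

nvcc --version
cmake --version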

Building CUDA and OpenCL versions

To build the CUDA and OpenCL versions, the Hunter package manager is used with CMake to install the required libraries (mainly HDF5).

Assuming that you have checked out the project into $PROJECT_DIR, execute the following sequence of commands:

cd $PROJECT_DIR/cuda_implementation/
mkdir build
cd build
cmake ../

This will download the software required for the project (see the Hunter docs for more information). You may see some warnings while HDF5 is being compiled; these can be ignored. Once CMake has finished, a Makefile is generated, so you can then run make to build the project.

make

The same sequence of commands needs to be executed in the openCL_implementation directory to build the OpenCL version.

If you do not plan on using make, examine the cmake -G option, which allows you to generate Xcode, Visual Studio, and other project configurations. You may also need to change the build type to enable/disable debugging and/or optimizations.
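
For example (standard CMake options, shown here only as an illustration), generating an Xcode project or switching to an optimized Release build could look like:

cmake -G Xcode ../
cmake -DCMAKE_BUILD_TYPE=Release ../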

If you need to use another library, you have to modify CMakeLists.txt and add the library to the target_link_libraries (and possibly the include_directories) section. Documentation on the CMake commands is found in the CMake documentation page.

Building FPGA Version with OpenCL Standard

Since the FPGA version needs to use modified versions of the OpenCL libraries provided by Intel, it makes more sense to just use the Makefile provided by Intel.

The first step in building is figuring out which boards are available. The following command will list all of the available boards:

aoc --list-boards

Compiling the OpenCL Kernel to Run on Board

cd $PROJECT_DIR/fpgaOpenCL_implementation/
aoc device/cnn.cl -o bin/cnn.aocx --board <board>

Compiling the OpenCL Kernel for Emulator

cd $PROJECT_DIR/fpgaOpenCL_implementation/
aoc -march=emulator device/cnn.cl -o bin/cnn.aocx --board <board>

Compiling the OpenCL Host Code

Assuming that you have checked out the project into $PROJECT_DIR, execute the following sequence of commands:

cd $PROJECT_DIR/fpgaOpenCL_implementation/
make

This will generate a bin folder that will contain the executable that can run the FPGA version of the code.

How to Run Code

Test your implementation with a small batch size first to verify correctness. You can parse data/test100.hdf5 into smaller chunks using your preferred language (e.g., Python; see the sketch below). Batches of 2, 10, and 100 queries are provided in data/test2.hdf5, data/test10.hdf5, and data/test100.hdf5 in the data folder. Make sure the data file you feed in has the same batch size as the batch_size you specify on the command line.
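
As an illustration, a small splitter written against the HDF5 C API might look like the sketch below. The dataset name "input", the float element type, and the output filename are assumptions, not facts taken from this repository; inspect the actual file structure first with h5dump -H data/test100.hdf5.

// split_hdf5.cpp -- illustrative sketch: copy the first n queries of one
// dataset into a smaller file. The dataset name "input", the float element
// type, and the output filename are hypothetical; check the real layout
// with `h5dump -H` first.
#include <hdf5.h>
#include <vector>

int main() {
    const hsize_t n = 5;                                   // new batch size

    hid_t in  = H5Fopen("data/test100.hdf5", H5F_ACC_RDONLY, H5P_DEFAULT);
    hid_t out = H5Fcreate("data/test5.hdf5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

    hid_t dset  = H5Dopen2(in, "input", H5P_DEFAULT);       // hypothetical name
    hid_t space = H5Dget_space(dset);
    int rank = H5Sget_simple_extent_ndims(space);
    std::vector<hsize_t> dims(rank);
    H5Sget_simple_extent_dims(space, dims.data(), nullptr);

    // Select the first n entries along the batch (first) dimension.
    std::vector<hsize_t> start(rank, 0), count(dims);
    count[0] = n;
    H5Sselect_hyperslab(space, H5S_SELECT_SET, start.data(), nullptr, count.data(), nullptr);

    hsize_t elems = 1;
    for (hsize_t d : count) elems *= d;
    std::vector<float> buf(elems);
    hid_t memspace = H5Screate_simple(rank, count.data(), nullptr);
    H5Dread(dset, H5T_NATIVE_FLOAT, memspace, space, H5P_DEFAULT, buf.data());

    // Write the slice to the new file under the same dataset name.
    hid_t ospace = H5Screate_simple(rank, count.data(), nullptr);
    hid_t odset  = H5Dcreate2(out, "input", H5T_NATIVE_FLOAT, ospace,
                              H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    H5Dwrite(odset, H5T_NATIVE_FLOAT, H5S_ALL, H5S_ALL, H5P_DEFAULT, buf.data());

    // Repeat for the label dataset, then close all handles.
    H5Dclose(odset); H5Sclose(ospace); H5Sclose(memspace);
    H5Sclose(space); H5Dclose(dset); H5Fclose(out); H5Fclose(in);
    return 0;
}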

To run the code, you need to be in the build or bin directories of a given version of the project.

CUDA Version

./cuda_CNN ../../data/test10.hdf5 ../../data/model.hdf5 10

OpenCL Version

./openCL_CNN ../../data/test10.hdf5 ../../data/model.hdf5 10

FPGA Version on Board

./fpga_CNN ../../../data/test10.hdf5 ../../../data/model.hdf5 10

FPGA Version with Emulator

CL_CONTEXT_EMULATOR_DEVICE_ALTERA=1 ./fpga_CNN ../../../data/test10.hdf5 ../../../data/model.hdf5 10

Results

The following are the results on a Tesla P100 GPU as of 4/22/2017.

CUDA Version

test10: 15.9631 milliseconds

test100: 17.6193 milliseconds

testfull: 524.852 milliseconds

OpenCL Version

test10: 3.28717 milliseconds

test100: 3.76949 milliseconds

testfull: 34.1698 milliseconds

FPGA Version on Board

Access to a system with an FPGA board is needed to test this version of the code.

FPGA Version with Emulator

test10: 7423.32 milliseconds

test100: 68188.2 milliseconds

testfull: TIMES OUT

Note: Major optimizations need to be made before the results from the FPGA version can match the results from the OpenCL and CUDA versions of the code.

Reporting Issues

Please use the GitHub issue manager to report any issues or suggestions about the project.

Resources Used
