xuqiantong / CUDA-Winograd

Fast CUDA Kernels for ResNet Inference.

Introduction

This code implements fast CUDA kernels for DNN inference, in particular for the convolution layers / residual blocks in ResNet. Each kernel fuses three operations into one:

  • Convolution
  • Batch Normalization (BN + Scale)
  • Activation (ReLU)
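Fusing BN (and scale) into the convolution is possible because, at inference time, the per-channel normalization can be folded into the filter weights and a bias. A minimal NumPy sketch of that folding (shapes and values are illustrative, not taken from the repo):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative shapes: C input channels, K output channels, 3x3 filters.
C, K, H, W = 4, 8, 6, 6
x = rng.standard_normal((C, H, W))
w = rng.standard_normal((K, C, 3, 3))

# Per-output-channel BN parameters (BN + Scale).
gamma = rng.standard_normal(K)
beta = rng.standard_normal(K)
mean = rng.standard_normal(K)
var = rng.random(K) + 0.5
eps = 1e-5

def conv3x3(x, w):
    """Naive valid 3x3 cross-correlation, summed over input channels."""
    C, H, W = x.shape
    out = np.zeros((w.shape[0], H - 2, W - 2))
    for k in range(w.shape[0]):
        for i in range(H - 2):
            for j in range(W - 2):
                out[k, i, j] = np.sum(x[:, i:i+3, j:j+3] * w[k])
    return out

# Unfused path: conv, then BN, then ReLU.
y = conv3x3(x, w)
y_bn = (gamma[:, None, None] * (y - mean[:, None, None])
        / np.sqrt(var[:, None, None] + eps) + beta[:, None, None])
ref = np.maximum(y_bn, 0.0)

# Fused path: fold BN into the weights and a bias, then one conv + ReLU.
scale = gamma / np.sqrt(var + eps)
w_folded = w * scale[:, None, None, None]
b_folded = beta - scale * mean
fused = np.maximum(conv3x3(x, w_folded) + b_folded[:, None, None], 0.0)

assert np.allclose(ref, fused)
```

The CUDA kernels in this repo go further than this algebraic folding, since they also avoid the extra global-memory round trips between the three stages.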

For implementation details, please refer to the technical report included in this repo. The Winograd algorithm is used for the 3 * 3 convolution kernels.
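The idea behind Winograd convolution is to trade multiplications for additions via fixed input/filter/output transforms. The repo applies it to 2D 3 * 3 filters; the 1D F(2,3) case below shows the principle with the standard transform matrices (this sketch is illustrative and not code from the repo):

```python
import numpy as np

# Standard F(2,3) Winograd transform matrices.
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=float)   # input transform
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])                # filter transform
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=float)    # output transform

d = np.array([1.0, 2.0, 3.0, 4.0])  # 4 input samples
g = np.array([0.5, -1.0, 2.0])      # 3-tap filter

# Two outputs with 4 elementwise multiplies instead of 6.
y = AT @ ((G @ g) * (BT @ d))

# Direct 1D correlation for comparison.
direct = np.array([sum(d[i + j] * g[j] for j in range(3)) for i in range(2)])
assert np.allclose(y, direct)
```

For 2D 3 * 3 filters the same transforms are applied along both axes, and the filter transform can be precomputed once per layer.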

Usage

mkdir data
python data_generator.py
make
./Test 0
  • Set parameters in data_generator.py
  • Run the 6 test cases by changing the number after ./Test from 0 to 5

Results

3 * 3 Kernels

| Kernels | Operations | 128 / 128 | 256 / 256 |
| --- | --- | --- | --- |
| cuDNN | GEMM + BN + ReLU | 214 us | 384 us |
| cuDNN | Winograd + BN + ReLU | 95 us | 155 us |
| Our Kernel | Winograd + BN + ReLU | 59 us | 117 us |

1 * 1 Kernels [BUGGY NUMBERS]

| Kernels | 512 / 128 | 128 / 512 | 1024 / 256 | 256 / 1024 |
| --- | --- | --- | --- | --- |
| Operations | GEMM + BN + ReLU | GEMM + BN | GEMM + BN + ReLU | GEMM + BN + ReLU |
| cuDNN | 119 us | 115 us | 219 us | 214 us |
| Our Kernel | 58 us | 55 us | 186 us | 181 us |
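As a quick sanity check, the speedups implied by the 3 * 3 table can be computed directly from the timings above:

```python
# Timings in microseconds, copied from the 3 * 3 results table.
timings = {
    "cudnn_gemm":     {"128/128": 214, "256/256": 384},
    "cudnn_winograd": {"128/128": 95,  "256/256": 155},
    "ours":           {"128/128": 59,  "256/256": 117},
}

for cfg in ("128/128", "256/256"):
    vs_wino = timings["cudnn_winograd"][cfg] / timings["ours"][cfg]
    vs_gemm = timings["cudnn_gemm"][cfg] / timings["ours"][cfg]
    print(f"{cfg}: {vs_wino:.2f}x vs cuDNN Winograd, {vs_gemm:.2f}x vs cuDNN GEMM")
```

So the fused kernel is roughly 1.3-1.6x faster than cuDNN's Winograd path and over 3x faster than cuDNN's GEMM path on these configurations.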

Languages

CUDA 83.7%, C 8.8%, Python 6.7%, Makefile 0.9%