Comparison between the Halide and CUDA versions of an app that partitions a 4096x4096 image into 32x32 tiles and computes the summed area table within each tile.
Halide version: runs several kernels. Only version 0 computes the summed area table; the other kernels are meant to demonstrate the effect of different Halide update definitions on instruction count and global memory throughput.
CUDA version: source code from "GPU-Efficient Recursive Filtering and Summed-Area Tables" [Nehab et al., SIGGRAPH Asia 2011].
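For reference, the per-tile computation described above can be sketched on the CPU as follows. This is only an illustrative sketch and is not taken from either the Halide or the CUDA project: the function name tiled_sat, the row-major float layout, and the requirement that the image dimensions are multiples of the tile size are all assumptions made for the example.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical CPU reference: summed area table computed independently
// inside each tile x tile block of a row-major image.
// Assumption: width and height are multiples of tile.
std::vector<float> tiled_sat(const std::vector<float>& in,
                             int width, int height, int tile) {
    std::vector<float> out(in.size(), 0.0f);
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            // Sums restart at every tile boundary, so neighbours that lie
            // outside the current tile contribute zero.
            const bool first_col = (x % tile == 0);
            const bool first_row = (y % tile == 0);
            const std::size_t idx = static_cast<std::size_t>(y) * width + x;
            const float left = first_col ? 0.0f : out[idx - 1];
            const float up = first_row ? 0.0f : out[idx - width];
            const float diag =
                (first_col || first_row) ? 0.0f : out[idx - width - 1];
            // Standard inclusion-exclusion recurrence for a summed area table.
            out[idx] = in[idx] + left + up - diag;
        }
    }
    return out;
}
```

With an all-ones input, every output entry equals (local_x + 1) * (local_y + 1), where local_x and local_y are the coordinates within the tile, so the last entry of each 32x32 tile is 1024. This gives a quick sanity check for the GPU results.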
- A Makefile is provided for both projects
- Edit Makefile.common to set the CUDA include path and Halide base path
The directory nv_profile contains profiling logs from the NVIDIA profiling tools. They can be opened using
$ nvvp cuda_summed_table.nvvp
$ nvvp hl_summed_table.nvvp
The directories ptx and stmt contain the generated PTX and statement files for the different Halide kernels. These can be regenerated by:
$ HL_JIT_TARGET=cuda-gpu_debug HL_DEBUG_CODEGEN=1 ./hl_summed_table 2> hl_summed_table.ptx