A tiny CUDA micro-benchmark to compare two launch strategies for a simple 4-stage pipeline of kernels A → B → C → D:
- Host launch — the CPU launches kernels A, B, C, D sequentially on a CUDA stream.
- Dynamic parallelism (DP) — a small parent kernel launches A, B, C, D from the device using `cudaStreamTailLaunch` (sketched below).
The kernels perform deterministic, moderately compute-heavy math so we can focus on launch overhead and scheduling rather than memory I/O.
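As a rough illustration of the DP strategy, here is a minimal sketch. The kernel names, signatures, and the FMA loop are placeholders rather than this repo's actual code, and it assumes CUDA 12+ with separate compilation (`-rdc=true`):

```cuda
#include <cuda_runtime.h>

// Illustrative stage kernel: deterministic per-thread ALU work scaled
// by `iters` (a stand-in for the real A/B/C/D kernels).
__global__ void stage(float* data, int n, int iters) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float x = data[i];
    for (int k = 0; k < iters; ++k)
        x = fmaf(x, 1.0000001f, 0.5f);  // compute-heavy, no extra memory traffic
    data[i] = x;
}

// Hypothetical parent kernel, launched <<<1,1>>> from the host: a single
// thread enqueues the four stages into the tail-launch stream.
__global__ void parent(float* data, int n, int iters, int block) {
    int grid = (n + block - 1) / block;
    stage<<<grid, block, 0, cudaStreamTailLaunch>>>(data, n, iters); // A
    stage<<<grid, block, 0, cudaStreamTailLaunch>>>(data, n, iters); // B
    stage<<<grid, block, 0, cudaStreamTailLaunch>>>(data, n, iters); // C
    stage<<<grid, block, 0, cudaStreamTailLaunch>>>(data, n, iters); // D
}
```

Tail launches begin only after the parent grid completes and, following stream ordering, run one after another, which keeps the A → B → C → D chain strictly serialized without any device-side synchronization call.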
- CUDA Toolkit (e.g., 12.x) with `nvcc` in PATH
- CMake 3.24+
- A C++17 compiler
- Configure

```
# from repo root
python3 scripts/setup.py configure --export-compile-commands
```

- Build

```
python3 scripts/setup.py build --config Release
```

- Run

```
# Run with custom problem size / work and more iterations (Release)
python3 scripts/setup.py run --config Release -- --n 2000000 --iters 128 --runs 20 --block 256
```

The executable accepts:
```
--n <int>       # number of elements (default: 1<<20)
--iters <int>   # per-thread ALU work scaling (default: 64)
--runs <int>    # measured repetitions for each strategy (default: 10)
--warmup <int>  # warmup iterations per strategy (default: 3)
--block <int>   # threads per block (default: 256)
--device <int>  # CUDA device index (default: 0)
```
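To show how `--warmup` and `--runs` typically enter the measurement, here is a hedged sketch of a host-sequenced timing loop using CUDA events; `stage` is the placeholder kernel from the sketch above, and the real harness may be structured differently:

```cuda
// Hypothetical host-side harness: warm up, then time `runs` repetitions
// of the host-launched A -> B -> C -> D chain with CUDA events.
float time_host_chain(float* d_data, int n, int iters, int block,
                      int runs, int warmup, cudaStream_t s) {
    int grid = (n + block - 1) / block;
    for (int w = 0; w < warmup; ++w)
        for (int k = 0; k < 4; ++k)                // A, B, C, D
            stage<<<grid, block, 0, s>>>(d_data, n, iters);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, s);
    for (int r = 0; r < runs; ++r)
        for (int k = 0; k < 4; ++k)
            stage<<<grid, block, 0, s>>>(d_data, n, iters);
    cudaEventRecord(stop, s);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / runs;                              // average ms per run
}
```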
In practice, host-sequenced launches are often slightly faster for this chained, strictly serialized workload: the four stages cannot overlap, so device-side launching adds overhead without exposing any additional parallelism.