There are 16 repositories under the cuda-kernels topic.
Samples for CUDA developers that demonstrate features in the CUDA Toolkit
Deep learning in Rust, with shape checked tensors and neural networks
Safe Rust wrapper around the CUDA toolkit
🚀 Your YOLO Deployment Powerhouse. With the synergy of TensorRT Plugins, CUDA Kernels, and CUDA Graphs, experience lightning-fast inference speeds.
Kernel Tuner
This is an archive of materials produced for an introductory class on CUDA programming at Stanford University in 2010
Amplifier allows .NET developers to easily run complex applications with intensive mathematical computation on Intel CPUs/GPUs, NVIDIA, and AMD hardware without writing any additional C kernel code. Write your function in .NET and Amplifier will take care of running it on your favorite hardware.
Some CUDA design patterns and a bit of template magic for CUDA
High-speed GEMV kernels, with up to 2.7x speedup over the PyTorch baseline.
Triton implementation of FlashAttention2 that adds custom masks.
A tool for examining GPU scheduling behavior.
(REOS) Radar and Electro-Optical Simulation Framework written in C++.
CUDA Guide
Speed up image preprocessing with CUDA for image handling or TensorRT inference
(REOS) Radar and Electro-Optical Simulation Framework written in Fortran.
Astrophysics program simulating the evolution of star systems based on the fast multipole method on adaptive Octrees
Implementation of the conjugate gradient method using C and NVIDIA CUDA
Using custom CUDA kernels with OpenCV Mat objects.
Bandicoot: C++ library for GPU linear algebra & scientific computing - https://coot.sourceforge.io
CUDA C implementation of Principal Component Analysis (PCA) through Singular Value Decomposition (SVD) using a highly parallelisable version of the Jacobi eigenvalue algorithm.
Quantum-inspired evolutionary algorithms for optimization problems
This is a Lattice-Boltzmann simulation using CUDA GPU graphics optimization.
2D Game texture special effects
Faster PyTorch bitsandbytes 4-bit FP4 nn.Linear ops
Implement neural networks in CUDA from scratch
A complete beginner's introduction to programming with CUDA Fortran