There are 16 repositories under the cuda-kernels topic.
Samples for CUDA developers that demonstrate features in the CUDA Toolkit
📚200+ Tensor/CUDA Cores Kernels, ⚡️flash-attn-mma, ⚡️hgemm with WMMA, MMA and CuTe (98%~100% TFLOPS of cuBLAS/FA2 🎉🎉).
Deep learning in Rust, with shape checked tensors and neural networks
Safe Rust wrapper around the CUDA toolkit
Simple utilities to enable code reuse and portability between CUDA C/C++ and standard C/C++.
Kernel Tuner
This is an archive of materials produced for an introductory class on CUDA programming at Stanford University in 2010
Amplifier allows .NET developers to easily run complex applications with intensive mathematical computation on Intel CPU/GPU, NVIDIA, AMD without writing any additional C kernel code. Write your function in .NET and Amplifier will take care of running it on your favorite hardware.
Some CUDA design patterns and a bit of template magic for CUDA
Open source cross-platform compiler for compute-intensive loops used in AI algorithms, from Microsoft Research
High-speed GEMV kernels, with up to 2.7x speedup over the PyTorch baseline.
Triton implementation of FlashAttention2 that adds Custom Masks.
A tool for examining GPU scheduling behavior.
Speed up image preprocessing with CUDA when handling images or running TensorRT inference
CUDA Guide
(REOS) Radar and Electro-Optical Simulation Framework written in C++.
Astrophysics program simulating the evolution of star systems based on the fast multipole method on adaptive Octrees
Implementation of the conjugate gradient method using C and NVIDIA CUDA
(REOS) Radar and ElectroOptical Simulation Framework written in Fortran.
Using custom CUDA kernels with OpenCV Mat objects.
Bandicoot: C++ library for GPU linear algebra & scientific computing - https://coot.sourceforge.io
Faster PyTorch bitsandbytes 4-bit FP4 nn.Linear ops
CUDA C implementation of Principal Component Analysis (PCA) through Singular Value Decomposition (SVD) using a highly parallelisable version of the Jacobi eigenvalue algorithm.
Quantum-inspired evolutionary algorithms for optimization problems
This is a Lattice-Boltzmann simulation using CUDA GPU graphics optimization.
StiffMa: Fast finite element STIFFness MAtrix generation in MATLAB by using GPU computing.
CUDA code mainly targets NVIDIA hardware. This repo shows how to run CUDA C or CUDA C++ code on the Google Colab platform for free.
2D Game texture special effects
Attention Kernels for Symmetric Power Transformers