There are 14 repositories under the cuda-kernels topic.
Samples for CUDA developers that demonstrate features in the CUDA Toolkit
Deep learning in Rust, with shape-checked tensors and neural networks
🎉 CUDA notes / hand-written CUDA kernels for large models / C++ notes, updated sporadically: flash_attn, sgemm, sgemv, warp reduce, block reduce, dot product, elementwise, softmax, layernorm, rmsnorm, hist, etc.
Safe Rust wrapper around the CUDA toolkit
Kernel Tuner
This is an archive of materials produced for an introductory class on CUDA programming at Stanford University in 2010
Amplifier allows .NET developers to easily run complex applications with intensive mathematical computation on Intel CPU/GPU, NVIDIA, AMD without writing any additional C kernel code. Write your function in .NET and Amplifier will take care of running it on your favorite hardware.
Some CUDA design patterns and a bit of template magic for CUDA
A tool for examining GPU scheduling behavior.
High-speed GEMV kernels, with up to 2.7x speedup over the PyTorch baseline.
(REOS) Radar and Electro-Optical Simulation Framework written in C++.
CUDA Guide
Astrophysics program simulating the evolution of star systems based on the fast multipole method on adaptive Octrees
Implementation of the conjugate gradient method using C and NVIDIA CUDA
(REOS) Radar and Electro-Optical Simulation Framework written in Fortran.
Using custom CUDA kernels with OpenCV Mat objects.
Speed up image preprocessing with CUDA for image handling or TensorRT inference
CUDA C implementation of Principal Component Analysis (PCA) through Singular Value Decomposition (SVD) using a highly parallelisable version of the Jacobi eigenvalue algorithm.
Bandicoot: C++ library for GPU linear algebra & scientific computing - https://coot.sourceforge.io
Quantum-inspired evolutionary algorithms for Optimization problems
2D Game texture special effects
This is a Lattice-Boltzmann simulation using CUDA GPU graphics optimization.
🦚 🧰 Collection of basic GPU algorithms implemented in CUDA C++.
This repository contains examples of CUDA usage in Cython code.
Faster PyTorch bitsandbytes 4-bit FP4 nn.Linear ops
C++ cross-platform GPU SDK
Non-Local Means Filter for Image Denoising in CUDA