Bruce-Lee-LY's repositories
cuda_hgemm
Several optimization methods for half-precision general matrix multiplication (HGEMM) using Tensor Cores via the WMMA API and MMA PTX instructions.
cuda_hgemv
Several optimization methods for half-precision general matrix-vector multiplication (HGEMV) using CUDA cores.
decoding_attention
Decoding Attention is specially optimized for MHA, MQA, GQA, and MLA using CUDA cores, targeting the decoding stage of LLM inference.
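To illustrate the computation these kernels optimize: at the decode stage, a single query vector attends over the cached keys and values of all previous tokens. A minimal single-head CPU sketch (function and parameter names are illustrative, not from the repository):

```cpp
#include <cmath>
#include <vector>

// Decode-stage attention for one query vector q against a K/V cache of
// seq_len entries: out = softmax(q . K^T / sqrt(d)) . V
// Names are illustrative; real kernels run this per head on the GPU.
std::vector<float> decode_attention(const std::vector<float>& q,  // [d]
                                    const std::vector<float>& K,  // [seq_len * d]
                                    const std::vector<float>& V,  // [seq_len * d]
                                    int seq_len, int d) {
    std::vector<float> scores(seq_len);
    float scale = 1.0f / std::sqrt(static_cast<float>(d));
    float max_s = -1e30f;
    for (int t = 0; t < seq_len; ++t) {
        float s = 0.0f;
        for (int i = 0; i < d; ++i) s += q[i] * K[t * d + i];
        scores[t] = s * scale;
        if (scores[t] > max_s) max_s = scores[t];
    }
    float denom = 0.0f;
    for (int t = 0; t < seq_len; ++t) {
        scores[t] = std::exp(scores[t] - max_s);  // numerically stable softmax
        denom += scores[t];
    }
    std::vector<float> out(d, 0.0f);
    for (int t = 0; t < seq_len; ++t)
        for (int i = 0; i < d; ++i)
            out[i] += (scores[t] / denom) * V[t * d + i];
    return out;
}
```

Because there is only one query row, the work is dominated by memory-bound matrix-vector products over the K/V cache, which is why CUDA cores (rather than Tensor Cores) suit this stage.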
flash_attention_inference
Benchmarks the performance of the C++ interfaces of Flash Attention and Flash Attention v2 in large language model (LLM) inference scenarios.
cutlass_gemm
Multiple GEMM operators built with CUTLASS to support LLM inference.
matrix_multiply
Several common matrix multiplication methods implemented on the CPU and NVIDIA GPUs using C++11 and CUDA.
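The usual starting point for such comparisons is the naive triple-loop GEMM, which every optimized variant is measured against. A minimal sketch (the function name and the i-k-j loop order are illustrative):

```cpp
#include <vector>

// Naive O(n^3) reference GEMM: C = A * B for n x n row-major matrices.
// The i-k-j loop order keeps the innermost accesses to B and C contiguous,
// a common first step before blocking/tiling or moving to the GPU.
std::vector<float> matmul_naive(const std::vector<float>& A,
                                const std::vector<float>& B, int n) {
    std::vector<float> C(n * n, 0.0f);
    for (int i = 0; i < n; ++i)
        for (int k = 0; k < n; ++k)
            for (int j = 0; j < n; ++j)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
    return C;
}
```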
cuda_back2back_hgemm
Uses Tensor Cores to compute back-to-back HGEMM (half-precision general matrix multiplication) with MMA PTX instructions.
memory_pool
A simple and efficient memory pool implemented in C++11.
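A common shape for such a pool is a fixed-size-block allocator: pre-allocate one slab, thread the blocks onto a free list, and serve allocate/deallocate in O(1) without touching the system allocator. A minimal C++11 sketch (class and member names are illustrative, not the repository's API):

```cpp
#include <cstddef>
#include <vector>

// Fixed-size-block memory pool: one backing slab, blocks recycled
// through a free list. Not thread-safe; names are illustrative.
class MemoryPool {
public:
    MemoryPool(std::size_t block_size, std::size_t block_count)
        : slab_(block_size * block_count) {
        // Thread every block onto the free list up front.
        for (std::size_t i = 0; i < block_count; ++i)
            free_list_.push_back(slab_.data() + i * block_size);
    }
    void* allocate() {
        if (free_list_.empty()) return nullptr;  // pool exhausted
        void* p = free_list_.back();
        free_list_.pop_back();
        return p;
    }
    void deallocate(void* p) {
        free_list_.push_back(static_cast<char*>(p));
    }
private:
    std::vector<char> slab_;        // single backing allocation
    std::vector<char*> free_list_;  // addresses of free blocks
};
```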
thread_pool
A thread pool that processes a task queue, implemented in C++11.
deep_learning
Training and inference of several common deep learning models implemented with TensorFlow and PyTorch.
algorithm_design
Solves several common problems with classic algorithm design techniques in C++11.
data_structure
Several commonly used data structures implemented in C++11.
machine_learning
Several common machine learning algorithms implemented with scikit-learn.