Bruce-Lee-LY's repositories
cuda_hgemm
Several optimization methods for half-precision general matrix multiplication (HGEMM) using Tensor Cores via the WMMA API and MMA PTX instructions.
cuda_hgemv
Several optimization methods for half-precision general matrix-vector multiplication (HGEMV) using CUDA cores.
decoding_attention
Decoding Attention is specially optimized for MHA, MQA, GQA, and MLA using CUDA cores, targeting the decoding stage of LLM inference.
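To illustrate the computation these kernels optimize: at the decode stage, a single query vector attends over the cached keys and values of all previous tokens. A minimal single-head CPU sketch (function and parameter names are illustrative, not from the repository):

```cpp
#include <cmath>
#include <vector>

// Decode-stage attention for one query vector q against a K/V cache of
// seq_len entries: out = softmax(q . K^T / sqrt(d)) . V
// Names are illustrative; real kernels run this per head on the GPU.
std::vector<float> decode_attention(const std::vector<float>& q,  // [d]
                                    const std::vector<float>& K,  // [seq_len * d]
                                    const std::vector<float>& V,  // [seq_len * d]
                                    int seq_len, int d) {
    std::vector<float> scores(seq_len);
    float scale = 1.0f / std::sqrt(static_cast<float>(d));
    float max_s = -1e30f;
    for (int t = 0; t < seq_len; ++t) {
        float s = 0.0f;
        for (int i = 0; i < d; ++i) s += q[i] * K[t * d + i];
        scores[t] = s * scale;
        if (scores[t] > max_s) max_s = scores[t];
    }
    float denom = 0.0f;
    for (int t = 0; t < seq_len; ++t) {
        scores[t] = std::exp(scores[t] - max_s);  // numerically stable softmax
        denom += scores[t];
    }
    std::vector<float> out(d, 0.0f);
    for (int t = 0; t < seq_len; ++t)
        for (int i = 0; i < d; ++i)
            out[i] += (scores[t] / denom) * V[t * d + i];
    return out;
}
```

Because there is only one query row, the work is dominated by memory-bound matrix-vector products over the K/V cache, which is why CUDA cores (rather than Tensor Cores) suit this stage.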
flash_attention_inference
Benchmarks the performance of the C++ interfaces of Flash Attention and Flash Attention v2 in large language model (LLM) inference scenarios.
cutlass_gemm
Multiple GEMM operators built with CUTLASS to support LLM inference.
matrix_multiply
Several common matrix multiplication methods implemented on the CPU and NVIDIA GPUs using C++11 and CUDA.
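The usual starting point for such comparisons is the naive triple-loop GEMM, which every optimized variant is measured against. A minimal sketch (the function name and the i-k-j loop order are illustrative):

```cpp
#include <vector>

// Naive O(n^3) reference GEMM: C = A * B for n x n row-major matrices.
// The i-k-j loop order keeps the innermost accesses to B and C contiguous,
// a common first step before blocking/tiling or moving to the GPU.
std::vector<float> matmul_naive(const std::vector<float>& A,
                                const std::vector<float>& B, int n) {
    std::vector<float> C(n * n, 0.0f);
    for (int i = 0; i < n; ++i)
        for (int k = 0; k < n; ++k)
            for (int j = 0; j < n; ++j)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
    return C;
}
```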
cuda_back2back_hgemm
Uses Tensor Cores to compute back-to-back HGEMM (half-precision general matrix multiplication) with MMA PTX instructions.
memory_pool
A simple and efficient memory pool implemented in C++11.
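A common shape for such a pool is a fixed-size-block allocator: pre-allocate one slab, thread the blocks onto a free list, and serve allocate/deallocate in O(1) without touching the system allocator. A minimal C++11 sketch (class and member names are illustrative, not the repository's API):

```cpp
#include <cstddef>
#include <vector>

// Fixed-size-block memory pool: one backing slab, blocks recycled
// through a free list. Not thread-safe; names are illustrative.
class MemoryPool {
public:
    MemoryPool(std::size_t block_size, std::size_t block_count)
        : slab_(block_size * block_count) {
        // Thread every block onto the free list up front.
        for (std::size_t i = 0; i < block_count; ++i)
            free_list_.push_back(slab_.data() + i * block_size);
    }
    void* allocate() {
        if (free_list_.empty()) return nullptr;  // pool exhausted
        void* p = free_list_.back();
        free_list_.pop_back();
        return p;
    }
    void deallocate(void* p) {
        free_list_.push_back(static_cast<char*>(p));
    }
private:
    std::vector<char> slab_;        // single backing allocation
    std::vector<char*> free_list_;  // addresses of free blocks
};
```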
thread_pool
A thread pool that processes a task queue, implemented in C++11.
deep_learning
Training and inference of several common deep learning models implemented with TensorFlow and PyTorch.
algorithm_design
Solves several common problems with classic algorithm design techniques in C++11.
data_structure
Several commonly used data structures implemented in C++11.
machine_learning
Several common machine learning algorithms implemented with scikit-learn.