Shreyansh Singh's starred repositories
llama-stack-apps
Agentic components of the Llama Stack APIs
Liger-Kernel
Efficient Triton Kernels for LLM Training
flash-linear-attention
Efficient implementations of state-of-the-art linear attention models in PyTorch and Triton
nano-llama31
A nanoGPT-style implementation of Llama 3.1
Efficient-LLMs-Survey
[TMLR 2024] Efficient Large Language Models: A Survey
prompt-poet
Streamlines and simplifies prompt design for both developers and non-technical users with a low-code approach.
llm-compressor
Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM
cuda_hgemm
Several optimization methods for half-precision general matrix multiplication (HGEMM) using Tensor Cores via the WMMA API and MMA PTX instructions.
Awesome_LLM_System-PaperList
Since the emergence of ChatGPT in 2022, accelerating Large Language Models has become increasingly important. This is a list of papers on LLM acceleration, currently focused mainly on inference; related work will be added over time. Contributions are welcome!
awesome-llm-planning-reasoning
A curated collection of LLM reasoning and planning resources, including key papers, limitations, benchmarks, and additional learning materials.
applied-ai
Applied AI experiments and examples for PyTorch
flashattention2-custom-mask
Triton implementation of FlashAttention2 with support for custom masks.
Guide-NVIDIA-Tools
NVIDIA tools guide
cuda_hgemv
Several optimization methods for half-precision general matrix-vector multiplication (HGEMV) using CUDA cores.
lovely-llama
An implementation of the Llama architecture, to instruct and delight
hip-attention
Training-free, post-training, sub-quadratic-complexity attention, implemented with OpenAI Triton.