Shreyansh Singh's starred repositories
llama-stack-apps
Agentic components of the Llama Stack APIs
Liger-Kernel
Efficient Triton Kernels for LLM Training
flash-linear-attention
Efficient implementations of state-of-the-art linear attention models in PyTorch and Triton
nano-llama31
A nanoGPT-style implementation of Llama 3.1
Efficient-LLMs-Survey
[TMLR 2024] Efficient Large Language Models: A Survey
prompt-poet
Streamlines and simplifies prompt design for both developers and non-technical users with a low-code approach.
llm-compressor
Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM
cuda_hgemm
Several optimization methods for half-precision general matrix multiplication (HGEMM) using Tensor Cores via the WMMA API and MMA PTX instructions.
Awesome_LLM_System-PaperList
Since the emergence of ChatGPT in 2022, accelerating Large Language Models has become increasingly important. This is a list of papers on LLM acceleration, currently focused mainly on inference; related work will be added over time. Contributions are welcome!
awesome-llm-planning-reasoning
A curated collection of LLM reasoning and planning resources, including key papers, limitations, benchmarks, and additional learning materials.
applied-ai
Applied AI experiments and examples for PyTorch
flashattention2-custom-mask
Triton implementation of FlashAttention2 with support for custom masks.
Guide-NVIDIA-Tools
NVIDIA tools guide
cuda_hgemv
Several optimization methods for half-precision general matrix-vector multiplication (HGEMV) using CUDA cores.
lovely-llama
An implementation of the Llama architecture, to instruct and delight
hip-attention
Training-free, post-training, sub-quadratic-complexity attention, implemented with OpenAI Triton.