weishengying's repositories
Cute_exercise
flash-attention
Fast and memory-efficient exact attention
OmniQuant
[ICLR2024 spotlight] OmniQuant is a simple and powerful quantization technique for LLMs.
tiny-flash-attention
flash attention tutorial written in python, triton, cuda, cutlass
CUDA-Learn-Notes
🎉CUDA notes / hand-written CUDA kernels for LLMs / C++ notes, updated occasionally: flash_attn, sgemm, sgemv, warp reduce, block reduce, dot product, elementwise, softmax, layernorm, rmsnorm, hist, etc.
AutoGPTQ
An easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm.
llm-awq
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
smoothquant
[ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
TensorRT-LLM
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
cutlass
CUDA Templates for Linear Algebra Subroutines
FasterTransformer
Transformer related optimization, including BERT, GPT
SGEMM_CUDA
Fast CUDA matrix multiplication from scratch
cub
[ARCHIVED] Cooperative primitives for CUDA C++. See https://github.com/NVIDIA/cccl
NVIDIA_SGEMM_PRACTICE
Step-by-step optimization of CUDA SGEMM
torch-int
This repository contains integer operators on GPUs for PyTorch.
LLMSpeculativeSampling
Fast inference from large language models via speculative decoding
vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
DeepSpeed
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
Megatron-LM
Ongoing research training transformer models at scale