weishengying's repositories

Cute_exercise

Language: Cuda · Stargazers: 1 · Issues: 0

flash-attention

Fast and memory-efficient exact attention

Language: Python · License: BSD-3-Clause · Stargazers: 0 · Issues: 0
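
The core trick behind exact, memory-efficient attention is the online (streaming) softmax: scores are processed in KV tiles with a running row-max and row-sum, so the full n×n score matrix is never materialized. A minimal PyTorch sketch of that recurrence (single head, no masking; tile size and shapes are illustrative, not this repo's kernels):

```python
import torch

def attention_tiled(q, k, v, tile=64):
    # q: (n, d); k, v: (m, d). Processes K/V in tiles of `tile` rows.
    n, d = q.shape
    scale = d ** -0.5
    out = torch.zeros_like(q)
    row_max = torch.full((n, 1), float("-inf"))   # running row maximum
    row_sum = torch.zeros(n, 1)                   # running softmax denominator
    for start in range(0, k.shape[0], tile):
        kb, vb = k[start:start + tile], v[start:start + tile]
        s = (q @ kb.T) * scale                    # scores for this tile only
        new_max = torch.maximum(row_max, s.max(dim=-1, keepdim=True).values)
        p = torch.exp(s - new_max)
        correction = torch.exp(row_max - new_max)  # rescale old partial results
        row_sum = row_sum * correction + p.sum(dim=-1, keepdim=True)
        out = out * correction + p @ vb
        row_max = new_max
    return out / row_sum
```

Up to floating-point error, the result matches `torch.softmax(q @ k.T * scale, -1) @ v`.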

OmniQuant

[ICLR2024 spotlight] OmniQuant is a simple and powerful quantization technique for LLMs.

License: MIT · Stargazers: 0 · Issues: 0

tiny-flash-attention

Flash attention tutorial written in Python, Triton, CUDA, and CUTLASS

Stargazers: 0 · Issues: 0

CUDA-Learn-Notes

🎉CUDA notes / hand-written CUDA kernels for large models / C++ notes, updated sporadically: flash_attn, sgemm, sgemv, warp reduce, block reduce, dot product, elementwise, softmax, layernorm, rmsnorm, hist, etc.

License: GPL-3.0 · Stargazers: 0 · Issues: 0

AutoAWQ

AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference.

License: MIT · Stargazers: 1 · Issues: 0

AutoGPTQ

An easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm.

License: MIT · Stargazers: 0 · Issues: 0
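
GPTQ quantizes weights column by column, using second-order (Hessian) information to fold each column's quantization error back into the not-yet-quantized columns. As a point of reference, here is the plain round-to-nearest (RTN) INT4 baseline that GPTQ improves on (a toy sketch; shapes and the symmetric range are illustrative):

```python
import numpy as np

def rtn_int4(w):
    # Per-output-channel symmetric round-to-nearest to 4 bits.
    qmax = 7  # symmetric int4 range is [-8, 7]
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

w = np.random.randn(8, 16).astype(np.float32)
q, s = rtn_int4(w)
w_hat = q * s                       # dequantized weights
print(np.abs(w - w_hat).max())      # worst-case rounding error
```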

llm-awq

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

License: MIT · Stargazers: 0 · Issues: 0
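
AWQ's observation is that a small fraction of weight channels matter most, and they can be protected by a per-input-channel rescale that is mathematically lossless: X @ W == (X / s) @ (s[:, None] * W). A toy numpy sketch of that identity with activation-magnitude scales (the real method searches the scale exponent; the fixed alpha here is an assumption):

```python
import numpy as np

def awq_like_rescale(x, w, alpha=0.5):
    # Per-input-channel scales from activation magnitudes: salient
    # channels get scaled up in W, so 4-bit rounding hurts them less.
    s = np.maximum(np.abs(x).mean(axis=0) ** alpha, 1e-5)
    return x / s, w * s[:, None]

x = np.random.randn(32, 16).astype(np.float32)
w = np.random.randn(16, 8).astype(np.float32)
x2, w2 = awq_like_rescale(x, w)
print(np.allclose(x @ w, x2 @ w2, atol=1e-4))  # the rescale itself is exact
```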

smoothquant

[ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

License: MIT · Stargazers: 0 · Issues: 0
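
SmoothQuant uses the same kind of equivalent rescale, but aimed at W8A8: activation outliers are divided out of X and migrated into W, with s_j = max|X_j|^α / max|W_j|^(1-α), so both operands become INT8-friendly. A toy sketch of the migration (α = 0.5 assumed):

```python
import numpy as np

def smooth(x, w, alpha=0.5):
    # s_j balances per-channel ranges: large activation channels shrink,
    # and the factor moves into the corresponding weight row.
    s = np.abs(x).max(axis=0) ** alpha / np.abs(w).max(axis=1) ** (1 - alpha)
    return x / s, w * s[:, None]

x = np.random.randn(32, 16) * np.linspace(1, 50, 16)  # synthetic outlier channels
w = np.random.randn(16, 8)
x2, w2 = smooth(x, w)
print(np.allclose(x @ w, x2 @ w2))  # output unchanged
ranges = np.abs(x).max(axis=0), np.abs(x2).max(axis=0)
print(ranges[0].max() / ranges[0].min(), ranges[1].max() / ranges[1].min())
```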

MoE

MoE layer for pytorch

Language: C++ · Stargazers: 1 · Issues: 0
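
For orientation, here is what an MoE layer computes, as a minimal PyTorch sketch (this repo is a C++ implementation; the top-k gated version below with linear experts is illustrative, not its actual API):

```python
import torch

class TopKMoE(torch.nn.Module):
    def __init__(self, dim, num_experts=4, k=2):
        super().__init__()
        self.gate = torch.nn.Linear(dim, num_experts)   # router
        self.experts = torch.nn.ModuleList(
            torch.nn.Linear(dim, dim) for _ in range(num_experts)
        )
        self.k = k

    def forward(self, x):                                # x: (tokens, dim)
        # Each token is routed to its k highest-scoring experts.
        weights, idx = self.gate(x).softmax(-1).topk(self.k, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out
```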

TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.

License: Apache-2.0 · Stargazers: 0 · Issues: 0

cutlass

CUDA Templates for Linear Algebra Subroutines

License: NOASSERTION · Stargazers: 0 · Issues: 0

FasterTransformer

Transformer related optimization, including BERT, GPT

License: Apache-2.0 · Stargazers: 0 · Issues: 0

SGEMM_CUDA

Fast CUDA matrix multiplication from scratch

Stargazers: 0 · Issues: 0
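
The central idea such SGEMM kernels optimize is tiling: each output tile is accumulated from small blocks of A and B that fit in shared memory, maximizing data reuse. The same loop structure, sketched in numpy (tile size illustrative; the CUDA versions add shared memory, register blocking, and vectorized loads):

```python
import numpy as np

def blocked_matmul(a, b, tile=32):
    # Each (i, j) output tile accumulates partial products over k-tiles,
    # mirroring the shared-memory tiling of a CUDA SGEMM kernel.
    m, k = a.shape
    _, n = b.shape
    c = np.zeros((m, n), dtype=a.dtype)
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            for p in range(0, k, tile):
                c[i:i+tile, j:j+tile] += a[i:i+tile, p:p+tile] @ b[p:p+tile, j:j+tile]
    return c
```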

cub

[ARCHIVED] Cooperative primitives for CUDA C++. See https://github.com/NVIDIA/cccl

License: BSD-3-Clause · Stargazers: 0 · Issues: 0

NVIDIA_SGEMM_PRACTICE

Step-by-step optimization of CUDA SGEMM

Language: Cuda · Stargazers: 0 · Issues: 0

torch-int

This repository contains integer operators on GPUs for PyTorch.

License: MIT · Stargazers: 0 · Issues: 0
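
The pattern behind such INT8 GPU operators: quantize both operands to int8, run the matmul with int32 accumulation, and apply a single floating-point rescale at the end. A toy numpy sketch of per-tensor symmetric W8A8 (scales and shapes illustrative):

```python
import numpy as np

def quantize_per_tensor(x):
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

a = np.random.randn(4, 8).astype(np.float32)
b = np.random.randn(8, 4).astype(np.float32)
qa, sa = quantize_per_tensor(a)
qb, sb = quantize_per_tensor(b)
c_int = qa.astype(np.int32) @ qb.astype(np.int32)  # int32 accumulation
c = c_int.astype(np.float32) * (sa * sb)           # one dequantize step
print(np.abs(c - a @ b).max())                     # quantization error
```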

LLMSpeculativeSampling

Fast inference from large language models via speculative decoding

Stargazers: 0 · Issues: 0
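
Speculative decoding lets a cheap draft model propose tokens that the target model verifies in one pass; the accept/reject rule below is what makes the combined procedure distributionally equivalent to sampling from the target alone. A single-token numpy sketch (p and q are target/draft probabilities over a toy vocabulary):

```python
import numpy as np

rng = np.random.default_rng(0)

def verify(p, q, x):
    # Accept the drafted token x with prob min(1, p[x]/q[x]); otherwise
    # resample from the residual max(p - q, 0), renormalized. The result
    # is an exact sample from p.
    if rng.random() < min(1.0, p[x] / q[x]):
        return x
    residual = np.maximum(p - q, 0)
    return rng.choice(len(p), p=residual / residual.sum())

p = np.array([0.6, 0.3, 0.1])   # target distribution
q = np.array([0.3, 0.5, 0.2])   # draft distribution
x = rng.choice(3, p=q)          # token proposed by the draft model
print(verify(p, q, x))
```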

vllm

A high-throughput and memory-efficient inference and serving engine for LLMs

Language: Python · License: Apache-2.0 · Stargazers: 0 · Issues: 0
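
For reference, offline inference looks roughly like this (hedged: follows vLLM's documented high-level Python API; the model id and sampling parameters are placeholders):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")            # any Hugging Face model id
params = SamplingParams(temperature=0.8, max_tokens=64)
for out in llm.generate(["What does PagedAttention do?"], params):
    print(out.outputs[0].text)
```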

DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.

License: Apache-2.0 · Stargazers: 0 · Issues: 0
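
Typical usage wraps an existing PyTorch model with a dict/JSON config; a minimal sketch (hedged: the config keys follow DeepSpeed's documented schema, values are placeholders):

```python
import torch
import deepspeed

model = torch.nn.Linear(1024, 1024)
ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "zero_optimization": {"stage": 2},  # partition optimizer state + gradients
}
# deepspeed.initialize returns (engine, optimizer, dataloader, lr_scheduler)
engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)
```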

Megatron-LM

Ongoing research training transformer models at scale

License: NOASSERTION · Stargazers: 0 · Issues: 0