SunshineZhang's repositories
accelerate
🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, with automatic mixed precision (including fp8) and easy-to-configure FSDP and DeepSpeed support
ANT-Quantization
LLM inference: OliVe
Awesome-LLM-Compression
Awesome LLM compression research papers and tools.
bitsandbytes
LLM: 8-bit CUDA functions for PyTorch
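As a rough illustration of what an 8-bit quantizer does, here is a pure-Python toy of symmetric absmax int8 quantization. This is only a sketch of the idea; bitsandbytes implements it (plus outlier handling) in fused CUDA kernels, and the function names below are made up for illustration.

```python
# Toy symmetric (absmax) int8 quantization: scale so the largest
# magnitude maps to 127, round to integers, dequantize by rescaling.

def quantize_absmax(xs):
    """Map floats to int8 codes via scale = absmax / 127."""
    scale = max(abs(v) for v in xs) / 127.0
    return [round(v / scale) for v in xs], scale

def dequantize(qs, scale):
    """Recover approximate floats from the int8 codes."""
    return [q * scale for q in qs]

codes, scale = quantize_absmax([1.0, -0.5, 0.25])
approx = dequantize(codes, scale)  # each value recovered to within one scale step
```

The round trip loses at most half a quantization step per value, which is why 8-bit weights are usually accurate enough for inference.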
clash
A rule-based tunnel in Go.
cuda_hgemm
Several optimization methods for half-precision general matrix multiplication (HGEMM) using tensor cores with the WMMA API and MMA PTX instructions.
data-parallel-CPP
Source code for 'Data Parallel C++: Mastering DPC++ for Programming of Heterogeneous Systems using C++ and SYCL' by James Reinders, Ben Ashbaugh, James Brodman, Michael Kinsner, John Pennycook, Xinmin Tian (Apress, 2020).
DeepSpeed
LLM: DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
dpctl
Python SYCL bindings and SYCL-based Python Array API library
FasterTransformer
Transformer-related optimizations, including BERT and GPT
FlexGen
LLM: FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU
llama.cpp
LLM inference in C/C++
llm-awq
LLM: AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
LLM-Pruner
[NeurIPS 2023] LLM-Pruner: On the Structural Pruning of Large Language Models. Support LLaMA, Llama-2, BLOOM, Vicuna, Baichuan, etc.
LLMBox
Large Language Models (2024 Renmin University of China edition; companion code resources)
llvm
Intel staging area for llvm.org contribution. Home for Intel LLVM-based projects.
lm-evaluation-harness
LLM: A framework for few-shot evaluation of autoregressive language models.
LMOps
General technology for enabling AI capabilities with LLMs and MLLMs
mlc-llm
Enables everyone to develop, optimize, and deploy AI models natively on their own devices.
neural-compressor
LLM: Provides unified APIs for SOTA model compression techniques, such as low-precision (INT8/INT4/FP4/NF4) quantization, sparsity, pruning, and knowledge distillation, on mainstream AI frameworks such as TensorFlow, PyTorch, and ONNX Runtime.
pybind11
Seamless operability between C++11 and Python
qlora
LLM: QLoRA: Efficient Finetuning of Quantized LLMs
QUIK
Repository for the QUIK project, enabling the use of 4-bit kernels for generative inference
smoothquant
LLM: [ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
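SmoothQuant's core trick is to migrate activation outliers into the weights with a per-channel scale, leaving the matrix product unchanged but making the activations easier to quantize. A minimal pure-Python sketch of that idea (shapes and names here are illustrative, not the repo's API):

```python
# Per-channel migration: s_j = max|X_j|^alpha / max|W_j|^(1 - alpha).
# Dividing activations by s and multiplying weight rows by s leaves
# x @ W unchanged while shrinking the activation outlier channel.

def smooth_scales(act_max, w_max, alpha=0.5):
    """Migration factor per input channel."""
    return [(a ** alpha) / (w ** (1 - alpha)) for a, w in zip(act_max, w_max)]

x = [8.0, 0.5]                  # one activation vector; channel 0 is an outlier
w = [[0.25, 0.0], [0.0, 2.0]]   # weight matrix, rows = input channels
s = smooth_scales([abs(v) for v in x],
                  [max(abs(v) for v in row) for row in w])

x_s = [v / sj for v, sj in zip(x, s)]                # smoothed activations
w_s = [[v * sj for v in row] for sj, row in zip(s, w)]  # compensated weights
# (x / s) @ (diag(s) W) == x @ W, but max|x_s| is far smaller than max|x|
```

After smoothing, both activations and weights have moderate ranges, so plain per-tensor int8 quantization works for both.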
SpQR
LLM: SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression
tabby
A terminal for a more modern age
TensorRT-LLM
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
wanda
LLM pruning: A simple and effective LLM pruning approach.
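Wanda's pruning score is simple enough to show in a few lines: score each weight by its magnitude times the L2 norm of its input feature over a few calibration samples, then drop the lowest scores per output row. A toy pure-Python version (the repo operates on PyTorch layers; these helper names are made up):

```python
# Wanda-style scoring: score_ij = |W_ij| * ||X_j||_2, where ||X_j||_2
# is the norm of input feature j across calibration samples.

def feature_norms(activations):
    """L2 norm of each input feature over the calibration samples."""
    n_feat = len(activations[0])
    return [sum(row[j] ** 2 for row in activations) ** 0.5
            for j in range(n_feat)]

def wanda_mask(weight_row, norms, keep):
    """Keep the `keep` highest-scoring weights in one output row."""
    scores = [abs(w) * n for w, n in zip(weight_row, norms)]
    threshold = sorted(scores, reverse=True)[keep - 1]
    return [1 if s >= threshold else 0 for s in scores]

acts = [[1.0, 0.1], [2.0, 0.1]]  # two calibration samples, two features
row = [0.5, 3.0]                 # one output row of the weight matrix
mask = wanda_mask(row, feature_norms(acts), keep=1)
```

Note that the large weight 3.0 is the one pruned here: its input feature is nearly always tiny, so it contributes little to the output, whereas plain magnitude pruning would have kept it.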
YuLan-Chat
Large Language Models (2024 Renmin University of China edition; companion code resources)