Weili17's starred repositories

ray-llm

RayLLM - LLMs on Ray

Language: Python · License: Apache-2.0 · Stars: 1208

gpustack

Manage GPU clusters for running LLMs

Language: Python · License: Apache-2.0 · Stars: 164

giantpandacv.com

www.giantpandacv.com

Language: Python · License: NOASSERTION · Stars: 147

ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.

Language: Python · License: Apache-2.0 · Stars: 32469
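
Ray's core abstraction turns ordinary Python functions into distributed tasks. A minimal sketch of that pattern (the function here is illustrative):

```python
import ray

ray.init()  # starts a local Ray runtime; connects to a cluster if one is configured

@ray.remote
def square(x):
    return x * x

# .remote() schedules tasks asynchronously and returns futures;
# ray.get() blocks until the results are available.
futures = [square.remote(i) for i in range(4)]
print(ray.get(futures))  # [0, 1, 4, 9]
```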

NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)

Language: Python · License: Apache-2.0 · Stars: 11150
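
A minimal sketch of loading one of NeMo's pretrained speech models, assuming the nemo_toolkit package is installed; the checkpoint name and audio path are placeholders:

```python
import nemo.collections.asr as nemo_asr

# Download a pretrained ASR checkpoint and transcribe a local audio file.
asr_model = nemo_asr.models.ASRModel.from_pretrained("stt_en_fastconformer_ctc_large")
print(asr_model.transcribe(["sample.wav"]))  # sample.wav is a placeholder path
```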

AutoAWQ

AutoAWQ implements the AWQ algorithm for 4-bit quantization, with about a 2x speedup during inference.

Language: Python · License: MIT · Stars: 1523
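
A minimal quantization sketch following the project's documented pattern, assuming the awq and transformers packages; the checkpoint, output directory, and quant_config values are illustrative:

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "mistralai/Mistral-7B-v0.1"  # placeholder checkpoint
quant_path = "mistral-7b-awq"             # placeholder output directory
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Calibrate, pack weights to 4-bit, and save the quantized model.
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```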

AutoGPTQ

An easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm.

Language: Python · License: MIT · Stars: 4189
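
A minimal sketch of the documented quantize-and-save flow, assuming the auto_gptq and transformers packages; the checkpoint and calibration text are illustrative:

```python
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

pretrained = "facebook/opt-125m"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(pretrained)

quantize_config = BaseQuantizeConfig(bits=4, group_size=128)
model = AutoGPTQForCausalLM.from_pretrained(pretrained, quantize_config)

# GPTQ calibrates on a small set of tokenized examples.
examples = [tokenizer("AutoGPTQ is an easy-to-use LLM quantization package.", return_tensors="pt")]
model.quantize(examples)
model.save_quantized("opt-125m-4bit-gptq")  # placeholder output directory
```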

smoothquant

[ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

Language: Python · License: MIT · Stars: 1131
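
The paper's core trick is to migrate quantization difficulty from activations to weights via a per-channel smoothing factor, leaving the layer's output mathematically unchanged:

```latex
Y = \left(X \operatorname{diag}(s)^{-1}\right)\left(\operatorname{diag}(s)\, W\right) = \hat{X}\hat{W},
\qquad
s_j = \frac{\max\left(\lvert X_j \rvert\right)^{\alpha}}{\max\left(\lvert W_j \rvert\right)^{1-\alpha}}
```

Here the hyperparameter alpha controls how much difficulty is shifted from activations to weights; the paper uses 0.5 as its default.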

sglang

SGLang is yet another fast serving framework for large language models and vision language models.

Language: Python · License: Apache-2.0 · Stars: 3781
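
A minimal sketch of SGLang's frontend language, assuming a running local SGLang server; the endpoint and max_tokens are illustrative:

```python
import sglang as sgl

# The runtime fills in each gen() call during execution.
@sgl.function
def qa(s, question):
    s += "Q: " + question + "\n"
    s += "A: " + sgl.gen("answer", max_tokens=64)

sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))  # placeholder endpoint
state = qa.run(question="What is the capital of France?")
print(state["answer"])
```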

TransformerEngine

A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilization in both training and inference.

Language: Python · License: Apache-2.0 · Stars: 1699
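
A minimal FP8 sketch following the library's quickstart pattern, assuming a GPU with FP8 support (Hopper or Ada) and the transformer_engine package; the layer sizes are illustrative:

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# Drop-in replacement for torch.nn.Linear that can run its GEMMs in FP8.
layer = te.Linear(768, 768, device="cuda")
x = torch.randn(16, 768, device="cuda")

fp8_recipe = recipe.DelayedScaling()  # default delayed-scaling FP8 recipe
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)
```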

llama-cpp-python

Python bindings for llama.cpp

Language: Python · License: MIT · Stars: 7350
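
A minimal completion sketch, assuming the llama_cpp package and a local GGUF model file; the model path is a placeholder:

```python
from llama_cpp import Llama

llm = Llama(model_path="./models/llama-2-7b.Q4_K_M.gguf")  # placeholder path
out = llm("Q: Name the planets in the solar system. A:", max_tokens=64, stop=["Q:"])
print(out["choices"][0]["text"])
```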

llama.cpp

Llama 2 inference

Language: C · License: MIT · Stars: 31

ollama

Get up and running with Llama 3.1, Mistral, Gemma 2, and other large language models.

Language: Go · License: MIT · Stars: 83097
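
Ollama serves a local REST API (port 11434 by default), so any HTTP client works. A minimal sketch, assuming the model has already been pulled; the model name is illustrative:

```python
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.1", "prompt": "Why is the sky blue?", "stream": False},
)
print(resp.json()["response"])
```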

step_into_llm

MindSpore online courses: Step into LLM

Language: Jupyter Notebook · License: Apache-2.0 · Stars: 389

CUDA_Programming

Code for the book 《CUDA编程基础与实践》 (CUDA Programming: Basics and Practice)

Language: Cuda · Stars: 73

CUDA-Programming

Sample codes for my CUDA programming book

Language: Cuda · License: GPL-3.0 · Stars: 1480

KuiperLLama

A hands-on, from-scratch implementation of an LLM inference framework

Language: C++ · Stars: 101

cuda-samples

Samples for CUDA developers demonstrating features in the CUDA Toolkit

Language: C · License: NOASSERTION · Stars: 5887

lightseq

LightSeq: A High Performance Library for Sequence Processing and Generation

Language: C++ · License: NOASSERTION · Stars: 3145

Learn-CUDA-Programming

Learn CUDA Programming, published by Packt

Language: Cuda · License: MIT · Stars: 960

lectures

Material for cuda-mode lectures

Language: Jupyter Notebook · License: Apache-2.0 · Stars: 2023

baby-llama2-chinese_cybertron

Train an LLM from scratch using a single 24 GB GPU

Language: Python · License: MIT · Stars: 43

llama.cpp

LLM inference in C/C++

Language: C++ · License: MIT · Stars: 62932

DeepSpeed-MII

MII makes low-latency and high-throughput inference possible, powered by DeepSpeed.

Language: Python · License: Apache-2.0 · Stars: 1796
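
A minimal non-persistent pipeline sketch following the project's README pattern, assuming the deepspeed-mii package and a GPU; the model name is illustrative:

```python
import mii

pipe = mii.pipeline("mistralai/Mistral-7B-v0.1")  # placeholder checkpoint
responses = pipe(["DeepSpeed is"], max_new_tokens=64)
print(responses[0])
```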

batch-prompting

[EMNLP 2023 Industry Track] A simple prompting approach that enables the LLMs to run inference in batches.

Language: Python · Stars: 63

SpeculativeDecodingPapers

📰 Must-read papers and blogs on Speculative Decoding ⚡️

License: Apache-2.0 · Stars: 291

long-context-attention

Sequence Parallel Attention for Long Context LLM Model Training and Inference

Language: Python · Stars: 244

SwiftSage

SwiftSage: A Generative Agent with Fast and Slow Thinking for Complex Interactive Tasks

Language: Python · Stars: 232

TurboTransformers

A fast and user-friendly runtime for Transformer inference (BERT, ALBERT, GPT-2, decoders, etc.) on CPU and GPU.

Language: C++ · License: NOASSERTION · Stars: 1459