Borui Xu (BoruiXu)

Company: Shandong University

Borui Xu's starred repositories

self-llm

An open-source LLM usage guide (《开源大模型食用指南》): quick deployment of open-source large models in a Linux environment; a deployment tutorial better suited for beginners in China.

Language: Jupyter Notebook · License: Apache-2.0 · Stars: 5052

word-GPT-Plus

Word GPT Plus is a Word add-in that integrates the ChatGPT model into Microsoft Word. Both the official API and the web API are supported.

Language: Vue · License: MIT · Stars: 584

Chinese-LLaMA-Alpaca

Chinese LLaMA & Alpaca large language models, with local CPU/GPU training and deployment (Chinese LLaMA & Alpaca LLMs).

Language: Python · License: Apache-2.0 · Stars: 17720

Atom

[MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving

Language: Cuda · Stars: 195

mediapipe

Cross-platform, customizable ML solutions for live and streaming media.

Language: C++ · License: Apache-2.0 · Stars: 25876

qserve

QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving

Language: Python · License: Apache-2.0 · Stars: 276

GEAR

GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM

Language: Python · Stars: 107

calm

CUDA/Metal accelerated language model inference

Language: C · License: MIT · Stars: 326

fastllm

A pure C++ LLM acceleration library for all platforms, callable from Python; ChatGLM-6B-class models can reach 10000+ tokens/s on a single GPU; supports GLM, LLaMA, and MOSS base models; runs smoothly on mobile devices.

Language: C++ · License: Apache-2.0 · Stars: 3143

flash-llm

Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity

Language: Cuda · License: Apache-2.0 · Stars: 150

inferflow

Inferflow is an efficient and highly configurable inference engine for large language models (LLMs).

Language: C++ · License: MIT · Stars: 227

Awesome-LLM-Inference

📖 A curated list of awesome LLM inference papers with code: TensorRT-LLM, vLLM, streaming-llm, AWQ, SmoothQuant, WINT8/4, Continuous Batching, FlashAttention, PagedAttention, etc.

License: GPL-3.0 · Stars: 1667

mlc-llm

Universal LLM Deployment Engine with ML Compilation

Language: Python · License: Apache-2.0 · Stars: 17450

mixtral-offloading

Run Mixtral-8x7B models in Colab or on consumer desktops.

Language: Python · License: MIT · Stars: 2261

Edge-MoE

Edge-MoE: Memory-Efficient Multi-Task Vision Transformer Architecture with Task-level Sparsity via Mixture-of-Experts

Language: C++ · Stars: 73

Anima

33B Chinese LLM, DPO QLoRA, 100K context, AirLLM 70B inference on a single 4GB GPU.

Language: Jupyter Notebook · License: Apache-2.0 · Stars: 3396

LLM-Viewer

Analyze the inference of large language models (LLMs): computation, storage, transmission, and the hardware roofline model, all in a user-friendly interface.

Language: Python · License: MIT · Stars: 195

lightllm

LightLLM is a Python-based LLM (Large Language Model) inference and serving framework, notable for its lightweight design, easy scalability, and high-speed performance.

Language: Python · License: Apache-2.0 · Stars: 1938

streaming-llm

[ICLR 2024] Efficient Streaming Language Models with Attention Sinks

Language: Python · License: MIT · Stars: 6299

intel-extension-for-transformers

⚡ Build your chatbot within minutes on your favorite device; offers SOTA compression techniques for LLMs; runs LLMs efficiently on Intel platforms ⚡

Language: Python · License: Apache-2.0 · Stars: 2003

DeepSpeed-MII

MII makes low-latency and high-throughput inference possible, powered by DeepSpeed.

Language: Python · License: Apache-2.0 · Stars: 1707

FlexFlow

FlexFlow Serve: Low-Latency, High-Performance LLM Serving

Language: C++ · License: Apache-2.0 · Stars: 1562

lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.

Language: Python · License: Apache-2.0 · Stars: 2882

TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.

Language: C++ · License: Apache-2.0 · Stars: 7075

FlexGen

Running large language models on a single GPU for throughput-oriented scenarios.

Language: Python · License: Apache-2.0 · Stars: 9047

perf-book

The book "Performance Analysis and Tuning on Modern CPUs".

Language: TeX · License: CC0-1.0 · Stars: 1904

llama3

The official Meta Llama 3 GitHub site

Language: Python · License: NOASSERTION · Stars: 21795

Vitis-AI

Vitis AI is Xilinx’s development stack for AI inference on Xilinx hardware platforms, including both edge devices and Alveo cards.

Language: Python · License: Apache-2.0 · Stars: 1392