inference

There are 84 repositories under the inference topic.

vllm-project / vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
amd blackwell cuda deepseek deepseek-v3 gpt gpt-oss inference kimi llama llm llm-serving model-serving moe openai pytorch qwen qwen3 tpu transformer
Language:Python 62461
ggml-org / whisper.cpp
Port of OpenAI's Whisper model in C/C++
inference openai speech-recognition speech-to-text transformer whisper
Language:C++ 44319
hpcaitech / ColossalAI
Making large AI models cheaper, faster and more accessible
ai big-model data-parallelism deep-learning distributed-computing foundation-models heterogeneous-training hpc inference large-scale model-parallelism pipeline-parallelism
Language:Python 41227
deepspeedai / DeepSpeed
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
billion-parameters compression data-parallelism deep-learning gpu inference machine-learning mixture-of-experts model-parallelism pipeline-parallelism pytorch trillion-parameters zero
Language:Python 40633
google-ai-edge / mediapipe
Cross-platform, customizable ML solutions for live and streaming media.
android audio-processing c-plus-plus calculator computer-vision deep-learning framework graph-based graph-framework inference machine-learning mediapipe mobile-development perception pipeline-framework stream-processing video-processing
Language:C++ 31851
Tencent / ncnn
ncnn is a high-performance neural network inference framework optimized for the mobile platform
android arm-neon artificial-intelligence caffe darknet deep-learning high-preformance inference ios keras mlir mxnet ncnn neural-network onnx pytorch riscv simd tensorflow vulkan
Language:C++ 22248
sgl-project / sglang
SGLang is a fast serving framework for large language models and vision language models.
blackwell cuda deepseek deepseek-r1 deepseek-v3 deepseek-v3-2 gpt-oss inference kimi llama llama3 llava llm llm-serving moe openai pytorch qwen3 transformer vlm
Language:Python 19974
SYSTRAN / faster-whisper
Faster Whisper transcription with CTranslate2
deep-learning inference openai quantization speech-recognition speech-to-text transformer whisper
Language:Python 18910
stas00 / ml-engineering
Machine Learning Engineering Open Book
ai inference large-language-models llm machine-learning machine-learning-engineering mlops pytorch scalability slurm training transformers
Language:Python 15640
gvergnaud / ts-pattern
🎨 The exhaustive Pattern Matching library for TypeScript, with smart type inference.
branching conditions exhaustive inference javascript matching pattern pattern-matching ts type-inference typescript
Language:TypeScript 14429
NVIDIA / TensorRT
NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
deep-learning gpu-acceleration inference nvidia tensorrt
Language:C++ 12339
aws / amazon-sagemaker-examples
Example 📓 Jupyter notebooks that demonstrate how to build, train, and deploy machine learning models using 🧠 Amazon SageMaker.
aws data-science deep-learning examples inference jupyter-notebook machine-learning mlops reinforcement-learning sagemaker training
Language:Jupyter Notebook 10796
huggingface / text-generation-inference
Large Language Model Text Generation Inference
bloom deep-learning falcon gpt inference nlp pytorch starcoder transformer
Language:Python 10628
triton-inference-server / server
The Triton Inference Server provides an optimized cloud and edge inferencing solution.
cloud datacenter deep-learning edge gpu inference machine-learning
Language:Python 9994
openvinotoolkit / openvino
OpenVINO™ is an open source toolkit for optimizing and deploying AI inference
inference deep-learning openvino ai computer-vision diffusion-models generative-ai llm-inference natural-language-processing nlp performance-boost speech-recognition stable-diffusion deploy-ai optimize-ai transformers yolo recommendation-system good-first-issue
Language:C++ 9181
xorbitsai / inference
Replace OpenAI GPT with another LLM in your app by changing a single line of code. Xinference gives you the freedom to use any LLM you need. With Xinference, you're empowered to run inference with any open-source language models, speech recognition models, and multimodal models, whether in the cloud, on-premises, or even on your laptop.
artificial-intelligence chatglm deployment flan-t5 gemma ggml glm4 inference llama llama3 llamacpp llm machine-learning mistral openai-api pytorch qwen vllm whisper wizardlm
Language:Python 8705
oumi-ai / oumi
Easily fine-tune, evaluate and deploy gpt-oss, Qwen3, DeepSeek-R1, or any open source LLM / VLM!
dpo evaluation fine-tuning gpt-oss gpt-oss-120b gpt-oss-20b inference llama llms sft slms vlms
Language:Python 8590
dusty-nv / jetson-inference
Hello AI World guide to deploying deep-learning inference networks and deep vision primitives with TensorRT and NVIDIA Jetson.
deep-learning inference computer-vision embedded image-recognition object-detection segmentation jetson jetson-tx1 jetson-tx2 jetson-xavier nvidia tensorrt digits caffe video-analytics robotics machine-learning jetson-nano jetson-xavier-nx
Language:C++ 8576
GeeeekExplorer / nano-vllm
Nano vLLM
deep-learning inference llm nlp pytorch transformer
Language:Python 8473
Linzaer / Ultra-Light-Fast-Generic-Face-Detector-1MB
💎 1MB lightweight face detection model
arm face-detection inference mnn ncnn
Language:Python 7454
gcanti / io-ts
Runtime type system for IO decoding/encoding
inference runtime types typescript validation
Language:TypeScript 6806
LMCache / LMCache
Supercharge Your LLM with the Fastest KV Cache Layer
amd cuda fast inference kv-cache llm pytorch rocm speed vllm
Language:Python 5918
Trusted-AI / adversarial-robustness-toolbox
Adversarial Robustness Toolbox (ART) - Python Library for Machine Learning Security - Evasion, Poisoning, Extraction, Inference - Red and Blue Teams
python attack adversarial-machine-learning poisoning trusted-ai artificial-intelligence extraction adversarial-attacks adversarial-examples evasion inference privacy ai trustworthy-ai red-team blue-team machine-learning
Language:Python 5638
superduper-io / superduper
Superduper: End-to-end framework for building custom AI applications and agents.
ai chatbot data database distributed-ml inference llm-inference llm-serving llmops ml mlops mongodb pretrained-models python pytorch rag semantic-search torch transformers vector-search
Language:Python 5222
argmaxinc / WhisperKit
On-device Speech Recognition for Apple Silicon
inference ios macos speech-recognition swift transformers visionos watchos whisper
Language:Swift 5175
OpenCSGs / csghub
CSGHub is a brand-new open-source platform for managing LLMs, developed by the OpenCSG team. It offers both open-source and on-premise/SaaS solutions, with features comparable to Hugging Face. Gain full control over the lifecycle of LLMs, datasets, and agents, with Python SDK compatibility with Hugging Face. Join us! ⭐️
ai huggingface llm management-system platform asset-management dataset deepseek deploy finetune git inference model prompt ray space
Language:Vue 5030
AutoGPTQ / AutoGPTQ
An easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm.
deep-learning inference large-language-models llms nlp pytorch quantization transformer transformers
Language:Python 4983
NVIDIA-AI-IOT / torch2trt
An easy to use PyTorch to TensorRT converter
classification inference jetson-nano jetson-tx2 jetson-xavier pytorch tensorrt
Language:Python 4826
tencentmusic / cube-studio
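cube studio is an open-source, cloud-native, one-stop AI platform for machine learning, deep learning, and large models. It covers the full MLOps algorithm pipeline, a compute rental platform, online notebook development, drag-and-drop pipeline orchestration of task flows, multi-node multi-GPU distributed training, hyperparameter search, inference serving, VGPU virtualization, edge computing, an annotation platform with automated labeling, SFT fine-tuning / reward model / reinforcement learning training for large models such as deepseek, multi-node large-model inference with vllm/ollama/mindie, a private knowledge base, and an AI model marketplace. It supports domestic (Chinese) CPU/GPU/NPU hardware and the Ascend ecosystem, RDMA, and distributed frameworks such as pytorch/tf/mxnet/deepspeed/paddle/colossalai/horovod/ray/volcano.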
ai aihub argo automl deepseek gpt inference kubeflow kubernetes llmops mlops notebook pipeline pytorch spark vgpu workflow
Language:Python 4668
Tencent / TNN
TNN: developed by Tencent Youtu Lab and Guangying Lab, a uniform deep learning inference framework for mobile, desktop, and server. TNN is distinguished by several outstanding features, including its cross-platform capability, high performance, model compression, and code pruning. Based on ncnn and Rapidnet, TNN further strengthens support and performance optimization for mobile devices, and also draws on the good extensibility and high performance of existing open source efforts. TNN has been deployed in multiple apps from Tencent, such as Mobile QQ, Weishi, Pitu, etc. Contributions are welcome: collaborate with us to make TNN a better framework.
coreml deep-learning face-detection hairsegmentaion inference mnn ncnn ocr openvino pytorch tengine tensorflow tensorrt
Language:C++ 4590
openvinotoolkit / open_model_zoo
Pre-trained Deep Learning models and demos (high quality and extremely fast)
models caffemodel demo tensorflow-models model-zoo model deep-learning-models cnn-model openvino inference onnx-models pytorch-models openvino-model-zoo openvino-models openvino-toolkit
Language:Python 4310
kvcache-ai / Mooncake
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.
disaggregation inference kvcache llm rdma sglang vllm
Language:C++ 4236
OpenNMT / CTranslate2
Fast inference engine for Transformer models
neural-machine-translation cpp mkl quantization cuda thrust opennmt deep-neural-networks openmp onednn intrinsics avx2 avx parallel-computing gemm neon transformer-models machine-translation deep-learning inference
Language:C++ 4121
typedb / typedb
TypeDB: the power of programming, in your database
database inference knowledge-base knowledge-representation logic polymorphic polymorphism reasoning strongly-typed type-system typedb typeql
Language:Rust 4093
gpustack / gpustack
Simple, scalable AI model deployment on GPU clusters
ascend cuda deepseek distributed-inference genai inference llama llamacpp llm maas metal openai qwen rocm vllm mindie llm-inference llm-serving local-ai heterogeneous-cluster
Language:Python 3969
PaddlePaddle / FastDeploy
High-performance Inference and Deployment Toolkit for LLMs and VLMs based on PaddlePaddle
ernie ernie-45 ernie-45-vl inference llm llm-serving openai serving vllm
Language:Python 3554
