There are 32 repositories under the inference-optimization topic.
High-efficiency floating-point neural network inference operators for mobile, server, and Web
BladeDISC is an end-to-end DynamIc Shape Compiler project for machine learning workloads.
Everything you need to know about LLM inference
[MLSys 2021] IOS: Inter-Operator Scheduler for CNN Acceleration
Batch normalization fusion for PyTorch. This repository is archived and no longer maintained.
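For context, batch-norm folding rewrites an eval-mode Conv2d + BatchNorm2d pair as a single convolution using the standard folding identity. A minimal sketch of that identity (generic PyTorch, not the archived repo's code):

```python
import torch
import torch.nn as nn

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold an eval-mode BatchNorm2d into the preceding Conv2d:
    W' = W * gamma / sqrt(var + eps), b' = (b - mean) * scale + beta."""
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding,
                      dilation=conv.dilation, groups=conv.groups, bias=True)
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)  # per-channel gamma / std
    fused.weight.data = conv.weight.data * scale.reshape(-1, 1, 1, 1)
    bias = conv.bias.data if conv.bias is not None else torch.zeros_like(bn.running_mean)
    fused.bias.data = (bias - bn.running_mean) * scale + bn.bias.data
    return fused
```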
Optimize the layer structure of a Keras model to reduce computation time
A set of tools to make your life easier with TensorRT and ONNX Runtime. This repo is designed for YOLOv3
Official Repo for SparseLLM: Global Pruning of LLMs (NeurIPS 2024)
[CVPR 2025] DivPrune: Diversity-based Visual Token Pruning for Large Multimodal Models
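DivPrune casts visual-token pruning as a max-min diversity problem. A sketch of that idea as greedy farthest-point selection over normalized token embeddings (not the paper's exact algorithm):

```python
import torch
import torch.nn.functional as F

def divprune_select(tokens, keep):
    """Greedy max-min (farthest-point) selection: keep a subset of visual
    tokens that are maximally spread out in embedding space."""
    t = F.normalize(tokens, dim=-1)                 # (N, d) unit vectors
    chosen = [0]                                    # seed with the first token
    dist = 1 - t @ t[0]                             # cosine distance to the chosen set
    for _ in range(keep - 1):
        nxt = int(dist.argmax())                    # farthest token from the set
        chosen.append(nxt)
        dist = torch.minimum(dist, 1 - t @ t[nxt])  # update nearest-chosen distance
    return tokens[sorted(chosen)]
```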
Learn the ins and outs of efficiently serving Large Language Models (LLMs). Dive into optimization techniques, including KV caching and Low Rank Adapters (LoRA), and gain hands-on experience with Predibase's LoRAX framework inference server.
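As background for the KV-caching material, here is a toy per-layer cache that appends each decode step's keys and values so past tokens are never re-projected (illustrative only, not LoRAX's implementation):

```python
import torch

class KVCache:
    """Toy decode-time cache for one attention layer."""
    def __init__(self):
        self.k = self.v = None  # (batch, heads, seq, head_dim)

    def update(self, k_new, v_new):
        # Append the newest token's keys/values; attention then reads the
        # full cache instead of recomputing projections for earlier tokens.
        self.k = k_new if self.k is None else torch.cat([self.k, k_new], dim=2)
        self.v = v_new if self.v is None else torch.cat([self.v, v_new], dim=2)
        return self.k, self.v
```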
Optimizing Monocular Depth Estimation with TensorRT: Model Conversion, Inference Acceleration, and 3D Reconstruction
Accelerating LLM inference with techniques like speculative decoding, quantization, and kernel fusion, focusing on implementing state-of-the-art research papers.
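A greedy-verification sketch of one speculative decoding step, assuming HuggingFace-style models whose forward pass returns `.logits` (the published method verifies via rejection sampling; `target` and `draft` are placeholder callables):

```python
import torch

@torch.no_grad()
def speculative_step(target, draft, ids, k=4):
    """One greedy speculative-decoding step, batch size 1 assumed:
    the draft model proposes k tokens; the target scores them all in a
    single forward pass and keeps the longest matching prefix, plus the
    target's correction when a mismatch occurs."""
    proposal = ids
    for _ in range(k):  # cheap autoregressive drafting
        logits = draft(proposal).logits[:, -1]
        proposal = torch.cat([proposal, logits.argmax(-1, keepdim=True)], dim=1)
    n = ids.shape[1]
    # One target pass scores all k drafted positions at once.
    tgt = target(proposal).logits[:, n - 1:-1].argmax(-1)   # (1, k)
    drafted = proposal[:, n:]                               # (1, k)
    match = int((tgt == drafted).cumprod(-1).sum())         # accepted prefix length
    return torch.cat([ids, drafted[:, :match], tgt[:, match:match + 1]], dim=1)
```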
A template for getting started writing code using GGML
Faster YOLOv8 inference: optimize and export YOLOv8 models for faster inference using OpenVINO and NumPy
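Through the Ultralytics API the export itself is a one-liner; a minimal sketch (the checkpoint and image names are placeholders):

```python
from ultralytics import YOLO

model = YOLO("yolov8n.pt")                # pretrained checkpoint
ov_dir = model.export(format="openvino")  # writes an OpenVINO IR directory
results = YOLO(ov_dir)("bus.jpg")         # run inference with the exported model
```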
Dynamic Attention Mask (DAM) generates adaptive sparse attention masks per layer and head for Transformer models, enabling long-context inference with lower compute and memory overhead without fine-tuning.
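An illustrative sparse-attention routine in that spirit (a dense-mask sketch kept simple for clarity; DAM's actual per-layer, per-head mask generation differs):

```python
import torch
import torch.nn.functional as F

def sparse_causal_attention(q, k, v, keep=64, window=16):
    """Sketch only: each query attends to a local causal window plus its
    `keep` highest-scoring earlier keys, instead of the full context."""
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5   # (B, H, T, T)
    T = scores.shape[-1]
    i = torch.arange(T).unsqueeze(-1)                       # query positions
    j = torch.arange(T)                                     # key positions
    scores = scores.masked_fill(j > i, float("-inf"))       # causal mask
    mask = ((j <= i) & (j > i - window)).expand(scores.shape).clone()
    topk = scores.topk(min(keep, T), dim=-1).indices        # salient keys
    mask.scatter_(-1, topk, True)
    probs = F.softmax(scores.masked_fill(~mask, float("-inf")), dim=-1)
    return probs @ v
```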
LLM-Rank: A graph-theoretical approach to structured pruning of large language models based on weighted PageRank centrality, as introduced in the related paper.
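The same centrality machinery underlies MLP-Rank further down this list. A minimal power-iteration sketch over a layer's absolute weight matrix (the papers' exact graph constructions may differ):

```python
import numpy as np

def pagerank(adj, d=0.85, iters=100):
    """Weighted PageRank by power iteration on a column-normalized
    adjacency matrix; low-scoring units are candidates for pruning."""
    A = np.abs(adj)
    col = A.sum(axis=0, keepdims=True)
    P = A / np.where(col == 0, 1.0, col)   # column-stochastic transitions
    n = A.shape[0]
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = (1 - d) / n + d * (P @ r)      # damped random-walk update
    return r

# e.g. prune the rows of a weight matrix whose scores fall in the bottom 20%
```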
A constrained expectation-maximization algorithm for feasible graph inference.
Your AI Catalyst: inference backend to maximize your model's inference performance
Multimodal-OCR3 is an advanced Optical Character Recognition (OCR) application that leverages multiple state-of-the-art multimodal models to extract text from images.
TensorRT in Practice: Model Conversion, Extension, and Advanced Inference Optimization
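Both TensorRT entries above follow the same conversion path; a sketch of the ONNX-to-engine build using the TensorRT 8.x Python API:

```python
import tensorrt as trt

def build_engine(onnx_path, fp16=True):
    """Parse an ONNX model and build a serialized TensorRT engine."""
    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    flags = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    network = builder.create_network(flags)
    parser = trt.OnnxParser(network, logger)
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            raise RuntimeError(parser.get_error(0))
    config = builder.create_builder_config()
    if fp16 and builder.platform_has_fast_fp16:
        config.set_flag(trt.BuilderFlag.FP16)  # allow half-precision kernels
    return builder.build_serialized_network(network, config)
```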
Optimized CUDA Kernels for Fast MobileNetV2 Inference
Official implementation of "SentenceKV: Efficient LLM Inference via Sentence-Level Semantic KV Caching" (COLM 2025). A novel KV cache compression method that organizes cache at sentence level using semantic similarity.
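A heavily simplified sketch of the sentence-level selection idea (centroid keys per sentence, cosine retrieval against the current query; not the paper's algorithm):

```python
import torch
import torch.nn.functional as F

def select_sentences(q, sent_keys, budget=4):
    """Rank cached sentences by similarity between the current query vector
    and each sentence's mean key; only the winners' full KV blocks are kept."""
    centroids = F.normalize(torch.stack([k.mean(dim=0) for k in sent_keys]), dim=-1)
    scores = centroids @ F.normalize(q, dim=-1)          # (num_sentences,)
    return scores.topk(min(budget, len(sent_keys))).indices.sort().values
```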
MLP-Rank: A graph-theoretical approach to structured pruning of deep neural networks based on weighted PageRank centrality, as introduced in the related thesis.
A comprehensive toolkit for benchmarking, optimizing, and deploying local Large Language Models. Includes performance testing tools, optimized configurations for CPU/GPU/hybrid setups, and detailed guides to maximize LLM performance on your hardware.
Batch Partitioning for Multi-PE Inference with TVM (2020)
YOLOv8 object detection
Super Ollama Load Balancer - Performance-aware routing for distributed Ollama deployments with Ray, Dask, and adaptive metrics
PyTorch Mobile: examples of usage in Android applications
MIVisionX Python Inference Analyzer uses pre-trained ONNX/NNEF/Caffe models to analyze inference results and summarize individual image results
This repo integrates DyCoke's token compression method with VLMs such as Gemma3 and InternVL3
An interface for running inference with TensorRT engines, along with an example using a YOLOv4 engine.
Leveraging torch.compile to accelerate cross-encoder inference
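A minimal sketch of that approach (the checkpoint name is one public example; padding to a fixed length avoids recompilation when batch shapes vary):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "cross-encoder/ms-marco-MiniLM-L-6-v2"   # example reranker checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name).eval().cuda()
model = torch.compile(model)  # first batch triggers compilation; later batches reuse it

pairs = tok(["what is kv caching?"] * 8,
            ["KV caching stores per-layer attention keys and values."] * 8,
            padding="max_length", max_length=128, truncation=True,
            return_tensors="pt").to("cuda")
with torch.inference_mode():
    scores = model(**pairs).logits.squeeze(-1)  # one relevance score per pair
```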