There are 32 repositories under the inference-optimization topic.
High-efficiency floating-point neural network inference operators for mobile, server, and Web
BladeDISC is an end-to-end DynamIc Shape Compiler project for machine learning workloads.
Everything you need to know about LLM inference
[MLSys 2021] IOS: Inter-Operator Scheduler for CNN Acceleration
Batch normalization fusion for PyTorch. This repository is archived and no longer maintained.
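For context, batch-norm folding rewrites an eval-mode Conv2d + BatchNorm2d pair as a single convolution using the standard folding identity. A minimal sketch of that identity (generic PyTorch, not the archived repo's code):

```python
import torch
import torch.nn as nn

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold an eval-mode BatchNorm2d into the preceding Conv2d:
    W' = W * gamma / sqrt(var + eps), b' = (b - mean) * scale + beta."""
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding,
                      dilation=conv.dilation, groups=conv.groups, bias=True)
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)  # per-channel gamma / std
    fused.weight.data = conv.weight.data * scale.reshape(-1, 1, 1, 1)
    bias = conv.bias.data if conv.bias is not None else torch.zeros_like(bn.running_mean)
    fused.bias.data = (bias - bn.running_mean) * scale + bn.bias.data
    return fused
```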
Optimize the layer structure of a Keras model to reduce computation time
A set of tools to make your life easier with TensorRT and ONNX Runtime. This repo is designed for YOLOv3
Official Repo for SparseLLM: Global Pruning of LLMs (NeurIPS 2024)
[CVPR 2025] DivPrune: Diversity-based Visual Token Pruning for Large Multimodal Models
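DivPrune casts visual-token pruning as a max-min diversity problem. A sketch of that idea as greedy farthest-point selection over normalized token embeddings (not the paper's exact algorithm):

```python
import torch
import torch.nn.functional as F

def divprune_select(tokens, keep):
    """Greedy max-min (farthest-point) selection: keep a subset of visual
    tokens that are maximally spread out in embedding space."""
    t = F.normalize(tokens, dim=-1)                 # (N, d) unit vectors
    chosen = [0]                                    # seed with the first token
    dist = 1 - t @ t[0]                             # cosine distance to the chosen set
    for _ in range(keep - 1):
        nxt = int(dist.argmax())                    # farthest token from the set
        chosen.append(nxt)
        dist = torch.minimum(dist, 1 - t @ t[nxt])  # update nearest-chosen distance
    return tokens[sorted(chosen)]
```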
Learn the ins and outs of efficiently serving Large Language Models (LLMs). Dive into optimization techniques, including KV caching and Low Rank Adapters (LoRA), and gain hands-on experience with Predibase's LoRAX framework inference server.
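As background for the KV-caching material, here is a toy per-layer cache that appends each decode step's keys and values so past tokens are never re-projected (illustrative only, not LoRAX's implementation):

```python
import torch

class KVCache:
    """Toy decode-time cache for one attention layer."""
    def __init__(self):
        self.k = self.v = None  # (batch, heads, seq, head_dim)

    def update(self, k_new, v_new):
        # Append the newest token's keys/values; attention then reads the
        # full cache instead of recomputing projections for earlier tokens.
        self.k = k_new if self.k is None else torch.cat([self.k, k_new], dim=2)
        self.v = v_new if self.v is None else torch.cat([self.v, v_new], dim=2)
        return self.k, self.v
```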
Optimizing Monocular Depth Estimation with TensorRT: Model Conversion, Inference Acceleration, and 3D Reconstruction
Accelerating LLM inference with techniques like speculative decoding, quantization, and kernel fusion, focusing on implementing state-of-the-art research papers.
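A greedy-verification sketch of one speculative decoding step, assuming HuggingFace-style models whose forward pass returns `.logits` (the published method verifies via rejection sampling; `target` and `draft` are placeholder callables):

```python
import torch

@torch.no_grad()
def speculative_step(target, draft, ids, k=4):
    """One greedy speculative-decoding step, batch size 1 assumed:
    the draft model proposes k tokens; the target scores them all in a
    single forward pass and keeps the longest matching prefix, plus the
    target's correction when a mismatch occurs."""
    proposal = ids
    for _ in range(k):  # cheap autoregressive drafting
        logits = draft(proposal).logits[:, -1]
        proposal = torch.cat([proposal, logits.argmax(-1, keepdim=True)], dim=1)
    n = ids.shape[1]
    # One target pass scores all k drafted positions at once.
    tgt = target(proposal).logits[:, n - 1:-1].argmax(-1)   # (1, k)
    drafted = proposal[:, n:]                               # (1, k)
    match = int((tgt == drafted).cumprod(-1).sum())         # accepted prefix length
    return torch.cat([ids, drafted[:, :match], tgt[:, match:match + 1]], dim=1)
```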
A template for getting started writing code using GGML
Faster YOLOv8 inference: optimize and export YOLOv8 models for faster inference using OpenVINO and NumPy
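Through the Ultralytics API the export itself is a one-liner; a minimal sketch (the checkpoint and image names are placeholders):

```python
from ultralytics import YOLO

model = YOLO("yolov8n.pt")                # pretrained checkpoint
ov_dir = model.export(format="openvino")  # writes an OpenVINO IR directory
results = YOLO(ov_dir)("bus.jpg")         # run inference with the exported model
```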
Dynamic Attention Mask (DAM) generates adaptive sparse attention masks per layer and head for Transformer models, enabling long-context inference with lower compute and memory overhead without fine-tuning.
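An illustrative sparse-attention routine in that spirit (a dense-mask sketch kept simple for clarity; DAM's actual per-layer, per-head mask generation differs):

```python
import torch
import torch.nn.functional as F

def sparse_causal_attention(q, k, v, keep=64, window=16):
    """Sketch only: each query attends to a local causal window plus its
    `keep` highest-scoring earlier keys, instead of the full context."""
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5   # (B, H, T, T)
    T = scores.shape[-1]
    i = torch.arange(T).unsqueeze(-1)                       # query positions
    j = torch.arange(T)                                     # key positions
    scores = scores.masked_fill(j > i, float("-inf"))       # causal mask
    mask = ((j <= i) & (j > i - window)).expand(scores.shape).clone()
    topk = scores.topk(min(keep, T), dim=-1).indices        # salient keys
    mask.scatter_(-1, topk, True)
    probs = F.softmax(scores.masked_fill(~mask, float("-inf")), dim=-1)
    return probs @ v
```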
LLM-Rank: A graph-theoretical approach to structured pruning of large language models based on weighted PageRank centrality, as introduced in the related paper.
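The same centrality machinery underlies MLP-Rank further down this list. A minimal power-iteration sketch over a layer's absolute weight matrix (the papers' exact graph constructions may differ):

```python
import numpy as np

def pagerank(adj, d=0.85, iters=100):
    """Weighted PageRank by power iteration on a column-normalized
    adjacency matrix; low-scoring units are candidates for pruning."""
    A = np.abs(adj)
    col = A.sum(axis=0, keepdims=True)
    P = A / np.where(col == 0, 1.0, col)   # column-stochastic transitions
    n = A.shape[0]
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = (1 - d) / n + d * (P @ r)      # damped random-walk update
    return r

# e.g. prune the rows of a weight matrix whose scores fall in the bottom 20%
```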
A constrained expectation-maximization algorithm for feasible graph inference.
Your AI Catalyst: inference backend to maximize your model's inference performance
Multimodal-OCR3 is an advanced Optical Character Recognition (OCR) application that leverages multiple state-of-the-art multimodal models to extract text from images.
TensorRT in Practice: Model Conversion, Extension, and Advanced Inference Optimization
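Both TensorRT entries above follow the same conversion path; a sketch of the ONNX-to-engine build using the TensorRT 8.x Python API:

```python
import tensorrt as trt

def build_engine(onnx_path, fp16=True):
    """Parse an ONNX model and build a serialized TensorRT engine."""
    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    flags = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    network = builder.create_network(flags)
    parser = trt.OnnxParser(network, logger)
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            raise RuntimeError(parser.get_error(0))
    config = builder.create_builder_config()
    if fp16 and builder.platform_has_fast_fp16:
        config.set_flag(trt.BuilderFlag.FP16)  # allow half-precision kernels
    return builder.build_serialized_network(network, config)
```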
Optimized CUDA Kernels for Fast MobileNetV2 Inference
Official implementation of "SentenceKV: Efficient LLM Inference via Sentence-Level Semantic KV Caching" (COLM 2025). A novel KV cache compression method that organizes cache at sentence level using semantic similarity.
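A heavily simplified sketch of the sentence-level selection idea (centroid keys per sentence, cosine retrieval against the current query; not the paper's algorithm):

```python
import torch
import torch.nn.functional as F

def select_sentences(q, sent_keys, budget=4):
    """Rank cached sentences by similarity between the current query vector
    and each sentence's mean key; only the winners' full KV blocks are kept."""
    centroids = F.normalize(torch.stack([k.mean(dim=0) for k in sent_keys]), dim=-1)
    scores = centroids @ F.normalize(q, dim=-1)          # (num_sentences,)
    return scores.topk(min(budget, len(sent_keys))).indices.sort().values
```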
MLP-Rank: A graph-theoretical approach to structured pruning of deep neural networks based on weighted PageRank centrality, as introduced in the related thesis.
A comprehensive toolkit for benchmarking, optimizing, and deploying local Large Language Models. Includes performance testing tools, optimized configurations for CPU/GPU/hybrid setups, and detailed guides to maximize LLM performance on your hardware.
Batch Partitioning for Multi-PE Inference with TVM (2020)
YOLOv8 object detection
Super Ollama Load Balancer - Performance-aware routing for distributed Ollama deployments with Ray, Dask, and adaptive metrics
PyTorch Mobile: examples of usage in Android applications
MIVisionX Python Inference Analyzer uses pre-trained ONNX/NNEF/Caffe models to analyze inference results and summarize individual image results
This repo integrates DyCoke's token compression method with VLMs such as Gemma3 and InternVL3
An interface for running inference with TensorRT engines, along with an example using a YOLOv4 engine.
Leveraging torch.compile to accelerate cross-encoder inference
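A minimal sketch of that approach (the checkpoint name is one public example; padding to a fixed length avoids recompilation when batch shapes vary):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "cross-encoder/ms-marco-MiniLM-L-6-v2"   # example reranker checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name).eval().cuda()
model = torch.compile(model)  # first batch triggers compilation; later batches reuse it

pairs = tok(["what is kv caching?"] * 8,
            ["KV caching stores per-layer attention keys and values."] * 8,
            padding="max_length", max_length=128, truncation=True,
            return_tensors="pt").to("cuda")
with torch.inference_mode():
    scores = model(**pairs).logits.squeeze(-1)  # one relevance score per pair
```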