There are 93 repositories under the llm-inference topic.
GPT4All: Run Local LLMs on Any Device. Open-source and available for commercial use.
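For a sense of the workflow, here is a minimal sketch of local generation with GPT4All's Python bindings; the model filename is an assumption, and GPT4All downloads the file on first use if it is not already cached.

```python
# Minimal local-generation sketch using the gpt4all Python bindings.
# The model filename is an assumption; gpt4all downloads it on first
# use if it is not already cached locally.
from gpt4all import GPT4All

model = GPT4All("Meta-Llama-3-8B-Instruct.Q4_0.gguf")  # assumed model file
with model.chat_session():
    reply = model.generate("Explain KV caching in one sentence.", max_tokens=128)
print(reply)
```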
Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
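For flavor, a minimal sketch of the core remote-task API that the AI libraries build on; the `score` function is a hypothetical stand-in for real inference work.

```python
# Sketch of Ray's core API: remote tasks fan work out across a cluster
# (or local cores), and ray.get collects the results.
import ray

ray.init()  # connect to an existing cluster, or start a local one

@ray.remote
def score(batch):
    # Hypothetical stand-in for inference on one shard of data.
    return sum(batch)

futures = [score.remote(list(range(i, i + 4))) for i in range(0, 16, 4)]
print(ray.get(futures))  # [6, 22, 38, 54]
```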
This project shares the technical principles behind large language models together with hands-on experience (LLM engineering and bringing LLM applications to production).
20+ high-performance LLMs with recipes to pretrain, finetune and deploy at scale.
Run any open-source LLMs, such as DeepSeek and Llama, as an OpenAI-compatible API endpoint in the cloud.
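Because the endpoint speaks the OpenAI protocol, any OpenAI client can call it. In the sketch below, the base URL, port, API key, and model id are assumptions to be replaced with your deployment's values.

```python
# Sketch of querying an OpenAI-compatible endpoint with the official
# openai client. Base URL, key, and model id are assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:3000/v1", api_key="na")  # assumed address
resp = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1",  # assumed model id
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
```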
OpenVINO™ is an open source toolkit for optimizing and deploying AI inference
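A minimal sketch of the usual OpenVINO flow (read a model, compile it for a device, run one inference); the model path and input shape are illustrative assumptions.

```python
# Sketch of the basic OpenVINO flow: read, compile, infer.
import numpy as np
import openvino as ov

core = ov.Core()
model = core.read_model("model.xml")         # assumed path to an IR model
compiled = core.compile_model(model, "CPU")  # or "GPU", "NPU", "AUTO"
result = compiled(np.zeros((1, 3, 224, 224), dtype=np.float32))  # assumed input shape
print(result[compiled.output(0)].shape)
```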
High-speed Large Language Model Serving for Local Deployment
The easiest way to serve AI apps and models - build model inference APIs, job queues, LLM apps, multi-model pipelines, and more!
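As a hedged skeleton, an inference API in BentoML's decorator style (1.2+) can look like the sketch below; the class, method, and stand-in model are assumptions.

```python
# Skeleton of a BentoML service (1.2+ decorator style). Serve it with
# `bentoml serve` to expose an HTTP inference API. The model here is a
# placeholder assumption.
import bentoml

@bentoml.service
class Summarizer:
    def __init__(self) -> None:
        # Load a real model once per worker; a trivial stand-in is used here.
        self.model = lambda text: text[:80]

    @bentoml.api
    def summarize(self, text: str) -> str:
        return self.model(text)
```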
🚀 A best-in-class mobile real-time conversational digital human. Supports local deployment and multimodal interaction (voice, text, facial expressions), with response latency under 1.5 seconds; suited to livestreaming, teaching, customer service, finance, and government scenarios with strict privacy and real-time requirements. Works out of the box and developer-friendly.
Superduper: End-to-end framework for building custom AI applications and agents.
Eko (Eko Keeps Operating) - Build production-ready agentic workflows with natural language - eko.fellou.ai
Standardized Distributed Generative and Predictive AI Inference Platform for Scalable, Multi-Framework Deployment on Kubernetes
📚A curated list of Awesome LLM/VLM Inference Papers with Codes: Flash-Attention, Paged-Attention, WINT8/4, Parallelism, etc.🎉
Open-source implementation of AlphaEvolve
FlashInfer: Kernel Library for LLM Serving
The smart edge and AI gateway for agents. Arch is a high-performance proxy server that handles the low-level work of building agents, such as applying guardrails, routing prompts to the right agent, and unifying access to LLMs. Natively designed to handle and process prompts, Arch helps you build agents faster.
Generative AI reference workflows optimized for accelerated infrastructure and microservice architecture.
Sparsity-aware deep learning inference runtime for CPUs
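A short sketch of DeepSparse's `Pipeline` interface; the task choice and model path are assumptions, and `model_path` can also point at a SparseZoo stub.

```python
# Sketch of DeepSparse's Pipeline interface; task and model path are
# assumptions.
from deepsparse import Pipeline

pipeline = Pipeline.create(
    task="text-classification",
    model_path="model.onnx",  # assumed path to a sparsified ONNX model
)
print(pipeline(sequences=["DeepSparse runs sparse models fast on CPUs."]))
```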
Run AI locally on phones and AI-native devices
Distributed LLM inference. Connect home devices into a powerful cluster to accelerate LLM inference; more devices mean faster inference.
Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads
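To illustrate the idea only (this is not the repo's code): the extra heads draft several future tokens in one step, and the base model verifies the draft, keeping the longest correct prefix, so each step yields at least one real token and often more. Every function below is a toy stand-in.

```python
# Toy illustration of Medusa-style drafting and verification.
def base_next(seq):
    # Stand-in for one expensive base-model decode step (deterministic toy).
    return (sum(seq) * 31 + 7) % 100

def draft(seq, k=3):
    # Stand-in for k Medusa heads proposing the next k tokens at once;
    # odd positions are deliberately wrong to exercise verification.
    guesses, s = [], list(seq)
    for i in range(k):
        nxt = base_next(s) if i % 2 == 0 else 0
        s.append(nxt)
        guesses.append(nxt)
    return guesses

seq = [1, 2, 3]
accepted = []
for g in draft(seq):                  # verify the draft left to right
    if g != base_next(seq + accepted):
        break
    accepted.append(g)
seq += accepted + [base_next(seq + accepted)]  # always gain >= 1 verified token
print(f"accepted {len(accepted)} drafted tokens -> {seq}")
```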
Code examples and resources for DBRX, a large language model developed by Databricks
⚡ Build your chatbot within minutes on your favorite device; offers SOTA compression techniques for LLMs; runs LLMs efficiently on Intel platforms ⚡
Run any Llama 2 model locally with a Gradio UI on GPU or CPU from anywhere (Linux/Windows/Mac). Use `llama2-wrapper` as your local Llama 2 backend for generative agents and apps.
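A sketch of the backend usage the README suggests; the constructor defaults and the `get_prompt` helper follow the project's documented interface, but treat the exact arguments as assumptions.

```python
# Sketch of using llama2-wrapper as a local Llama 2 backend. Exact
# constructor arguments are assumptions; defaults load a chat model.
from llama2_wrapper import LLAMA2_WRAPPER, get_prompt

llama2 = LLAMA2_WRAPPER()  # assumed: loads/downloads a default Llama 2 chat model
prompt = get_prompt("What does quantization change in LLM inference?")
print(llama2(prompt))
```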
Ultrafast serverless GPU inference, sandboxes, and background jobs
Lemonade helps users run local LLMs with the highest performance by configuring state-of-the-art inference engines for their NPUs and GPUs. Join our discord: https://discord.gg/5xXzkMu8Zk
Run local LLMs such as Llama, DeepSeek-Distill, Kokoro, and more inside your browser
A curated collection of top-tier penetration testing tools and productivity utilities across multiple domains. Join us to explore, contribute, and enhance your hacking toolkit!