There are 93 repositories under the llm-inference topic.
GPT4All: Run Local LLMs on Any Device. Open-source and available for commercial use.
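For a sense of the workflow, here is a minimal sketch of local generation with GPT4All's Python bindings; the model filename is an assumption, and GPT4All downloads the file on first use if it is not already cached.

```python
# Minimal local-generation sketch using the gpt4all Python bindings.
# The model filename is an assumption; gpt4all downloads it on first
# use if it is not already cached locally.
from gpt4all import GPT4All

model = GPT4All("Meta-Llama-3-8B-Instruct.Q4_0.gguf")  # assumed model file
with model.chat_session():
    reply = model.generate("Explain KV caching in one sentence.", max_tokens=128)
print(reply)
```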
Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
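For flavor, a minimal sketch of the core remote-task API that the AI libraries build on; the `score` function is a hypothetical stand-in for real inference work.

```python
# Sketch of Ray's core API: remote tasks fan work out across a cluster
# (or local cores), and ray.get collects the results.
import ray

ray.init()  # connect to an existing cluster, or start a local one

@ray.remote
def score(batch):
    # Hypothetical stand-in for inference on one shard of data.
    return sum(batch)

futures = [score.remote(list(range(i, i + 4))) for i in range(0, 16, 4)]
print(ray.get(futures))  # [6, 22, 38, 54]
```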
This project shares the technical principles behind large language models together with hands-on experience (LLM engineering and bringing LLM applications to production).
20+ high-performance LLMs with recipes to pretrain, finetune and deploy at scale.
Run any open-source LLMs, such as DeepSeek and Llama, as an OpenAI-compatible API endpoint in the cloud.
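Because the endpoint speaks the OpenAI protocol, any OpenAI client can call it. In the sketch below, the base URL, port, API key, and model id are assumptions to be replaced with your deployment's values.

```python
# Sketch of querying an OpenAI-compatible endpoint with the official
# openai client. Base URL, key, and model id are assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:3000/v1", api_key="na")  # assumed address
resp = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1",  # assumed model id
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
```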
OpenVINO™ is an open source toolkit for optimizing and deploying AI inference
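A minimal sketch of the usual OpenVINO flow (read a model, compile it for a device, run one inference); the model path and input shape are illustrative assumptions.

```python
# Sketch of the basic OpenVINO flow: read, compile, infer.
import numpy as np
import openvino as ov

core = ov.Core()
model = core.read_model("model.xml")         # assumed path to an IR model
compiled = core.compile_model(model, "CPU")  # or "GPU", "NPU", "AUTO"
result = compiled(np.zeros((1, 3, 224, 224), dtype=np.float32))  # assumed input shape
print(result[compiled.output(0)].shape)
```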
High-speed Large Language Model Serving for Local Deployment
The easiest way to serve AI apps and models - build model inference APIs, job queues, LLM apps, multi-model pipelines, and more!
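As a hedged skeleton, an inference API in BentoML's decorator style (1.2+) can look like the sketch below; the class, method, and stand-in model are assumptions.

```python
# Skeleton of a BentoML service (1.2+ decorator style). Serve it with
# `bentoml serve` to expose an HTTP inference API. The model here is a
# placeholder assumption.
import bentoml

@bentoml.service
class Summarizer:
    def __init__(self) -> None:
        # Load a real model once per worker; a trivial stand-in is used here.
        self.model = lambda text: text[:80]

    @bentoml.api
    def summarize(self, text: str) -> str:
        return self.model(text)
```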
🚀 A best-in-class mobile real-time conversational digital human. Supports local deployment and multimodal interaction (voice, text, facial expressions), with response latency under 1.5 seconds; suited to livestreaming, teaching, customer service, finance, and government scenarios with strict privacy and real-time requirements. Works out of the box and developer-friendly.
Superduper: End-to-end framework for building custom AI applications and agents.
Eko (Eko Keeps Operating) - Build production-ready agentic workflows with natural language - eko.fellou.ai
Standardized Distributed Generative and Predictive AI Inference Platform for Scalable, Multi-Framework Deployment on Kubernetes
📚A curated list of Awesome LLM/VLM Inference Papers with Codes: Flash-Attention, Paged-Attention, WINT8/4, Parallelism, etc.🎉
Open-source implementation of AlphaEvolve
FlashInfer: Kernel Library for LLM Serving
The smart edge and AI gateway for agents. Arch is a high-performance proxy server that handles the low-level work of building agents, such as applying guardrails, routing prompts to the right agent, and unifying access to LLMs. Natively designed to handle and process prompts, Arch helps you build agents faster.
Generative AI reference workflows optimized for accelerated infrastructure and microservice architecture.
Sparsity-aware deep learning inference runtime for CPUs
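A short sketch of DeepSparse's `Pipeline` interface; the task choice and model path are assumptions, and `model_path` can also point at a SparseZoo stub.

```python
# Sketch of DeepSparse's Pipeline interface; task and model path are
# assumptions.
from deepsparse import Pipeline

pipeline = Pipeline.create(
    task="text-classification",
    model_path="model.onnx",  # assumed path to a sparsified ONNX model
)
print(pipeline(sequences=["DeepSparse runs sparse models fast on CPUs."]))
```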
Run AI locally on phones and AI-native devices
Distributed LLM inference. Connect home devices into a powerful cluster to accelerate LLM inference; more devices mean faster inference.
Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads
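To illustrate the idea only (this is not the repo's code): the extra heads draft several future tokens in one step, and the base model verifies the draft, keeping the longest correct prefix, so each step yields at least one real token and often more. Every function below is a toy stand-in.

```python
# Toy illustration of Medusa-style drafting and verification.
def base_next(seq):
    # Stand-in for one expensive base-model decode step (deterministic toy).
    return (sum(seq) * 31 + 7) % 100

def draft(seq, k=3):
    # Stand-in for k Medusa heads proposing the next k tokens at once;
    # odd positions are deliberately wrong to exercise verification.
    guesses, s = [], list(seq)
    for i in range(k):
        nxt = base_next(s) if i % 2 == 0 else 0
        s.append(nxt)
        guesses.append(nxt)
    return guesses

seq = [1, 2, 3]
accepted = []
for g in draft(seq):                  # verify the draft left to right
    if g != base_next(seq + accepted):
        break
    accepted.append(g)
seq += accepted + [base_next(seq + accepted)]  # always gain >= 1 verified token
print(f"accepted {len(accepted)} drafted tokens -> {seq}")
```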
Code examples and resources for DBRX, a large language model developed by Databricks
⚡ Build your chatbot within minutes on your favorite device; offers SOTA compression techniques for LLMs; runs LLMs efficiently on Intel platforms ⚡
Run any Llama 2 model locally with a Gradio UI on GPU or CPU from anywhere (Linux/Windows/Mac). Use `llama2-wrapper` as your local Llama 2 backend for generative agents and apps.
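A sketch of the backend usage the README suggests; the constructor defaults and the `get_prompt` helper follow the project's documented interface, but treat the exact arguments as assumptions.

```python
# Sketch of using llama2-wrapper as a local Llama 2 backend. Exact
# constructor arguments are assumptions; defaults load a chat model.
from llama2_wrapper import LLAMA2_WRAPPER, get_prompt

llama2 = LLAMA2_WRAPPER()  # assumed: loads/downloads a default Llama 2 chat model
prompt = get_prompt("What does quantization change in LLM inference?")
print(llama2(prompt))
```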
Ultrafast serverless GPU inference, sandboxes, and background jobs
Lemonade helps users run local LLMs with the highest performance by configuring state-of-the-art inference engines for their NPUs and GPUs. Join our discord: https://discord.gg/5xXzkMu8Zk
Run local LLMs such as Llama, DeepSeek-Distill, Kokoro, and more inside your browser
A curated collection of top-tier penetration testing tools and productivity utilities across multiple domains. Join us to explore, contribute, and enhance your hacking toolkit!