There are 23 repositories under the vllm topic.
Scripts for fine-tuning Meta Llama3 with composable FSDP & PEFT methods, covering single- and multi-node GPUs. Supports default & custom datasets for applications such as summarization and Q&A. Supports a number of candidate inference solutions, such as HF TGI and vLLM, for local or cloud deployment. Demo apps to showcase Meta Llama3 for WhatsApp & Messenger.
Replace OpenAI GPT with another LLM in your app by changing a single line of code. Xinference gives you the freedom to use any LLM you need. With Xinference, you're empowered to run inference with any open-source language models, speech recognition models, and multimodal models, whether in the cloud, on-premises, or even on your laptop.
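The "single line of code" swap typically means re-pointing an OpenAI SDK client at the local server. A minimal sketch, assuming a Xinference-style OpenAI-compatible endpoint on localhost (the port, API key handling, and model name here are illustrative assumptions):

```python
from openai import OpenAI

# Point the official OpenAI client at a local OpenAI-compatible server
# instead of api.openai.com. base_url and model name are assumptions.
client = OpenAI(
    base_url="http://localhost:9997/v1",  # hypothetical local endpoint
    api_key="not-needed-locally",         # local servers typically ignore the key
)

response = client.chat.completions.create(
    model="llama-3-8b-instruct",  # assumed name of a model registered on the server
    messages=[{"role": "user", "content": "Summarize vLLM in one sentence."}],
)
print(response.choices[0].message.content)
```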
📖 A curated list of awesome LLM inference papers with code, covering TensorRT-LLM, vLLM, streaming-llm, AWQ, SmoothQuant, WINT8/4, continuous batching, FlashAttention, PagedAttention, etc.
Reasoning in Large Language Models: Papers and Resources, including Chain-of-Thought, Instruction-Tuning and Multimodality.
🔒 Enterprise-grade API gateway that helps you monitor and impose cost or rate limits per API key. Get fine-grained access control and monitoring per user, application, or environment. Supports OpenAI, Azure OpenAI, Anthropic, vLLM, and open-source LLMs.
Low latency JSON generation using LLMs ⚡️
The RunPod worker template for serving our large language model endpoints. Powered by vLLM.
An endpoint server for efficiently serving quantized open-source LLMs for code.
llm-inference is a platform for publishing and managing LLM inference, providing a wide range of out-of-the-box features for model deployment, such as a UI, RESTful API, auto-scaling, computing-resource management, monitoring, and more.
A simple service that integrates vLLM with Ray Serve for fast and scalable LLM serving.
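As a rough idea of what such an integration looks like (a sketch, not the repo's actual code), here is a minimal Ray Serve deployment wrapping a vLLM engine; the model name and resource settings are assumptions for illustration:

```python
from ray import serve
from vllm import LLM, SamplingParams

@serve.deployment(num_replicas=1, ray_actor_options={"num_gpus": 1})
class VLLMServer:
    def __init__(self):
        # Small model chosen only for illustration.
        self.llm = LLM(model="facebook/opt-125m")
        self.params = SamplingParams(max_tokens=128, temperature=0.7)

    async def __call__(self, request):
        # Ray Serve passes a Starlette request; expect a body like {"prompt": "..."}.
        prompt = (await request.json())["prompt"]
        outputs = self.llm.generate([prompt], self.params)
        return {"text": outputs[0].outputs[0].text}

app = VLLMServer.bind()
# serve.run(app)  # serves HTTP on port 8000 by default
```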
Dockerized LLM inference server with constrained output (JSON mode), built on top of vLLM and outlines. Faster, cheaper and without rate limits. Compare the quality and latency to your current LLM API provider.
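Constrained "JSON mode" decoding masks invalid tokens at each step, so the output is guaranteed to parse. A minimal sketch using the outlines library this server builds on; the model, schema, and exact API shape are version-dependent assumptions:

```python
from pydantic import BaseModel
import outlines

class Flight(BaseModel):
    origin: str
    destination: str
    seats: int

# Load a small model; the name is illustrative.
model = outlines.models.transformers("microsoft/phi-2")

# Build a generator whose sampling is constrained to match the Flight schema.
generator = outlines.generate.json(model, Flight)

flight = generator("Extract the flight: 'Two seats from SFO to JFK, please.'")
print(flight)  # e.g. Flight(origin='SFO', destination='JFK', seats=2)
```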
A production-ready REST API for vLLM.
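vLLM itself ships OpenAI-compatible routes, and querying any such REST front end looks roughly like this sketch (the endpoint and model name are assumptions):

```python
import requests

# Assumes a vLLM OpenAI-compatible server, e.g. started with `vllm serve <model>`,
# listening on localhost:8000.
resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "facebook/opt-125m",  # must match the model the server loaded
        "prompt": "vLLM serves LLMs by",
        "max_tokens": 32,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])
```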
Carbon Limiting Auto Tuning for Kubernetes
This repository demonstrates LLM execution on CPUs using packages like llamafile, emphasizing its low latency, high throughput, and cost-effectiveness for inference and serving.
Fully-featured, beautiful web interface for vLLM - built with NextJS.
Run code inference-only benchmarks quickly using vLLM
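A throughput micro-benchmark with vLLM's offline API takes only a few lines; this sketch (the model and workload are assumptions) times a batch of code completions:

```python
import time
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # small model for illustration
prompts = ["def fibonacci(n):"] * 64  # toy code-completion workload
params = SamplingParams(max_tokens=64, temperature=0.0)

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated} tokens in {elapsed:.1f}s ({generated / elapsed:.0f} tok/s)")
```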
Ready-to-deploy Docker image for Functionary LLM served as an OpenAI-Compatible API.
Pre-loaded LLMs served as an OpenAI-Compatible API via Docker images.
A simple implementation of UNet, because all the implementations I've seen are way too complicated.
Preserving entities through the integration of knowledge graphs, Llama 2, vLLM, and LangChain.
A Large Language Model based tool for generating human-like responses to natural language inputs on networks not connected to the internet.
Cog wrapper for cognitivecomputations/Wizard-Vicuna-13B-Uncensored