hsarfraz / llm-Inference

Setting up LLM inference in on-premise environments


Large Language Models (LLMs) Inference

Setting up LLM inference services within data centers and/or on-premise environments.

Large language models (LLMs) are a powerful tool with the potential to revolutionize a wide range of industries. However, deploying and managing LLMs on-premise can be a complex and challenging task. This repo provides ready-to-deploy configuration and Python code for setting up LLM inference servers, including a REST API and a web interface for chatting with LLM models. The implementation is based on Docker containers.
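As a rough illustration of what the chat REST endpoint on the inference tier can look like, here is a minimal FastAPI sketch. The `/chat` path, request fields, and `generate()` placeholder are assumptions for illustration and are not the repo's actual API.

```python
# Hypothetical sketch of a chat endpoint on the backend inference server.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class ChatRequest(BaseModel):
    prompt: str
    max_tokens: int = 256

def generate(prompt: str, max_tokens: int) -> str:
    # Placeholder: a real backend would run the loaded LLM here.
    return f"(model output for: {prompt[:40]})"

@app.post("/chat")
def chat(req: ChatRequest) -> dict:
    # Return the model's completion as JSON to the web tier.
    return {"response": generate(req.prompt, req.max_tokens)}
```

Such a server would typically be started with uvicorn inside the backend container and exposed only on the internal network.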

The focus is primarily on runtime inference; fine-tuning and training of LLMs are out of scope. Model serving covers the original models and/or their quantized versions.
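As one example of serving a quantized model, the sketch below loads a GGUF file with llama-cpp-python. The runtime, model filename, and parameters are assumptions and may differ from what the repo actually uses.

```python
# Illustrative only: loading and querying a locally stored quantized model.
from llama_cpp import Llama

# Hypothetical path to a quantized GGUF model file on the inference server.
llm = Llama(model_path="/models/llama-2-7b-chat.Q4_K_M.gguf", n_ctx=2048)

result = llm("Explain on-premise LLM inference in one sentence.", max_tokens=64)
print(result["choices"][0]["text"])
```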

Three-Tier Architecture - LLM Inference:

A three-tier architecture is used for the on-premise deployment of LLM inference. This architecture allows greater flexibility and agility. It is assumed that the on-premise hosting infrastructure sits behind firewalls with no outbound connectivity to the internet, in line with security policies. The three tiers are:

  1. Backend LLM inference server
  2. Web application server
  3. Front-end using a web browser

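To make the data flow between the tiers concrete, the following sketch shows the middle tier: a web application server that forwards chat requests from the browser to the backend inference server. The container hostname, port, and routes are assumptions, not taken from the repo.

```python
# Minimal sketch of the web application tier forwarding requests to the
# backend inference server over the internal container network.
from flask import Flask, request, jsonify
import requests

app = Flask(__name__)

# Hypothetical address of the backend inference container; no outbound
# internet access is needed, traffic stays on the internal network.
BACKEND_URL = "http://llm-backend:8000/chat"

@app.post("/api/chat")
def chat():
    payload = request.get_json()
    # Forward the browser's prompt to the inference tier and relay the reply.
    resp = requests.post(BACKEND_URL, json=payload, timeout=120)
    return jsonify(resp.json())
```

The front-end tier is then just a browser page that posts to `/api/chat` on the web application server.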


License: GNU General Public License v3.0


Languages

Python 90.6%, Dockerfile 9.4%