

liteLLM Proxy Server: 50+ LLM Models, Error Handling, Caching

Azure, Llama2, OpenAI, Claude, Hugging Face, Replicate Models


Deploy on Railway

What does the liteLLM proxy do?

  • Make /chat/completions requests for 50+ LLM models: Azure, OpenAI, Replicate, Anthropic, Hugging Face

    Example: for model, use claude-2, gpt-3.5, gpt-4, command-nightly, or stabilityai/stablecode-completion-alpha-3b-4k

    {
      "model": "replicate/llama-2-70b-chat:2c1608e18606fad2812020dc541930f2d0495ce32eee50074220b87300bc16e1",
      "messages": [
                      { 
                          "content": "Hello, whats the weather in San Francisco??",
                          "role": "user"
                      }
                  ]
    }
  • Consistent Input/Output Format

    • Call all models using the OpenAI format - completion(model, messages)
    • Text responses will always be available at ['choices'][0]['message']['content']
  • Error Handling Using Model Fallbacks (if GPT-4 fails, try llama2)

  • Logging - Log Requests, Responses and Errors to Supabase, Posthog, Mixpanel, Sentry, Helicone (any of the supported providers here: https://litellm.readthedocs.io/en/latest/advanced/)

Example: Logs sent to Supabase (screenshot)

  • Token Usage & Spend - Track Input + Completion tokens used + Spend/model
  • Caching - Implementation of Semantic Caching
  • Streaming & Async Support - Return generators to stream text responses
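
For the streaming bullet above, here is a minimal client-side sketch against a locally running proxy. It assumes the proxy forwards OpenAI-style server-sent-event chunks when "stream": true is set; the exact wire format of your deployment may differ, so treat this as an illustration rather than a reference client.

import json
import requests

# Assumed local proxy URL - adjust to your deployment
url = "http://localhost:5000/chat/completions"

payload = {
    "model": "gpt-3.5-turbo",
    "messages": [{"role": "user", "content": "Write a haiku about fog."}],
    "stream": True,
}

# Read the HTTP response incrementally and print content deltas as they arrive
with requests.post(url, json=payload, stream=True) as response:
    for line in response.iter_lines():
        if not line:
            continue
        text = line.decode("utf-8")
        if text.startswith("data: "):      # SSE framing, if present
            text = text[len("data: "):]
        if text.strip() == "[DONE]":       # OpenAI-style end-of-stream marker
            break
        chunk = json.loads(text)
        delta = chunk["choices"][0].get("delta", {})
        print(delta.get("content", ""), end="", flush=True)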

API Endpoints

/chat/completions (POST)

This endpoint is used to generate chat completions for 50+ supported LLM API models - llama2, GPT-4, Claude 2, etc.

Input

This API endpoint accepts a raw JSON body and expects the following inputs:

  • model (string, required): ID of the model to use for chat completions. See all supported models here: https://litellm.readthedocs.io/en/latest/supported/ - e.g. gpt-3.5-turbo, gpt-4, claude-2, command-nightly, stabilityai/stablecode-completion-alpha-3b-4k
  • messages (array, required): A list of messages representing the conversation context. Each message should have a role (system, user, assistant, or function), content (message text), and name (for function role).
  • Additional Optional parameters: temperature, functions, function_call, top_p, n, stream. See the full list of supported inputs here: https://litellm.readthedocs.io/en/latest/input/

Example JSON body

For claude-2

{
    "model": "claude-2",
    "messages": [
                    { 
                        "content": "Hello, whats the weather in San Francisco??",
                        "role": "user"
                    }
                ]
}
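
The optional parameters above go in the same request body. As a sketch, expressed as the Python payload dict used in the request example below (the temperature, top_p, and stream values here are illustrative, not documented defaults):

payload = {
    "model": "claude-2",
    "messages": [
        {"role": "user", "content": "Hello, whats the weather in San Francisco??"}
    ],
    "temperature": 0.7,   # optional: sampling temperature
    "top_p": 1,           # optional: nucleus sampling
    "stream": False       # optional: set True to stream the response
}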

Making an API request to the Proxy Server

import requests
import json

# TODO: use your URL 
url = "http://localhost:5000/chat/completions"

payload = json.dumps({
  "model": "gpt-3.5-turbo",
  "messages": [
    {
      "content": "Hello, whats the weather in San Francisco??",
      "role": "user"
    }
  ]
})
headers = {
  'Content-Type': 'application/json'
}
response = requests.request("POST", url, headers=headers, data=payload)
print(response.text)
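
Because every model answers in the OpenAI format shown below, the reply text can always be read from the same path:

data = response.json()
print(data["choices"][0]["message"]["content"])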

Output [Response Format]

All responses from the server are returned in the following format (for all LLM models). More info on the output format here: https://litellm.readthedocs.io/en/latest/output/

{
    "choices": [
        {
            "finish_reason": "stop",
            "index": 0,
            "message": {
                "content": "I'm sorry, but I don't have the capability to provide real-time weather information. However, you can easily check the weather in San Francisco by searching online or using a weather app on your phone.",
                "role": "assistant"
            }
        }
    ],
    "created": 1691790381,
    "id": "chatcmpl-7mUFZlOEgdohHRDx2UpYPRTejirzb",
    "model": "gpt-3.5-turbo-0613",
    "object": "chat.completion",
    "usage": {
        "completion_tokens": 41,
        "prompt_tokens": 16,
        "total_tokens": 57
    }
}
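
The usage block is what the Token Usage & Spend feature reports on; it can be read straight from the same response:

usage = response.json()["usage"]
print(f"prompt: {usage['prompt_tokens']}, completion: {usage['completion_tokens']}, total: {usage['total_tokens']}")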

Installation & Usage

Running Locally

  1. Clone liteLLM repository to your local machine:
    git clone https://github.com/BerriAI/liteLLM-proxy
    
  2. Install the required dependencies using pip
    pip install -r requirements.txt
    
  3. Set your LLM API keys (see the sketch below these steps)
    os.environ['OPENAI_API_KEY'] = "YOUR_API_KEY"
    or
    set OPENAI_API_KEY in your .env file
    
  4. Run the server:
    python main.py
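
Step 3, as a short Python sketch. Only OPENAI_API_KEY appears in this README; the other key name below (ANTHROPIC_API_KEY) is an assumption following the same pattern for the additional providers listed above - check the variable name your provider expects.

import os

# Set keys for whichever providers you plan to route requests to
os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"
os.environ["ANTHROPIC_API_KEY"] = "YOUR_ANTHROPIC_API_KEY"  # assumed name, see note above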
    

Deploying

  1. Quick Start: Deploy on Railway

    Deploy on Railway

  2. GCP, AWS, Azure: This project includes a Dockerfile, so you can build a Docker image and deploy it on your cloud provider of choice.

About

License: MIT

