💬🚀 LLM as a Chatbot Service

The purpose of this repository is to let people use many open-source, instruction-following fine-tuned LLMs as a chatbot service. The models currently in focus are LLaMA-based Alpaca, StableLM-based Alpaca, LLaMA-based Dolly, and Flan-based Alpaca. Because different models behave differently and require differently formed prompts, I made a very simple library, Ping Pong, for model-agnostic conversation and context management.

🔗 Demo link: will host demos soon (7B Alpaca, 13B Alpaca, 7B StableLM)

The easiest way to run this project is to use Colab. Just open the llm_as_chatbot_in_colab notebook in Colab (there is an "Open in Colab" button) and run every cell sequentially. With the standard GPU instance (T4), you can run the 7B and 13B models. With the premium GPU instance (A100 40GB), you can even run the 30B model! Note that the connection can be somewhat unstable, so I recommend using Colab for development purposes.
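As a rough sanity check on the instance sizes above, the weight memory can be estimated from the parameter count. The sketch below assumes weights are loaded in Int8 (about one byte per parameter), which is what the --multi-gpu flag description later in this README suggests is the default, and it uses the nominal sizes in the model names. It ignores activations, the attention cache, and adapter weights, so the numbers are a lower bound. A T4 has 16 GB of memory, which is why 13B is about the largest model that fits there.

```python
# Back-of-the-envelope estimate of GPU memory needed for model weights only.
# Assumption: parameter counts are the nominal sizes from the model names.
BYTES_PER_PARAM = {"int8": 1, "float16": 2}


def weight_memory_gb(params_in_billions: float, dtype: str) -> float:
    """Return approximate weight memory in GB (1 GB = 10^9 bytes)."""
    # (billions of params * 1e9) * bytes per param / 1e9 bytes per GB
    return params_in_billions * BYTES_PER_PARAM[dtype]


for size in (7, 13, 30):
    int8_gb = weight_memory_gb(size, "int8")
    fp16_gb = weight_memory_gb(size, "float16")
    print(f"{size}B: int8 ~{int8_gb:.0f} GB, float16 ~{fp16_gb:.0f} GB")
```

The float16 column also shows why multi-GPU mode, which uses float16 instead of Int8, needs roughly twice the total GPU memory.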

Mode

1. Stream generation mode: streaming mode handles multiple requests in an interleaved way using threads. For instance, if two users (A and B) are connected, A's request is handled, then B's, then A's again, and so on. This follows from the nature of streaming mode, which generates and yields tokens one at a time.

2. Batch generation mode: deprecated, but this mode will be revived soon.
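The interleaving described for stream generation mode can be illustrated with a small sketch. Note that this is not the repository's implementation: it swaps the threads for a single-threaded round-robin over Python generators so that the output order is deterministic, and `fake_token_stream` is a hypothetical stand-in for a model that yields tokens.

```python
from collections import deque
from typing import Deque, Iterator, List, Tuple


def fake_token_stream(reply: str) -> Iterator[str]:
    """Hypothetical stand-in for a model that yields one token at a time."""
    for token in reply.split():
        yield token


def interleave(requests: List[Tuple[str, Iterator[str]]]) -> List[Tuple[str, str]]:
    """Serve one token per request in turn until every stream is exhausted."""
    pending: Deque[Tuple[str, Iterator[str]]] = deque(requests)
    served: List[Tuple[str, str]] = []
    while pending:
        user, stream = pending.popleft()
        try:
            token = next(stream)
        except StopIteration:
            continue  # this user's reply is complete; drop it from the queue
        served.append((user, token))
        pending.append((user, stream))  # go to the back of the line
    return served


order = interleave([
    ("A", fake_token_stream("hello from A")),
    ("B", fake_token_stream("hi B")),
])
for user, token in order:
    print(user, token)
```

Each user receives a token in turn (A, B, A, B, ...) until their own reply finishes, which matches the A-then-B-then-A pattern described above.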

Context management

Different models might have different strategies for managing context, so if you want to know the exact strategy applied to each model, take a look at the chats directory. However, here are the basic ideas I came up with initially. I have found that long prompts eventually slow down generation a lot, so I thought prompts should be kept as short and concise as possible. In the previous version, I accumulated all past conversations, and that didn't go well.

In every turn of the conversation, the past N conversations are kept. Think of N as a hyperparameter. As an experiment, only the past 2-3 conversations are currently kept for all models.
In every turn of the conversation, the conversation is summarized or key information is extracted. The summarized information is given in every subsequent turn of the conversation.
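The two ideas above (keep only the last N exchanges, and carry a running summary forward) can be sketched as follows. This is an illustration under stated assumptions, not the Ping Pong library's API: `build_prompt`, the `User:`/`Assistant:` labels, and the summary line format are all hypothetical, and the summary is passed in as a plain string rather than generated by a model.

```python
from typing import List, Tuple


def build_prompt(
    summary: str,
    history: List[Tuple[str, str]],
    user_input: str,
    keep_last_n: int = 2,
) -> str:
    """Build a short prompt from a summary plus the last N exchanges."""
    # Guard against keep_last_n == 0: history[-0:] would return everything.
    recent = history[-keep_last_n:] if keep_last_n > 0 else []
    lines = []
    if summary:
        lines.append(f"Summary of earlier conversation: {summary}")
    for user_msg, bot_msg in recent:
        lines.append(f"User: {user_msg}")
        lines.append(f"Assistant: {bot_msg}")
    lines.append(f"User: {user_input}")
    lines.append("Assistant:")
    return "\n".join(lines)


history = [
    ("What is LoRA?", "A low-rank adaptation method."),
    ("Is it cheap to train?", "Cheaper than full fine-tuning."),
    ("Which GPU do I need?", "A T4 is enough for 7B."),
]
prompt = build_prompt("The user is asking about LoRA.", history, "And for 30B?")
print(prompt)
```

With `keep_last_n=2`, the oldest exchange is dropped from the prompt and is represented only through the summary, so prompt length stays roughly constant as the conversation grows.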

Currently supported models

tloen/alpaca-lora-7b: the original 7B Alpaca-LoRA checkpoint by tloen (updated on 4/4/2022)
chansung/alpaca-lora-13b: the 13B Alpaca-LoRA checkpoint by me (chansung, updated on 4/4/2022)
chansung/alpaca-lora-30b: the 30B Alpaca-LoRA checkpoint by me (chansung, updated on 4/4/2022)
chansung/alpaca-lora-65b: the 65B Alpaca-LoRA checkpoint by me (chansung)
stabilityai/stablelm-tuned-alpha-7b: StableLM based fine-tuned model
beomi/KoAlpaca-Polyglot-12.8B: Polyglot based Alpaca style instruction fine-tuned model

Instructions

Prerequisites

Note that the code only works with Python >= 3.9.

$ conda create -n llm-serve python=3.9
$ conda activate llm-serve

Install dependencies. Update the gradio version as needed (gradio > 3.25 will display code blocks correctly).

$ cd LLM-As-Chatbot
$ pip install -r requirements.txt

Run Gradio application

### for Alpaca 7B 
$ BASE_URL=decapoda-research/llama-7b-hf
$ FINETUNED_CKPT_URL=tloen/alpaca-lora-7b
$ GEN_CONFIG=configs/gen_config_default.yaml
$ SUMMARIZE_GEN_CONFIG=configs/gen_config_summarization_default.yaml

$ python app.py --base-url $BASE_URL \
  --ft-ckpt-url $FINETUNED_CKPT_URL \
  --gen-config-path $GEN_CONFIG \
  --gen-config-summarization-path $SUMMARIZE_GEN_CONFIG
  
### for StableLM 7B   
$ BASE_URL=stabilityai/stablelm-tuned-alpha-7b
$ GEN_CONFIG=configs/gen_config_stablelm.yaml
$ SUMMARIZE_GEN_CONFIG=configs/gen_config_summarization_stablelm.yaml

$ python app.py --base-url $BASE_URL \
  --gen-config-path $GEN_CONFIG \
  --gen-config-summarization-path $SUMMARIZE_GEN_CONFIG

The following flags are supported:

usage: app.py [-h] [--base-url BASE_URL] [--ft-ckpt-url FT_CKPT_URL] [--port PORT] [--share] [--gen-config-path GEN_CONFIG_PATH] [--gen-config-summarization-path GEN_CONFIG_SUMMARIZATION_PATH] [--multi-gpu] [--force-download_ckpt] [--chat-only-mode]

Gradio Application for LLM as a chatbot service

options:
  -h, --help            show this help message and exit
  --base-url BASE_URL   Hugging Face Hub URL
  --ft-ckpt-url FT_CKPT_URL
                        Hugging Face Hub URL
  --port PORT           PORT number where the app is served
  --share               Create and share temporary endpoint (useful in Colab env)
  --gen-config-path GEN_CONFIG_PATH
                        path to GenerationConfig file
  --gen-config-summarization-path GEN_CONFIG_SUMMARIZATION_PATH
                        path to GenerationConfig file used in context summarization
  --multi-gpu           Enable multi gpu mode. This will force not to use Int8 but float16, so you need to check if your system has enough GPU memory
  --force-download_ckpt
                        Force to download ckpt instead of using cached one
  --chat-only-mode      Only show chatting window. Otherwise, other components will be appeared for more sophisticated control
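For readers who want to script around these flags, the help output above corresponds to a parser along the following lines. This is a minimal sketch reconstructed from the usage text only, not the actual contents of app.py: the argument types (for example `int` for `--port`) are assumptions, and no default values are set because the help output does not show them.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Reconstruct the CLI from the help text (types are assumptions)."""
    parser = argparse.ArgumentParser(
        prog="app.py",
        description="Gradio Application for LLM as a chatbot service",
    )
    parser.add_argument("--base-url", type=str, help="Hugging Face Hub URL")
    parser.add_argument("--ft-ckpt-url", type=str, help="Hugging Face Hub URL")
    parser.add_argument("--port", type=int, help="PORT number where the app is served")
    parser.add_argument("--share", action="store_true",
                        help="Create and share temporary endpoint")
    parser.add_argument("--gen-config-path", type=str,
                        help="path to GenerationConfig file")
    parser.add_argument("--gen-config-summarization-path", type=str,
                        help="path to GenerationConfig file used in context summarization")
    parser.add_argument("--multi-gpu", action="store_true",
                        help="Enable multi gpu mode (float16 instead of Int8)")
    # The mixed hyphen/underscore spelling is kept exactly as in the usage text.
    parser.add_argument("--force-download_ckpt", action="store_true",
                        help="Force to download ckpt instead of using cached one")
    parser.add_argument("--chat-only-mode", action="store_true",
                        help="Only show chatting window")
    return parser


args = build_parser().parse_args([
    "--base-url", "stabilityai/stablelm-tuned-alpha-7b",
    "--gen-config-path", "configs/gen_config_stablelm.yaml",
    "--force-download_ckpt",
])
print(args.base_url, args.ft_ckpt_url, args.force_download_ckpt)
```

Because `--ft-ckpt-url` is optional in the usage line, omitting it simply leaves the value unset in this sketch.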

Todos

Gradio components to control the configurations of the generation
LLaMA based Dolly and Flan based Alpaca models
Multiple conversation management
Implement a server-only option with FastAPI
ChatGPT plugin-like features

Acknowledgements

I am thankful to Jarvislabs.ai, who generously provided free GPU resources to experiment with Alpaca-LoRA deployment and share it with the community to try out.
I am thankful to Common Computer, who generously provided an A100 (40G) x 8 DGX workstation for fine-tuning the models.
