yuyux9 / issho

A text generation web ui that doesn't suck, for models like LLaMA

Home Page: https://issho.ai

An open-source, less-sucky text generation web UI for Large Language Models like LLaMA.

Try it here: https://issho.ai

Features

  • Can run the LLaMA 30B model (4-bit) on an RTX 4090 with no out-of-memory problems (see the estimate below)
  • Persistent settings: the prompt and generation settings are saved in localStorage
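
A rough back-of-the-envelope estimate (illustrative arithmetic only, not a measured figure) of why a 4-bit 30B model fits in a 4090's 24 GiB:

params = 30e9            # ~30B parameters
bits_per_param = 4       # 4-bit GPTQ quantization
weight_gib = params * bits_per_param / 8 / 2**30
print(f"~{weight_gib:.1f} GiB of weights")  # ~14 GiB, leaving roughly 10 GiB for activations and the KV cache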

Why not oobabooga/text-generation-webui?

(These complaints refer to the March 2023 version of oobabooga.)

  • UI bugs: in oobabooga, the output shown in the UI is wrong when the input is truncated due to length, and the generated text may not be shown at all. This is a serious bug for writing long stories. In general, the oobabooga UI is clunky and not fun to use.
  • Out-of-memory errors: in oobabooga, inference with long prompts and borderline amounts of VRAM leads to CUDA out-of-memory errors in PyTorch, largely because of memory fragmentation. Issho doesn't have this problem: it can run a 30B model on an RTX 4090 without issues, even with prompts that span the entire context window.
  • Single-user only: oobabooga serves one user at a time, whereas Issho places additional users into a waiting queue.
  • Synchronous: in oobabooga, inference calls block. Issho runs inference from an asyncio event loop that also handles multiple user websocket connections (see the sketch after this list).
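
To illustrate the queueing model (a minimal sketch, not Issho's actual code; it assumes a recent version of the third-party websockets package and a placeholder generate() function standing in for the real model call), a single asyncio event loop can accept many websocket connections while one worker drains a shared queue, so only one generation runs at a time and everyone else waits:

import asyncio
import websockets  # third-party package: pip install websockets

request_queue: asyncio.Queue = asyncio.Queue()

def generate(prompt: str) -> str:
    # Placeholder for the real (blocking) model inference call.
    return f"echo: {prompt}"

async def worker():
    # Single consumer: only one generation runs at a time;
    # other users simply wait in the queue.
    loop = asyncio.get_running_loop()
    while True:
        websocket, prompt = await request_queue.get()
        output = await loop.run_in_executor(None, generate, prompt)
        await websocket.send(output)
        request_queue.task_done()

async def handler(websocket):
    # Each connected user pushes prompts onto the shared queue.
    async for prompt in websocket:
        await request_queue.put((websocket, prompt))

async def main():
    asyncio.create_task(worker())
    async with websockets.serve(handler, "localhost", 8765):
        await asyncio.Future()  # run forever

if __name__ == "__main__":
    asyncio.run(main())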

Setup

Frontend

yarn install
yarn dev # dev build
yarn build && yarn start # prod build

Backend

Dependencies

sudo apt install build-essential
git submodule update --init --recursive
conda create -n textgen python=3.10.9
conda activate textgen
mamba install pytorch-cuda=11.8 -c pytorch -c nvidia
mamba install -c "nvidia/label/cuda-11.8.0" cuda-toolkit # nvcc should be version 11.8
cd repositories/GPTQ-for-LLaMa
python setup_cuda.py install
pip install -r requirements.txt
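
A quick sanity check after the installs (an illustrative snippet, not part of the repo) confirms that PyTorch was built against CUDA 11.8 and can see the GPU:

import torch

print(torch.version.cuda)          # expected: 11.8
print(torch.cuda.is_available())   # expected: True
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # e.g. the RTX 4090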

Models (example: LLaMA 30B, 4-bit)

mkdir models/
cd models/
wget https://raw.githubusercontent.com/qwopqwop200/GPTQ-for-LLaMa/triton/convert_llama_weights_to_hf.py
python convert_llama_weights_to_hf.py --input_dir /path/to/downloaded/llama/weights --model_size 30B --output_dir ./llama-30b
wget https://huggingface.co/maderix/llama-65b-4bit/resolve/main/llama30b-4bit.pt
mv llama30b-4bit.pt llama-30b-4bit.pt # rename the checkpoint to match the model directory name
ls # output: "llama-30b  llama-30b-4bit.pt"
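
As a quick check that the conversion produced a loadable model directory (an illustrative snippet run from the models/ directory, assuming a transformers version with LLaMA support; it is not part of the repo), the tokenizer and config should load without touching the GPU:

from transformers import LlamaConfig, LlamaTokenizer

# Only the tokenizer and config are loaded, so this needs little RAM and no VRAM.
tokenizer = LlamaTokenizer.from_pretrained("./llama-30b")
config = LlamaConfig.from_pretrained("./llama-30b")
print(config.num_hidden_layers, tokenizer.encode("Hello"))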

Nginx

Set up a reverse proxy that points / to the Next.js frontend and /ws to the backend websocket server. For /ws, the proxy must forward the websocket Upgrade and Connection headers so connections can be upgraded.

