tyshiwo1 / GPT4Tools

GPT4Tools is an intelligent system that can automatically decide, control, and utilize different visual foundation models, allowing the user to interact with images during a conversation.

Home Page:http://gpt4tools.github.io

EFE: Empowering Large Language Models with Fine-grained Image Editing

Tengyao, Sun Kaiyue, Li Maomao

Our work is based on GPT4Tools (https://github.com/AILab-CVC/GPT4Tools), a centralized system that can control multiple visual foundation models. GPT4Tools is built on Vicuna (LLaMA) and 71K self-built instruction data. By analyzing the language content, GPT4Tools can automatically decide, control, and utilize different visual foundation models, allowing the user to interact with images during a conversation. This gives a seamless and efficient way to fulfill various image-related requirements in a conversation. We empower Large Language Models with MagicBrush image editing ability.


| Data file name | Size | OneDrive | Google Drive |
|---|---|---|---|
| gpt4tools_71k.json | 229 MB | link | link |
| gpt4tools_val_seen.json | -- | link | link |
| gpt4tools_test_unseen.json | -- | link | link |

gpt4tools_71k.json contains 71K instruction-following data we used for fine-tuning the GPT4Tools model.

The data collection process is illustrated below:
We fed GPT-3.5 with captions from 3K images and descriptions of 22 visual tasks. This produced 66K instructions, each corresponding to a specific visual task and a visual foundation model (tool). We then removed duplicate instructions and retained 41K sound instructions. To teach the model to use tools in a predefined manner, we followed the prompt format used in Visual ChatGPT and converted these instructions into a conversational format. In parallel, we generated negative data without tool usage by randomly sampling 3K instructions from alpaca_gpt4_data and converting them to the same format. Using the resulting 71K instructions, we fine-tuned Vicuna with LoRA to obtain GPT4Tools, which can automatically decide, control, and utilize distinct tools in a conversation.
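
The deduplicate-then-mix step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the function names, the case-insensitive exact-match deduplication rule, and the record fields `instruction` and `uses_tool` are all assumptions made for the example.

```python
import random


def deduplicate(instructions):
    """Drop repeated instructions, keeping the first occurrence of each.

    Assumption: two instructions are duplicates when they match after
    stripping whitespace and lowercasing. The original filter may differ.
    """
    seen = set()
    unique = []
    for text in instructions:
        key = text.strip().lower()
        if key and key not in seen:
            seen.add(key)
            unique.append(text)
    return unique


def build_training_set(tool_instructions, plain_instructions, num_negative, seed=0):
    """Mix tool-use samples with randomly sampled no-tool (negative) samples."""
    rng = random.Random(seed)
    positives = [{"instruction": t, "uses_tool": True}
                 for t in deduplicate(tool_instructions)]
    # Never ask for more negatives than are available.
    sampled = rng.sample(plain_instructions, min(num_negative, len(plain_instructions)))
    negatives = [{"instruction": t, "uses_tool": False} for t in sampled]
    mixed = positives + negatives
    rng.shuffle(mixed)
    return mixed


if __name__ == "__main__":
    tools = ["Segment the dog in the image.", "segment the dog in the image. ",
             "Generate a caption for the photo."]
    plain = ["Write a haiku about autumn.", "Explain recursion briefly.",
             "List three prime numbers."]
    data = build_training_set(tools, plain, num_negative=2)
    print(len(data))  # 2 unique tool instructions + 2 negatives = 4
```

The negative samples matter because they show the model cases where no tool should be called at all.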

gpt4tools_val_seen.json is the manually cleaned instruction data used for validation. It contains instructions for tools that also appear in gpt4tools_71k.json.

gpt4tools_test_unseen.json is the cleaned instruction data used for testing. It contains instructions for some tools that are absent from gpt4tools_71k.json.
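
To get a quick look at any of these files after downloading it, something like the following should work. It is a sketch that assumes each file holds a single JSON array of records; the helper name `summarize_annotations` is hypothetical, and the field names are read from the data rather than assumed.

```python
import json
from collections import Counter


def summarize_annotations(path):
    """Return (record count, Counter of top-level keys) for a JSON array file."""
    with open(path, "r", encoding="utf-8") as f:
        records = json.load(f)
    if not isinstance(records, list):
        raise ValueError(f"expected a JSON array in {path}, got {type(records).__name__}")
    key_counts = Counter()
    for record in records:
        if isinstance(record, dict):
            key_counts.update(record.keys())
    return len(records), key_counts
```

For example, `count, keys = summarize_annotations("gpt4tools_val_seen.json")` reports how many records the file has and which fields they carry.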

Data Generation



GPT4Tools mainly contains three parts: an LLM for instruction, LoRA for adaptation, and a Visual Agent for the provided functions. It is a flexible system that can be easily extended to support more tools and functions. For example, users can replace the existing LLM or tools with their own models, or add new tools to the system. The only requirement is to fine-tune the LoRA with the provided instructions, which teaches the LLM to use the provided tools.
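
As a rough mental model of the Visual Agent part, the LLM's text output names a tool and its input, and the agent routes that request to the matching function. The toy sketch below illustrates only that routing idea. The `Action:` / `Action Input:` line format, the `TOOLS` registry, and the stand-in tool functions are assumptions made for illustration and are not taken from the GPT4Tools code.

```python
import re


# Stand-in tools: real tools would wrap visual foundation models.
def caption_image(image_path):
    return f"[caption for {image_path}]"


def detect_objects(image_path):
    return f"[boxes for {image_path}]"


# Hypothetical registry mapping tool names to callables.
TOOLS = {
    "Image Captioning": caption_image,
    "Object Detection": detect_objects,
}

# Assumed output format: an "Action:" line followed by an "Action Input:" line.
ACTION_PATTERN = re.compile(r"Action:\s*(.+?)\s*\nAction Input:\s*(.+)")


def dispatch(llm_output, tools=TOOLS):
    """Run the tool named in the LLM output, or return the text unchanged."""
    match = ACTION_PATTERN.search(llm_output)
    if match is None:
        return llm_output  # no tool requested: plain conversational reply
    name, tool_input = match.group(1), match.group(2).strip()
    if name not in tools:
        raise KeyError(f"unknown tool: {name}")
    return tools[name](tool_input)
```

Under this model, adding a tool means registering one more callable and fine-tuning the LoRA so the LLM learns when to name it.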



git clone https://github.com/tyshiwo1/GPT4Tools
cd GPT4Tools
pip install -r requirements.txt


GPT4Tools is based on Vicuna. To comply with the LLaMA model license, we release only the LoRA weights of GPT4Tools. You can merge our LoRA weights with the Vicuna weights to obtain the GPT4Tools weights.


  1. Get the original LLaMA weights in the Hugging Face format from here.
  2. Use FastChat to obtain the Vicuna weights by applying the delta weights; for more details, please check here.
  3. Get the LoRA weights of GPT4Tools (Hugging Face, OneDrive, or Google Drive).

Serving with Web GUI

Launch a Gradio interface on your own device:

# Suggested setup for 1 GPU
python gpt4tools.py \
	--base_model <path_to_vicuna_with_tokenizer> \
	--lora_model <path_to_lora_weights> \
	--llm_device "cpu" \
	--load "Text2Box_cuda:0,Segmenting_cuda:0,Inpainting_cuda:0,ImageCaptioning_cuda:0"
# Suggested setup for 4 GPUs
python gpt4tools.py \
	--base_model <path_to_vicuna_with_tokenizer> \
	--lora_model <path_to_lora_weights> \
	--llm_device "cuda:3" \
	--load "Text2Box_cuda:0,Segmenting_cuda:0,Inpainting_cuda:0,ImageCaptioning_cuda:0,Text2Image_cuda:1,VisualQuestionAnswering_cuda:1,InstructPix2Pix_cuda:2,

You can customize which tools are loaded by passing {tools_name}_{devices} entries, separated by commas, to the --load argument of gpt4tools.py. The available tools_name values are listed in tools.md.
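
When the tool list gets long, it can be easier to assemble the `--load` value programmatically. The helper below is a small sketch of the `{tools_name}_{devices}` convention; the function name `build_load_arg` is hypothetical, and the device check assumes that only `cpu` and `cuda:N` are valid, which goes beyond what the text above states.

```python
import re

# Assumption: devices are either "cpu" or "cuda:<index>".
DEVICE_PATTERN = re.compile(r"^(cpu|cuda:\d+)$")


def build_load_arg(assignments):
    """Join {tool_name: device} pairs into a comma-separated --load value."""
    parts = []
    for tool_name, device in assignments.items():
        if not DEVICE_PATTERN.match(device):
            raise ValueError(f"unexpected device {device!r} for tool {tool_name!r}")
        parts.append(f"{tool_name}_{device}")
    return ",".join(parts)


if __name__ == "__main__":
    # Reproduces the single-GPU tool assignment shown earlier.
    print(build_load_arg({
        "Text2Box": "cuda:0",
        "Segmenting": "cuda:0",
        "Inpainting": "cuda:0",
        "ImageCaptioning": "cuda:0",
    }))
```

The resulting string can be pasted in quotes after `--load`.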

Finetuning with LoRA

# Training with 8 GPUs
torchrun --nproc_per_node=8 --master_port=29005 lora_finetune.py \
	--base_model <path_to_vicuna_with_tokenizer> \
	--data_path <path_to_gpt4tools_71k.json> \
	--output_dir output/gpt4tools \
	--prompt_template_name gpt4tools \
	--num_epochs 6 \
	--batch_size 512 \
	--cutoff_len 2048 \
	--group_by_length \
	--lora_target_modules '[q_proj,k_proj,v_proj,o_proj]' \
	--lora_r 16 \

| Hyperparameter | Global Batch Size | Learning rate | Max length | Weight decay | LoRA attention dimension (lora_r) | LoRA scaling alpha (lora_alpha) | LoRA dropout (lora_dropout) | Modules to apply LoRA (lora_target_modules) |
|---|---|---|---|---|---|---|---|---|
| GPT4Tools & Vicuna-13B | 512 | 3e-4 | 2048 | 0.0 | 16 | 16 | 0.05 | [q_proj,k_proj,v_proj,o_proj] |
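
Two quantities follow from these hyperparameters by simple arithmetic: how many gradient accumulation steps are needed to reach the global batch size, and the effective LoRA scaling factor. The sketch below assumes the common convention that the global batch equals per-GPU micro batch × number of GPUs × accumulation steps, and the standard LoRA scaling of `lora_alpha / lora_r`. The per-GPU micro batch size is not given in this section, so the value 8 used in the example is only a placeholder.

```python
def gradient_accumulation_steps(global_batch_size, micro_batch_size, num_gpus):
    """Accumulation steps so micro_batch_size * num_gpus * steps == global_batch_size."""
    per_step = micro_batch_size * num_gpus
    if per_step <= 0 or global_batch_size % per_step != 0:
        raise ValueError(
            f"global batch {global_batch_size} is not a multiple of "
            f"micro_batch_size * num_gpus = {per_step}"
        )
    return global_batch_size // per_step


def lora_scaling(lora_alpha, lora_r):
    """Standard LoRA scaling factor applied to the low-rank update."""
    return lora_alpha / lora_r


if __name__ == "__main__":
    # Global batch 512 on 8 GPUs; micro batch size 8 is a placeholder assumption.
    print(gradient_accumulation_steps(512, micro_batch_size=8, num_gpus=8))  # 8
    print(lora_scaling(lora_alpha=16, lora_r=16))  # 1.0
```

With fewer GPUs, the same global batch size can be kept by raising the number of accumulation steps accordingly.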

Inference and Evaluation

  • Using 8 GPUs (recommended)
bash scripts/batch_inference.sh 8 <path_to_vicuna_with_tokenizer> <path_to_lora_weights> <your_annotation_path> <name_to_save>
  • Using 1 GPU
python3 inference.py --base_model <path_to_vicuna_with_tokenizer> \
	--lora_model <path_to_lora_weights> \
	--ann_path <your_annotation_path> \
	--save_name <name_to_save> \
	--llm_device 'cuda'


python3 evaluate_result.py --ann_path <your_annotation_path> \
	--save_name <name_to_save>
  • Inference using GPT-3.5
python3 inference_chatgpt.py --ann_path <your_annotation_path> \
	--save_name <name_to_save> \
	--model 'davinci'

The OpenAI API key must be set in the OPENAI_API_KEY environment variable.

  • <your_annotation_path> is either 'your_path/gpt4tools_val_seen.json' or 'your_path/gpt4tools_test_unseen.json'.
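
Before starting a GPT-3.5 run, a small pre-flight check can catch a missing key or a mistyped annotation path early. This is only a convenience sketch; the helper name `preflight` is hypothetical and is not part of the repository scripts.

```python
import os


def preflight(ann_path, environ=None):
    """Return a list of human-readable problems with the inference inputs."""
    environ = os.environ if environ is None else environ
    problems = []
    if not environ.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")
    if not os.path.isfile(ann_path):
        problems.append(f"annotation file not found: {ann_path}")
    return problems


if __name__ == "__main__":
    for problem in preflight("your_path/gpt4tools_val_seen.json"):
        print("warning:", problem)
```

An empty list means both inputs look usable.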


  • VisualChatGPT: It connects ChatGPT and a series of Visual Foundation Models to enable sending and receiving images during chatting.
  • Vicuna: The language ability of Vicuna is fantastic and amazing. And it is open-source!
  • Alpaca-LoRA: Instruct-tune LLaMA on consumer hardware.

