Welcome to the "heron" repository. Heron is a library that seamlessly integrates multiple Vision and Language models, as well as Video and Language models. One of its standout features is its support for Japanese V&L models. We also provide pretrained weights trained on various datasets.
Multimodal demo pages built with different LLMs are available online (both in Japanese).
Heron allows you to configure your own V&L models by combining various modules: the Vision Encoder, Adapter, and LLM are all selected in a configuration file, as shown in the Training section below. The distributed training method and the datasets used for training can also be configured easily.
Installation
1. Clone this repository
```bash
git clone https://github.com/turingmotors/heron.git
cd heron
```
2. Install Packages
We recommend using a virtual environment to install the required packages. If you want to install the packages globally, use `pip install -r requirements.txt` instead.
2-a. Poetry (Recommended)
Using pyenv and Poetry, you can install the required packages as follows:
```bash
# install pyenv environment
pyenv install 3.10
pyenv local 3.10

# install packages from pyproject.toml
poetry install

# install local package
pip install --upgrade pip  # enable PEP 660 support
pip install -e .

# for development, install pre-commit
pre-commit install
```
2-b. Anaconda
Using Anaconda, you can install the required packages as follows:
```bash
conda create -n heron python=3.10 -y
conda activate heron
pip install --upgrade pip  # enable PEP 660 support
pip install -r requirements.txt
pip install -e .

# for development, install pre-commit
pre-commit install
```
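Whichever option you choose, you can verify the editable install with a quick import check (a minimal sketch; `heron` is the local package installed by `pip install -e .`):

```python
# sanity check: confirm the locally installed package is importable
import heron

print(heron.__file__)  # should point into your cloned heron/ directory
```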
3. Register for Llama-2 models
To use Llama-2 models, you first need to request access to them on the Hugging Face page and the Meta website.
Once access is granted, sign in to your Hugging Face account:
```bash
huggingface-cli login
```
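To confirm the login succeeded, one quick check (using the `huggingface_hub` client that is installed alongside `transformers`) is:

```python
# verify that a Hugging Face token is configured; raises if you are not logged in
from huggingface_hub import whoami

print(whoami()["name"])  # your Hugging Face username
```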
Training
For training, use the YAML configuration files under the projects directory. For example, [projects/opt/exp001.yml](./projects/opt/exp001.yml) has the following contents:
```yaml
training_config:
  per_device_train_batch_size: 2
  gradient_accumulation_steps: 4
  num_train_epochs: 1
  dataloader_num_workers: 16
  fp16: true
  optim: "adamw_torch"
  learning_rate: 5.0e-5
  logging_steps: 100
  evaluation_strategy: "steps"
  save_strategy: "steps"
  eval_steps: 4000
  save_steps: 4000
  save_total_limit: 1
  deepspeed: ./configs/deepspeed/ds_config_zero1.json
  output_dir: ./output/
  report_to: "wandb"

model_config:
  fp16: true
  pretrained_path: # None or path to model weight
  model_type: git_llm
  language_model_name: facebook/opt-350m
  vision_model_name: openai/clip-vit-base-patch16
  num_image_with_embedding: 1 # if 1, no img_temporal_embedding
  max_length: 512
  keys_to_finetune:
    - visual_projection
    - num_image_with_embedding
  keys_to_freeze: []

  use_lora: true
  lora:
    r: 8
    lora_alpha: 32
    target_modules:
      - q_proj
      - k_proj
      - v_proj
    lora_dropout: 0.01
    bias: none
    task_type: CAUSAL_LM

dataset_config_path:
  - ./configs/datasets/m3it.yaml
```
`training_config` sets the training configuration, `model_config` sets the model configuration, and `dataset_config_path` sets the dataset configuration.
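As a rough illustration, the file parses into these three top-level sections; a minimal sketch for inspecting it (assuming PyYAML is installed; heron's training entry point does its own parsing):

```python
# inspect the experiment config; keys follow the YAML shown above
import yaml

with open("projects/opt/exp001.yml") as f:
    cfg = yaml.safe_load(f)

print(cfg["training_config"]["learning_rate"])     # 5e-05
print(cfg["model_config"]["language_model_name"])  # facebook/opt-350m
print(cfg["dataset_config_path"])                  # ['./configs/datasets/m3it.yaml']
```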
`model_type` selects the LLM module; for example, `git_llm` pairs a GIT adapter with an LLM such as OPT or LLaMA (see the pretrained models below for concrete combinations). We plan to add more supported modules in the future.
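For reference, the `lora` section of the config above corresponds roughly to a PEFT `LoraConfig`; a hedged sketch (the actual wiring happens inside heron's training code):

```python
# the field names mirror the YAML's lora section above
from peft import LoraConfig

lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj"],
    lora_dropout=0.01,
    bias="none",
    task_type="CAUSAL_LM",
)
```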
To start training, execute the following command:
```bash
./scripts/run.sh
```
A GPU is required for training; we have tested on Ubuntu 20.04 with CUDA 11.7.
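Before launching, a quick way to confirm that a CUDA device is visible (a minimal check, not part of heron itself):

```python
# confirm PyTorch can see at least one CUDA device before training
import torch

assert torch.cuda.is_available(), "no CUDA device visible"
print(torch.cuda.device_count(), torch.cuda.get_device_name(0))
```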
Evaluation
You can get the pretrained weights from the Hugging Face Hub: turing-motors/heron-chat-git-ja-stablelm-base-7b-v0
See also the notebooks.
```python
import requests
from PIL import Image

import torch
from transformers import AutoProcessor
from heron.models.git_llm.git_llama import GitLlamaForCausalLM

device_id = 0

# prepare a pretrained model
model = GitLlamaForCausalLM.from_pretrained('turing-motors/heron-chat-git-ja-stablelm-base-7b-v0')
model.eval()
model.to(f"cuda:{device_id}")

# prepare a processor
processor = AutoProcessor.from_pretrained('turing-motors/heron-chat-git-ja-stablelm-base-7b-v0')

# prepare inputs
url = "https://www.barnorama.com/wp-content/uploads/2016/12/03-Confusing-Pictures.jpg"
image = Image.open(requests.get(url, stream=True).raw)

text = "##Instruction: Please answer the following question concretely. ##Question: What is unusual about this image? Explain precisely and concretely what he is doing? ##Answer: "

# do preprocessing
inputs = processor(
    text,
    image,
    return_tensors="pt",
    truncation=True,
)
inputs = {k: v.to(f"cuda:{device_id}") for k, v in inputs.items()}

# set eos token
eos_token_id_list = [
    processor.tokenizer.pad_token_id,
    processor.tokenizer.eos_token_id,
]

# do inference
with torch.no_grad():
    out = model.generate(**inputs, max_length=256, do_sample=False, temperature=0., eos_token_id=eos_token_id_list)

# print result
print(processor.tokenizer.batch_decode(out))
```
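The decoded output still contains the prompt; a hedged post-processing sketch that keeps only the generated answer (relying on the `##Answer:` delimiter from the prompt format above):

```python
# strip special tokens, then drop everything up to the answer delimiter
decoded = processor.tokenizer.batch_decode(out, skip_special_tokens=True)[0]
answer = decoded.split("##Answer:")[-1].strip()
print(answer)
```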
Pretrained Models
| model | LLM module | adapter | size |
|---|---|---|---|
| heron-chat-blip-ja-stablelm-base-7b-v0 | Japanese StableLM Base Alpha | BLIP | 7B |
| heron-chat-git-ja-stablelm-base-7b-v0 | Japanese StableLM Base Alpha | GIT | 7B |
| heron-chat-git-ELYZA-fast-7b-v0 | ELYZA | GIT | 7B |
| heron-preliminary-git-Llama-2-70b-v0 *1 | Llama-2 | GIT | 70B |

*1 This model covers only the pre-training of the adapter.
Datasets
The LLaVA-Instruct dataset translated into Japanese:
- LLaVA-Instruct-150K-JA
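A hedged sketch for loading it with the `datasets` library (the Hub id `turing-motors/LLaVA-Instruct-150K-JA` is an assumption; adjust to the actual repository):

```python
# load the Japanese LLaVA-Instruct translation from the Hugging Face Hub
from datasets import load_dataset

ds = load_dataset("turing-motors/LLaVA-Instruct-150K-JA")  # hypothetical id
print(ds)
```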
Organization
This project is developed by Turing Inc. (turingmotors on GitHub).
License
Released under the Apache License 2.0.
Acknowledgements
- GenerativeImage2Text: The main idea of the model is based on the original GIT.
- LLaVA: This project learned a lot from the great LLaVA project.
- GIT-LLM