ByungKwanLee / TroL

Official PyTorch implementation of Traversal of Layers (TroL), which introduces a new propagation operation to achieve strong vision-language performance. (Under Review)

TroL: Traversal of Layers for Large Language and Vision Models [ArXiv]

📰 News

Thanks to the Hugging Face staff, each user can use a free ZeroGPU (NVIDIA A100), but the number of queries is limited, so if inference gets stuck, please wait a few minutes. (The local demo is much faster than this online GPU space.)

Official PyTorch implementation of Traversal of Layers (TroL), which improves performance on numerous vision-language benchmarks at an efficient model size. The code was developed from scratch, so I have tried to make it more readable and simpler than LLaVA, whose code is relatively complex in structure.

💡 Highlighted Images

Figure 1. TroL Layer. New Propagation.

Figure 2. Structure of TroL-Mixer.

Figure 3. Performances across numerous model sizes.

Figure 4. Comparison with Closed-source LLVMs.

Figure 5. Investigating where layer traversing (reusing layers) mostly happens.

📊 Results

Open-source LLVMs with Standard Model Size

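| LLVMs | SQA-IMG | POPE | MME | MMB | MathVista | SEED-IMG | MM-Vet | LLaVA-W |
|---|---|---|---|---|---|---|---|---|
| Yi-VL-6B | 71.7 | 82.5 | 1915 | 64.2 | 29.7 | 67.5 | 32.1 | 51.9 |
| LLaVA-NeXT-7B | 70.1 | 86.5 | 1851 | 69.6 | 34.6 | 70.2 | 43.9 | 72.3 |
| MM1-7B | 72.6 | 86.6 | 1858 | 72.3 | 35.9 | 70.9 | 42.1 | - |
| TroL-1.8B | 87.5 | 88.6 | 2038 | 76.1 | 45.4 | 69.0 | 45.1 | 69.7 |
| TroL-3.8B | 90.8 | 86.5 | 1980 | 79.2 | 55.1 | 70.5 | 51.1 | 76.6 |
| TroL-7B | 92.8 | 87.8 | 2308 | 83.5 | 51.8 | 75.3 | 54.7 | 92.8 |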

Open-source LLVMs with Large Model Sizes

| LLVMs | AI2D | ChartQA | MME | MMB | MathVista | MM-Vet | LLaVA-W |
|---|---|---|---|---|---|---|---|
| InternVL1.5-40B | 79.0 | 68.0 | 2175 | 82.2 | 47.7 | 48.9 | - |
| InternVL1.5-26B | 80.7 | 83.8 | 2188 | 82.2 | 53.5 | 62.8 | - |
| MM1-30B | - | - | 2069 | 75.1 | 39.4 | 48.7 | - |
| MiniGemini-34B | - | - | 2105 | 79.6 | 38.9 | 53.0 | - |
| MiniGemini-HD-34B | - | - | 2141 | 80.6 | 43.3 | 59.3 | - |
| LLaVA-NeXT-34B | 74.9 | 68.7 | 2030 | 79.3 | 46.0 | 57.4 | 88.8 |
| LLaVA-NeXT-8B | 71.6 | 69.5 | 1972 | 72.1 | 37.5 | - | 80.1 |
| LLaVA-NeXT-72B | 77.4 | 77.0 | 2159 | 80.5 | 46.6 | - | 89.2 |
| LLaVA-NeXT-110B | 80.4 | 80.4 | 2201 | 80.5 | 49.0 | - | 90.4 |
| TroL-1.8B | 68.9 | 64.0 | 2038 | 76.1 | 45.4 | 45.1 | 69.7 |
| TroL-3.8B | 73.6 | 73.8 | 1980 | 79.2 | 55.1 | 51.1 | 76.6 |
| TroL-7B | 78.5 | 71.2 | 2308 | 83.5 | 51.8 | 54.7 | 92.8 |

Closed-source LLVMs

| LLVMs | SQA-IMG | AI2D | ChartQA | MME | MMB | MathVista | SEED-IMG | MMStar |
|---|---|---|---|---|---|---|---|---|
| Qwen-VL-Plus | 71.6 | 75.9 | 78.1 | 2183 | 67.0 | 43.3 | 72.7 | 39.7 |
| Gemini-Pro | 80.1 | 73.9 | 74.1 | 1933 | 73.6 | 45.2 | 70.7 | 41.6 |
| GPT-4V | 84.6 | 78.2 | 78.5 | 1927 | 77.0 | 49.9 | 69.1 | 46.1 |
| TroL-1.8B | 87.5 | 68.9 | 64.0 | 2038 | 76.1 | 45.4 | 69.0 | 45.5 |
| TroL-3.8B | 90.8 | 73.6 | 73.8 | 1980 | 79.2 | 55.1 | 70.5 | 46.5 |
| TroL-7B | 92.8 | 78.5 | 71.2 | 2308 | 83.5 | 51.8 | 75.3 | 51.3 |

📋 Visual Instruction Tuning Dataset Description for TroL

Total: 2273830 (2.3M)

------------------------------
* Real-World Image: 755k
* Real-World Text: 143k
* Document & Chart & Diagram & Sign & Symbol: 627k
* Math: 747k
    - Math with Vision: 180k
    - Math with Text only: 566k
------------------------------

- ShareGPT4V-Caption [without SAM] (91021, 91k)
- ShareGPT4V-Instruction [without a few samples of OCR-VQA] (664703, 664k)
- ALLAVA4V-Text (143000, 143k)
- MiniGemini-Instruction [DocVQA, ChartQA, DVQA, AI2D] (27670, 27k)
- DocDownstream (574268, 574k)
- DocReason (25877, 25k)
- GLLaVA-Align (60252, 60k)
- GLLaVA-QA (117205, 117k)
- MathVision (3040, 3k)
- MathInstruct [TextOnlyDataset] (262040, 262k)
- MathPlus [TextOnlyDataset] (304754, 304k)

The entries above come from nine source datasets (ShareGPT4V and GLLaVA each contribute two subsets). For MiniGemini, we use only the data samples for DocVQA, ChartQA, DVQA, and AI2D, so there is no need to download the full MiniGemini data.
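The category subtotals in the summary can be cross-checked against the per-dataset counts. The sketch below is only a sanity check: the counts are copied from the list, the assignment of subsets to categories is an assumption inferred from the subtotals, and `group_totals` is a hypothetical helper that is not part of the TroL code.

```python
# Per-dataset sample counts, copied from the dataset list in this README.
COUNTS = {
    "ShareGPT4V-Caption": 91021,
    "ShareGPT4V-Instruction": 664703,
    "ALLAVA4V-Text": 143000,
    "MiniGemini-Instruction": 27670,
    "DocDownstream": 574268,
    "DocReason": 25877,
    "GLLaVA-Align": 60252,
    "GLLaVA-QA": 117205,
    "MathVision": 3040,
    "MathInstruct": 262040,
    "MathPlus": 304754,
}

# Assumed grouping of subsets into the summary categories (inferred from the
# subtotals, not stated explicitly in the README).
GROUPS = {
    "Real-World Image": ["ShareGPT4V-Caption", "ShareGPT4V-Instruction"],
    "Real-World Text": ["ALLAVA4V-Text"],
    "Document & Chart & Diagram & Sign & Symbol": [
        "MiniGemini-Instruction",
        "DocDownstream",
        "DocReason",
    ],
    "Math with Vision": ["GLLaVA-Align", "GLLaVA-QA", "MathVision"],
    "Math with Text only": ["MathInstruct", "MathPlus"],
}


def group_totals(counts, groups):
    """Sum the sample counts of the subsets assigned to each category."""
    return {
        name: sum(counts[subset] for subset in members)
        for name, members in groups.items()
    }


totals = group_totals(COUNTS, GROUPS)
grand_total = sum(COUNTS.values())  # 2273830, the stated total

for name, n in totals.items():
    print(f"{name}: {n} (~{n // 1000}k)")
print(f"Total: {grand_total}")
```

Under this grouping every subset belongs to exactly one category, so the category totals add up to the overall total.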

Gathered Dataset Layout

TroL_Dataset_Path
├── llava                                                       # ShareGPT4V
│   └── llava_pretrain
│       └── images
├── coco                                                        # ShareGPT4V
│   └── train2017
├── sam                                                         # ShareGPT4V
│   └── images
├── gqa                                                         # ShareGPT4V
│   └── images
├── ocr_vqa                                                     # ShareGPT4V
│   └── images
├── textvqa                                                     # ShareGPT4V
│   └── train_images
├── vg                                                          # ShareGPT4V
│   ├── VG_100K
│   └── VG_100K_2
├── share_textvqa                                               # ShareGPT4V
│   └── images
├── web-celebrity                                               # ShareGPT4V
│   └── images
├── web-landmark                                                # ShareGPT4V
│   └── images
├── wikiart                                                     # ShareGPT4V
│   └── images
├── docvqa                                                      # MiniGemini
│   └── images
├── chartqa                                                     # MiniGemini
│   └── train
│       └── images
├── dvqa                                                        # MiniGemini
│   └── images
├── ai2d                                                        # MiniGemini
│   └── images
├── imgs                                                        # DocDownstream & DocReason
│   ├── ChartQA
│   ├── DUE_Benchmark
│   │   ├── DeepForm
│   │   ├── DocVQA
│   │   ├── InfographicsVQA
│   │   ├── KleisterCharity
│   │   ├── TabFact
│   │   └── WikiTableQuestions
│   ├── TextCaps
│   ├── TextVQA
│   └── VisualMRC
├── geo3k                                                       # GLLaVA
│   └── train
├── geoqa_plus                                                  # GLLaVA
├── images                                                      # MathVision
│
├── sharegpt4v_instruct_gpt4-vision_cap100k.json                # ShareGPT4V-Caption
├── sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.json  # ShareGPT4V-Instruction
├── Evol-Instruct-GPT4-Turbo-143K.json                          # ALLAVA4V-Text
├── train.jsonl                                                 # DocDownstream
├── detailed_explanation.jsonl                                  # DocReason
├── minigemini_instruction.json                                 # MiniGemini-Instruction
├── gllava_align.parquet                                        # GLLaVA-Align
├── gllava_qa.parquet                                           # GLLaVA-QA
├── mathvision.parquet                                          # MathVision
├── MathInstruct.json                                           # MathInstruct
└── mathplus.parquet                                            # MathPlus
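
Before training, it can help to confirm that a local copy matches this layout. The following is a minimal sketch, not part of the TroL code: `missing_entries` is a hypothetical helper, and the path lists are transcribed from the tree above, so adjust them if your copy differs.

```python
from pathlib import Path

# Leaf image directories from the layout, relative to the dataset root.
EXPECTED_DIRS = """
llava/llava_pretrain/images coco/train2017 sam/images gqa/images
ocr_vqa/images textvqa/train_images vg/VG_100K vg/VG_100K_2
share_textvqa/images web-celebrity/images web-landmark/images wikiart/images
docvqa/images chartqa/train/images dvqa/images ai2d/images
imgs/ChartQA imgs/DUE_Benchmark/DeepForm imgs/DUE_Benchmark/DocVQA
imgs/DUE_Benchmark/InfographicsVQA imgs/DUE_Benchmark/KleisterCharity
imgs/DUE_Benchmark/TabFact imgs/DUE_Benchmark/WikiTableQuestions
imgs/TextCaps imgs/TextVQA imgs/VisualMRC
geo3k/train geoqa_plus images
""".split()

# Annotation files that sit directly under the dataset root.
EXPECTED_FILES = """
sharegpt4v_instruct_gpt4-vision_cap100k.json
sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.json
Evol-Instruct-GPT4-Turbo-143K.json
train.jsonl
detailed_explanation.jsonl
minigemini_instruction.json
gllava_align.parquet
gllava_qa.parquet
mathvision.parquet
MathInstruct.json
mathplus.parquet
""".split()


def missing_entries(root):
    """Return the expected directories and files that are absent under root."""
    root = Path(root)
    missing = [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
    missing += [f for f in EXPECTED_FILES if not (root / f).is_file()]
    return missing


if __name__ == "__main__":
    # "TroL_Dataset_Path" is the placeholder root name used in the layout.
    for entry in missing_entries("TroL_Dataset_Path"):
        print(f"missing: {entry}")
```

The check only tests that each path exists; it does not verify file contents or image counts.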

📂 Evaluation Benchmarks

The evaluation datasets are listed below. Once you have downloaded them, place them in a single folder that follows this directory layout.

Evaluation Dataset Directory Layout

Evaluation_Dataset_Path
├── LLVisionQA-QBench               # Q-Bench
├── ScienceQA                       # SQA-IMG
├── ai2d                            # AI2D
├── chartqa                         # ChartQA
├── SEED-Bench                      # SEED-IMG
├── POPE                            # POPE
├── HallusionBench                  # HallusionBench
├── MME_Benchmark_release_version   # MME
├── MathVista                       # MathVista
├── MMBench                         # MMB
├── mm-vet                          # MM-Vet
├── llava-bench-in-the-wild         # LLaVA Bench in the Wild
├── MMStar                          # MMStar
├── MathVerse                       # MathVerse
└── VisualWebBench                  # VisualWebBench
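
Because the folder names differ from the benchmark names used in the result tables, a small lookup can make evaluation scripts easier to read. This is a sketch under assumptions: `BENCHMARK_DIRS`, `benchmark_path`, and `available_benchmarks` are hypothetical names, not part of the TroL code, and the mapping is transcribed from the layout above (with "LLaVA-W" standing for LLaVA Bench in the Wild, as in the tables).

```python
from pathlib import Path

# Benchmark name -> folder name under the evaluation root (from the layout).
BENCHMARK_DIRS = {
    "Q-Bench": "LLVisionQA-QBench",
    "SQA-IMG": "ScienceQA",
    "AI2D": "ai2d",
    "ChartQA": "chartqa",
    "SEED-IMG": "SEED-Bench",
    "POPE": "POPE",
    "HallusionBench": "HallusionBench",
    "MME": "MME_Benchmark_release_version",
    "MathVista": "MathVista",
    "MMB": "MMBench",
    "MM-Vet": "mm-vet",
    "LLaVA-W": "llava-bench-in-the-wild",
    "MMStar": "MMStar",
    "MathVerse": "MathVerse",
    "VisualWebBench": "VisualWebBench",
}


def benchmark_path(root, name):
    """Return the folder for a benchmark name, or raise ValueError if unknown."""
    try:
        folder = BENCHMARK_DIRS[name]
    except KeyError:
        known = ", ".join(sorted(BENCHMARK_DIRS))
        raise ValueError(f"unknown benchmark {name!r}; known: {known}") from None
    return Path(root) / folder


def available_benchmarks(root):
    """List the benchmark names whose folders exist under root."""
    return [name for name in BENCHMARK_DIRS if benchmark_path(root, name).is_dir()]


if __name__ == "__main__":
    # "Evaluation_Dataset_Path" is the placeholder root name used in the layout.
    print(available_benchmarks("Evaluation_Dataset_Path"))
```

Keeping the mapping in one place means that a renamed folder only needs to be updated once.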
