
Large Language Models for Document Visual Question Answering

This repository extends the FastChat repository, implementing baselines and multimodal adaptations for using Large Language Models (LLMs) in Document Visual Question Answering (DocVQA), as part of my Master's thesis, which can be found here.

Models and Dataset

  • LLM: We focus on the Flan-T5 model from the FastChat repository due to computational limitations.
  • Dataset: We use a subset of the SP-DocVQA dataset.

Features

  • Spatial Information Integration: Several methods incorporate spatial information from the document text into the prompts (see the first sketch after this list).
  • Visual Domain Inclusion: Visual features are extracted with a Vision Transformer (ViT) for each word in the document, enhancing the model's understanding of the visual context (see the second sketch after this list).
  • Pre-training Task: A dedicated pre-training task enables the Language Model to read and comprehend visual information.
  • Multimodal DocVQA: We propose a multimodal approach to DocVQA that represents an initial step toward a comprehensive, robust, end-to-end solution to this challenging task.

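For illustration, here is a minimal Python sketch of one way spatial information can be serialized into a prompt: appending each OCR word's normalized bounding box to the word itself. The function name, the 0-1000 coordinate grid, and the prompt template are illustrative assumptions, not the exact encoding used in this repository.

def words_to_prompt(words, boxes, page_w, page_h, question):
    """Serialize OCR words and their bounding boxes into a textual prompt.

    words: list of OCR tokens; boxes: list of (x0, y0, x1, y1) pixel boxes.
    """
    parts = []
    for word, (x0, y0, x1, y1) in zip(words, boxes):
        # Normalize coordinates to a 0-1000 grid so they tokenize compactly.
        nx0, ny0 = int(1000 * x0 / page_w), int(1000 * y0 / page_h)
        nx1, ny1 = int(1000 * x1 / page_w), int(1000 * y1 / page_h)
        parts.append(f"{word} [{nx0},{ny0},{nx1},{ny1}]")
    return (
        "Answer the question using the document below.\n"
        f"Document: {' '.join(parts)}\n"
        f"Question: {question}\n"
        "Answer:"
    )

The second sketch illustrates per-word visual feature extraction with a ViT by embedding the image crop of each word. The checkpoint name and the use of the [CLS] embedding are assumptions, not necessarily the exact setup from the thesis.

import torch
from PIL import Image
from transformers import ViTImageProcessor, ViTModel

# Illustrative checkpoint; the thesis may use a different ViT variant.
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
vit = ViTModel.from_pretrained("google/vit-base-patch16-224")

def word_features(page: Image.Image, boxes):
    """Return one ViT embedding per word by encoding each word's image crop."""
    crops = [page.crop(box) for box in boxes]  # (x0, y0, x1, y1) pixel boxes
    inputs = processor(images=crops, return_tensors="pt")
    with torch.no_grad():
        outputs = vit(**inputs)
    return outputs.last_hidden_state[:, 0]  # [CLS] vector for each crop
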
Install

git clone https://github.com/aolivtous/LLMs-for-DocVQA.git
cd LLMs-for-DocVQA
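
Assuming the modified FastChat keeps upstream FastChat's editable-install workflow (an assumption; check Modified-Fastchat for its own requirements), you will likely also need:

cd Modified-Fastchat
pip3 install -e .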

Contents

  • Configs: Contains the configuration of the SP-DocVQA dataset.
  • Utils: Contains the files to extract the data from the dataset, create the data loaders, and save them as JSON files. It also includes the files to perform the evaluation (see the ANLS sketch after this list).
  • Modified-Fastchat: Contains the FastChat repository with our modifications for using LLMs for DocVQA.

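SP-DocVQA is conventionally evaluated with ANLS (Average Normalized Levenshtein Similarity). For reference, here is a minimal Python sketch of the standard ANLS formula; whether the evaluation files in Utils implement exactly this variant is an assumption.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (ca != cb))) # substitution
        prev = curr
    return prev[-1]

def anls(prediction: str, gold_answers, tau: float = 0.5) -> float:
    """Per-question ANLS: best thresholded similarity over the gold answers."""
    def nls(pred, gold):
        pred, gold = pred.strip().lower(), gold.strip().lower()
        sim = 1 - levenshtein(pred, gold) / max(len(pred), len(gold), 1)
        return sim if sim >= tau else 0.0  # similarities below tau score 0
    return max(nls(prediction, g) for g in gold_answers)

The dataset-level score is the mean of anls over all questions.
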
Usage

Detailed examples of the specific arguments needed to call each of the training and inference files can be found in the scripts folder; a minimal inference sketch follows the file tree below.

Key files for fine-tuning and inference

Modified-Fastchat
└── fastchat
    ├── train
    │   ├── train_flant5_eval.py --> training of the Flan-T5 model in the text-to-text modality for DocVQA
    │   └── train_flant5_vision.py --> training of the Flan-T5 model in the multimodal domain for DocVQA and the pre-training task
    │
    └── serve
        ├── huggingface_api_inference.py --> inference for the text-to-text DocVQA
        ├── huggingface_api_inference_vision_DocVQA.py --> inference for the multimodal DocVQA
        └── huggingface_api_inference_vision_preTask.py --> inference for the reading pre-training task
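
At its core, text-to-text inference is plain Flan-T5 generation; the sketch below uses the Hugging Face transformers API directly. The checkpoint name and prompt are placeholders, and the repository's huggingface_api_inference.py adds its own argument handling on top of logic like this.

import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

# Placeholder checkpoint; point this at your fine-tuned model directory.
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-base")

prompt = ("Document: INVOICE No. 4521 ...\n"
          "Question: What is the invoice number?\n"
          "Answer:")
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))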
