quocthinhvo / docile

2nd LIR DocILE 2023


DocILE: Document Information Localization and Extraction Benchmark with LiLT


This repository contains our source code for both Task 1 and Task 2 of the DocILE Competition.


Introduction

DocILE is a large-scale research benchmark for cross-evaluation of machine learning methods for Key Information Localization and Extraction (KILE) and Line Item Recognition (LIR) from semi-structured business documents such as invoices and orders. Such a large-scale benchmark was previously missing (Skalický et al., 2022), hindering comparative evaluation.

Installing

Requirements

We use Python 3.9 to run this repo. Required packages:

  • poppler-utils
  • tensorboard
  • jsonargparse[signatures]
  • tensorrt==8.5.1.7
  • timm
  • transformers==4.26.0
  • datasets==2.11.0

Install

Run this command to install requirements:

pip install -r requirements.txt
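
Because several packages are pinned to exact versions, it can help to confirm after installation that the environment matches. The following is a minimal sketch, not part of this repository: the `check_pins` helper and the `PINNED` dictionary are hypothetical names, and the pins are copied from the requirements list above.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the requirements list above (assumption: they match requirements.txt).
PINNED = {"transformers": "4.26.0", "datasets": "2.11.0", "tensorrt": "8.5.1.7"}


def check_pins(pinned, lookup=version):
    """Return {package: (expected, found)} for every pin that is not satisfied."""
    mismatches = {}
    for name, expected in pinned.items():
        try:
            found = lookup(name)
        except PackageNotFoundError:
            found = None  # package is not installed at all
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches


# Report each pinned package whose installed version differs from the pin.
for name, (expected, found) in check_pins(PINNED).items():
    print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

If nothing is printed, the three pinned packages are installed at the expected versions.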

Config

  • Edit ./config/train.cfg to suit your environment.
  • Update the path to the config file in ./run_training.sh and ./run_inference.sh to suit your environment.

Resplit dataset

We decided to randomly resplit the dataset, using 80% for training and 20% for validation, to get a better score. To resplit the dataset, set dataset_path in ./data_split.py and run:

python3 data_split.py
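
For illustration, a random 80/20 split can be sketched as follows. This is not the code in data_split.py; the `split_ids` function, the fixed seed, and the `doc_000`-style IDs are assumptions made for the example.

```python
import random


def split_ids(doc_ids, train_fraction=0.8, seed=42):
    """Shuffle document IDs and split them into train and validation lists."""
    ids = sorted(doc_ids)      # sort first so the result depends only on the seed
    rng = random.Random(seed)  # local RNG, leaves the global random state untouched
    rng.shuffle(ids)
    cut = int(len(ids) * train_fraction)
    return ids[:cut], ids[cut:]


# Hypothetical document IDs; a real run would read them from the dataset.
train_ids, val_ids = split_ids([f"doc_{i:03d}" for i in range(10)])
print(len(train_ids), len(val_ids))
```

Fixing the seed keeps the split reproducible across runs, which matters when validation scores from different training runs are compared.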

Train

Run the training script:

bash run_training.sh

The models will be saved to OUPUT_DIR (set in the config).

Inference

To run inference, follow the instructions below.

  • On test set, call:
bash run_inference.sh test
  • On validation set, call:
bash run_inference.sh val

The results will be saved to PREDICTION_DIR (set in the config).

Contributing

Links

https://huggingface.co/docs/transformers/model_doc/lilt

https://docile.rossum.ai/

License

MIT License

