[AAAI2024] M2Doc: A Multi-Modal Fusion Approach for Document Layout Analysis

The paper is available at this link.

🚧 TODO List

Add training script and inference script for DINO_M2Doc.
Add training script and inference script for other detectors.
Add the data format samples for M2Doc.
Add the dataset converting scripts.
Release the Model-Zoo of M2Doc on DocLayNet.

Installation

Python=3.8.0
CUDA 10.2
transformers
MMDetection

Dataset Preparation

Download the datasets you need. Dataset download links:

Convert the datasets' OCR annotations.
Use ocr_anno_convert.py to format and sort the OCR annotations of each dataset.
Three test samples can be found in Annos.
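
To illustrate what "sorting" OCR annotations can mean, here is a minimal sketch that orders words top-to-bottom and then left-to-right. It is not the logic of ocr_anno_convert.py: the record shape ({"text", "bbox": [x, y, w, h]}), the function name sort_ocr_annotations, and the line_tol parameter are all assumptions made for this example.

```python
from typing import Dict, List


def sort_ocr_annotations(words: List[Dict], line_tol: float = 5.0) -> List[Dict]:
    """Sort OCR words into a simple reading order.

    Assumed (hypothetical) record shape: {"text": str, "bbox": [x, y, w, h]}.
    A word joins the current line when its top edge is within `line_tol`
    pixels of the top edge of that line's first word.
    """
    # Order by top edge first so lines can be grouped greedily.
    by_top = sorted(words, key=lambda w: w["bbox"][1])
    lines: List[List[Dict]] = []
    for word in by_top:
        if lines and abs(word["bbox"][1] - lines[-1][0]["bbox"][1]) < line_tol:
            lines[-1].append(word)
        else:
            lines.append([word])
    # Within each line, order by left edge.
    return [w for line in lines for w in sorted(line, key=lambda w: w["bbox"][0])]


sample = [
    {"text": "world", "bbox": [60, 11, 40, 10]},
    {"text": "Layout", "bbox": [10, 40, 50, 10]},
    {"text": "Hello", "bbox": [10, 10, 40, 10]},
]
sorted_words = [w["text"] for w in sort_ocr_annotations(sample)]
print(sorted_words)
```

Multi-column pages usually need a column-aware ordering, so treat this only as a starting point and check the samples in Annos for the real format.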

Train and Inference Steps

Install the repository (we recommend using Anaconda for installation).

conda create -n m2doc python=3.8 -y
conda activate m2doc
conda install pytorch==1.8.1 torchvision==0.9.1 torchaudio==0.8.1 cudatoolkit=10.2 -c pytorch
git clone https://github.com/johnning2333/M2Doc.git
cd M2Doc/mmdetection
pip install -v -e .
pip install transformers
pip install mmengine
# the mim command is provided by the openmim package
pip install -U openmim
mim install mmcv
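
After installation, it can help to confirm which versions actually ended up in the environment. The following is a minimal sketch using only the standard library. The helper name report_versions is hypothetical, and the distribution names in expected are assumptions based on the install commands above.

```python
import sys
from importlib import metadata
from typing import Dict, Iterable, Optional


def report_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Assumed distribution names; adjust them if your environment differs.
expected = ["torch", "torchvision", "transformers", "mmengine", "mmcv", "mmdet"]

print("python:", ".".join(map(str, sys.version_info[:3])))
for name, version in report_versions(expected).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

Any package reported as NOT INSTALLED points to an install step that was skipped or failed.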

Train

# for multi-gpu training, run from the repository root; the last argument is the number of GPUs
bash mmdetection/tools/dist_train.sh mmdetection/m2doc_config/dino-4scale_w_m2doc_doclaynet.py 8

Inference

# for multi-gpu inference; arguments are: config, checkpoint, number of GPUs
bash mmdetection/tools/dist_test.sh mmdetection/m2doc_config/dino-4scale_w_m2doc_doclaynet.py work_dirs/dino-4scale_w_m2doc_r50_8xb2-12e_doclaynet/epoch_12.pth 8

Models

The download links of pre-trained M2Doc weights on DocLayNet are provided in the following table.

Name	Backbone	Epoch	mAP	BaiduNetDisk	GoogleDrive
Cascade Mask R-CNN	R50	12	84.6	link	link
Cascade Mask R-CNN	R101	36	85.9	link	link
DINO	R50	12	89.3	link	link
DINO	R101	36	89.5	link	link

Acknowledgement

MMDetection

DINO

VSR

Citation

If our paper helps your research, please cite it in your publications:

@inproceedings{zhang2024m2doc,
  title={M2Doc: A Multi-Modal Fusion Approach for Document Layout Analysis},
  author={Zhang, Ning and Cheng, Hiuyi and Chen, Jiayu and Jiang, Zongyuan and Huang, Jun and Xue, Yang and Jin, Lianwen},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={38},
  number={7},
  pages={7233--7241},
  year={2024}
}

Copyright

For commercial use, please contact Dr. Lianwen Jin: eelwjin@scut.edu.cn
