Repository from GitHub: https://github.com/johnning2333/M2Doc

[AAAI2024] M2Doc: A Multi-Modal Fusion Approach for Document Layout Analysis

The paper is available at this link.

🚧 TODO List

  • Add training script and inference script for DINO_M2Doc.
  • Add training script and inference script for other detectors.
  • Add the data format samples for M2Doc.
  • Add the dataset converting scripts.
  • Release the Model-Zoo of M2Doc on DocLayNet.

Installation

  • Python=3.8.0
  • CUDA 10.2
  • transformers
  • MMDetection

Dataset Preparation

  1. Download the datasets you need. Dataset download links:
  2. Convert the datasets' OCR annotations.
    Use ocr_anno_convert.py to format and sort the dataset OCR annotations.
    Three test samples can be found in Annos.
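The exact annotation format handled by ocr_anno_convert.py is not described here. Purely as an illustration of what sorting OCR annotations can involve, the sketch below orders words top-to-bottom and then left-to-right. The entry layout (`text` plus a `box` of `[x0, y0, x1, y1]`), the `line_tol` parameter, and the function name are all assumptions, not the repository's actual format or logic.

```python
from typing import Dict, List


def sort_reading_order(words: List[Dict], line_tol: float = 5.0) -> List[Dict]:
    """Sort OCR words top-to-bottom, then left-to-right within each line.

    Assumed (hypothetical) entry format: {"text": str, "box": [x0, y0, x1, y1]}.
    """
    # Sort by the top edge so that words can be grouped into lines.
    by_top = sorted(words, key=lambda w: w["box"][1])

    lines: List[List[Dict]] = []
    for word in by_top:
        # A word joins the current line if its top edge is within line_tol
        # of the first word on that line; otherwise it starts a new line.
        if lines and abs(word["box"][1] - lines[-1][0]["box"][1]) <= line_tol:
            lines[-1].append(word)
        else:
            lines.append([word])

    ordered: List[Dict] = []
    for line in lines:
        # Within a line, order words by their left edge.
        ordered.extend(sorted(line, key=lambda w: w["box"][0]))
    return ordered
```

A fixed tolerance like this is a simplification; multi-column pages usually need column detection before line grouping.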

Train and Inference Steps

  1. Install the repository (we recommend using Anaconda for installation).
conda create -n m2doc python=3.8 -y
conda activate m2doc
conda install pytorch==1.8.1 torchvision==0.9.1 torchaudio==0.8.1 cudatoolkit=10.2 -c pytorch
git clone https://github.com/johnning2333/M2Doc.git
cd M2Doc/mmdetection
pip install -v -e .
pip install transformers
pip install mmengine
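# the mim command is provided by the openmim package
pip install -U openmim
mim install mmcv
# return to the M2Doc root; the train and inference commands use paths relative to it
cd ..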
  2. Train
# for multi-GPU training (the trailing 8 is the number of GPUs)
bash mmdetection/tools/dist_train.sh mmdetection/m2doc_config/dino-4scale_w_m2doc_doclaynet.py 8
  3. Inference
# for multi-GPU inference (arguments: config, checkpoint, number of GPUs)
bash mmdetection/tools/dist_test.sh mmdetection/m2doc_config/dino-4scale_w_m2doc_doclaynet.py work_dirs/dino-4scale_w_m2doc_r50_8xb2-12e_doclaynet/epoch_12.pth 8
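When training has produced several `epoch_N.pth` files, it can be handy to pick the newest one and assemble the inference command programmatically. The sketch below assumes checkpoints follow the `epoch_N.pth` naming seen in the command above and that the script is run from the M2Doc root. Both helper names are hypothetical and not part of this repository.

```python
import re
from pathlib import Path
from typing import List, Optional


def latest_checkpoint(work_dir: str) -> Optional[Path]:
    """Return the epoch_N.pth file with the largest N, or None if there is none."""
    best: Optional[Path] = None
    best_epoch = -1
    for path in Path(work_dir).glob("epoch_*.pth"):
        match = re.fullmatch(r"epoch_(\d+)\.pth", path.name)
        # Compare epochs numerically so that epoch_12 ranks above epoch_2.
        if match and int(match.group(1)) > best_epoch:
            best, best_epoch = path, int(match.group(1))
    return best


def dist_test_command(config: str, checkpoint: str, gpus: int) -> List[str]:
    """Build the argument list for dist_test.sh: CONFIG CHECKPOINT GPUS."""
    return ["bash", "mmdetection/tools/dist_test.sh", config, checkpoint, str(gpus)]
```

The returned list can be passed to `subprocess.run(cmd, check=True)` to launch inference.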

Models

Download links for the M2Doc weights trained on DocLayNet are provided in the following table.

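| Name               | Backbone | Epochs | mAP  | BaiduNetDisk | GoogleDrive |
|--------------------|----------|--------|------|--------------|-------------|
| Cascade Mask R-CNN | R50      | 12     | 84.6 | link         | link        |
| Cascade Mask R-CNN | R101     | 36     | 85.9 | link         | link        |
| DINO               | R50      | 12     | 89.3 | link         | link        |
| DINO               | R101     | 36     | 89.5 | link         | link        |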

Acknowledgement

MMDetection

DINO

VSR

Citation

If our paper helps your research, please cite it in your publications:

@inproceedings{zhang2024m2doc,
  title={M2Doc: A Multi-Modal Fusion Approach for Document Layout Analysis},
  author={Zhang, Ning and Cheng, Hiuyi and Chen, Jiayu and Jiang, Zongyuan and Huang, Jun and Xue, Yang and Jin, Lianwen},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={38},
  number={7},
  pages={7233--7241},
  year={2024}
}

Copyright

For commercial use, please contact Dr. Lianwen Jin: eelwjin@scut.edu.cn

Copyright 2019, Deep Learning and Vision Computing Lab, South China University of Technology. http://www.dlvc-lab.net
