The paper is available at this link.
- Add training script and inference script for DINO_M2Doc.
- Add training script and inference script for other detectors.
- Add the data format samples for M2Doc.
- Add the dataset converting scripts.
- Release the Model-Zoo of M2Doc on DocLayNet.
- Python=3.8.0
- CUDA 10.2
- transformers
- MMDetection
- Download the datasets you need. Dataset download links:
- Convert the datasets' OCR annotations.
Use ocr_anno_convert.py to format and sort the dataset OCR annotations.
Three test samples can be found in Annos.
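To illustrate what "sorting" OCR annotations can mean, here is a minimal sketch that orders word boxes top to bottom and then left to right within each text line. It is not the logic of ocr_anno_convert.py: the function name `sort_ocr_words`, the `text`/`box` keys, and the `line_tol` threshold are all hypothetical, and the real annotation format should be taken from the samples in Annos.

```python
from typing import Dict, List


def sort_ocr_words(words: List[Dict], line_tol: float = 0.5) -> List[Dict]:
    """Return words in a simple reading order.

    Assumed (hypothetical) format: each word is a dict with
    "text" and "box" = [x0, y0, x1, y1] in pixel coordinates.
    """
    if not words:
        return []

    def y_center(word: Dict) -> float:
        return (word["box"][1] + word["box"][3]) / 2

    # First pass: order all words by their vertical center.
    by_y = sorted(words, key=y_center)

    # Second pass: group words into text lines. A word joins the current
    # line if its center is within line_tol * height of the line's first word.
    lines = [[by_y[0]]]
    for word in by_y[1:]:
        ref = lines[-1][0]
        ref_height = ref["box"][3] - ref["box"][1]
        if abs(y_center(word) - y_center(ref)) <= line_tol * ref_height:
            lines[-1].append(word)
        else:
            lines.append([word])

    # Third pass: order each line left to right and flatten.
    return [w for line in lines for w in sorted(line, key=lambda w: w["box"][0])]
```

This heuristic only handles single-column text; multi-column pages would need a layout-aware ordering.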
- Install the repository (we recommend using Anaconda for installation).
conda create -n m2doc python=3.8 -y
conda activate m2doc
conda install pytorch==1.8.1 torchvision==0.9.1 torchaudio==0.8.1 cudatoolkit=10.2 -c pytorch
git clone https://github.com/johnning2333/M2Doc.git
cd M2Doc/mmdetection
pip install -v -e .
pip install transformers
pip install mmengine
pip install -U openmim
mim install mmcv
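After the installation steps above, it can be useful to confirm which package versions actually ended up in the environment. The following is a small sketch using only the Python standard library. The helper name `installed_versions` is hypothetical, and the distribution names in the list (`torch`, `mmdet`, and so on) are assumptions about how the packages are registered with pip.

```python
from importlib import metadata
from typing import Dict, Optional, Sequence


def installed_versions(packages: Sequence[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if missing."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Assumed distribution names; adjust them if pip lists the packages differently.
    wanted = ["torch", "torchvision", "transformers", "mmengine", "mmcv", "mmdet"]
    for name, version in installed_versions(wanted).items():
        print(f"{name}: {version if version is not None else 'NOT INSTALLED'}")
```

Run it inside the activated `m2doc` environment; any package reported as not installed points to a step that needs to be repeated.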
- Train
# for multi-gpu training
bash mmdetection/tools/dist_train.sh mmdetection/m2doc_config/dino-4scale_w_m2doc_doclaynet.py 8
- Inference
# for multi-gpu inference
bash mmdetection/tools/dist_test.sh mmdetection/m2doc_config/dino-4scale_w_m2doc_doclaynet.py work_dirs/dino-4scale_w_m2doc_r50_8xb2-12e_doclaynet/epoch_12.pth 8
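The two commands above hard-code 8 GPUs. As a convenience, the sketch below builds the same command lines for a different GPU count. The helper `build_dist_command` is hypothetical and not part of the repository; it only assumes the argument order shown above (config, then GPU count for training; config, checkpoint, then GPU count for inference).

```python
import shlex
from typing import List, Optional


def build_dist_command(
    config: str, num_gpus: int, checkpoint: Optional[str] = None
) -> List[str]:
    """Build the training command, or the inference command if a checkpoint is given."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    if checkpoint is None:
        # Training: dist_train.sh CONFIG GPUS
        return ["bash", "mmdetection/tools/dist_train.sh", config, str(num_gpus)]
    # Inference: dist_test.sh CONFIG CHECKPOINT GPUS
    return ["bash", "mmdetection/tools/dist_test.sh", config, checkpoint, str(num_gpus)]


if __name__ == "__main__":
    config = "mmdetection/m2doc_config/dino-4scale_w_m2doc_doclaynet.py"
    # Print the command for review; pass the list to subprocess.run to execute it.
    print(shlex.join(build_dist_command(config, num_gpus=4)))
```

Note that training with fewer GPUs changes the effective batch size, so the learning rate in the config may need adjusting to reproduce the reported results.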
The download links of pre-trained M2Doc weights on DocLayNet are provided in the following table.
Name | Backbone | Epoch | mAP | BaiduNetDisk | GoogleDrive |
---|---|---|---|---|---|
Cascade Mask R-CNN | R50 | 12 | 84.6 | link | link |
Cascade Mask R-CNN | R101 | 36 | 85.9 | link | link |
DINO | R50 | 12 | 89.3 | link | link |
DINO | R101 | 36 | 89.5 | link | link |
If our paper helps your research, please cite it in your publications:
@inproceedings{zhang2024m2doc,
title={M2Doc: A Multi-Modal Fusion Approach for Document Layout Analysis},
author={Zhang, Ning and Cheng, Hiuyi and Chen, Jiayu and Jiang, Zongyuan and Huang, Jun and Xue, Yang and Jin, Lianwen},
booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
volume={38},
number={7},
pages={7233--7241},
year={2024}
}
For commercial use, please contact Dr. Lianwen Jin: eelwjin@scut.edu.cn
Copyright 2019, Deep Learning and Vision Computing Lab, South China University of Technology. http://www.dlvc-lab.net