clovaai / donut

Official Implementation of OCR-free Document Understanding Transformer (Donut) and Synthetic Document Generator (SynthDoG), ECCV 2022

Home Page: https://arxiv.org/abs/2111.15664


Bounding boxes required for pretraining?

mustaszewski opened this issue · comments

Does the pre-training of Donut require bounding boxes for individual words? The synthetically generated SynthDoG dataset (https://huggingface.co/datasets/naver-clova-ix/synthdog-en), which was also used for Donut pre-training, contains no bounding boxes, so I assume that the visual corpus described in the paper also lacks bounding box coordinates.

I'm not one of the authors, but as far as I understand, Donut was pre-trained only on the generated OCR text in reading order, not on hOCR output, which would include bounding boxes. Models like UDOP, LiLT, or LayoutLM come to mind, which do pretty much what you describe during pre-training, and they get good results with that approach.
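To illustrate the point above: each sample's ground truth is a JSON string whose target is a plain text sequence, with no box coordinates anywhere. A minimal sketch of inspecting such a record (the sample string below is illustrative, shaped like a synthdog-en ground truth, not copied from the dataset):

```python
import json

# Illustrative ground-truth record in the synthdog-en style (assumed shape,
# not an actual dataset entry).
sample_ground_truth = '{"gt_parse": {"text_sequence": "Donut is an OCR-free model."}}'

record = json.loads(sample_ground_truth)

# The pre-training target is just text in reading order.
text = record["gt_parse"]["text_sequence"]

# No bounding-box fields are present in the parsed record.
has_boxes = any(key in record["gt_parse"] for key in ("bbox", "boxes", "words"))

print(text)       # the reading-order text sequence
print(has_boxes)  # False
```

Layout-aware models like LayoutLM instead consume word-level boxes alongside the text, which is why their pre-training data needs OCR output with coordinates.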