clovaai / donut

Official Implementation of OCR-free Document Understanding Transformer (Donut) and Synthetic Document Generator (SynthDoG), ECCV 2022

Home Page: https://arxiv.org/abs/2111.15664


Bounding boxes required for pretraining?

mustaszewski opened this issue · comments

Does the pre-training of Donut require bounding boxes for individual words? The synthetically generated SynthDoG dataset (https://huggingface.co/datasets/naver-clova-ix/synthdog-en), which was also used for Donut pre-training, contains no bounding boxes, so I assume that the visual corpus described in the paper also lacks bounding box coordinates.

I'm not one of the authors, but as far as I understand, Donut was pre-trained only on the generated OCR text in reading order, not on hOCR output, which would include bounding boxes. Models like UDOP, LiLT, or LayoutLM come to mind, which do pretty much what you describe during pre-training, and they get good results with that approach.
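To illustrate the point above: each sample's ground truth is a JSON string whose target is a plain text sequence, with no box coordinates anywhere. A minimal sketch of inspecting such a record (the sample string below is illustrative, shaped like a synthdog-en ground truth, not copied from the dataset):

```python
import json

# Illustrative ground-truth record in the synthdog-en style (assumed shape,
# not an actual dataset entry).
sample_ground_truth = '{"gt_parse": {"text_sequence": "Donut is an OCR-free model."}}'

record = json.loads(sample_ground_truth)

# The pre-training target is just text in reading order.
text = record["gt_parse"]["text_sequence"]

# No bounding-box fields are present in the parsed record.
has_boxes = any(key in record["gt_parse"] for key in ("bbox", "boxes", "words"))

print(text)       # the reading-order text sequence
print(has_boxes)  # False
```

Layout-aware models like LayoutLM instead consume word-level boxes alongside the text, which is why their pre-training data needs OCR output with coordinates.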