clovaai / donut

Official Implementation of OCR-free Document Understanding Transformer (Donut) and Synthetic Document Generator (SynthDoG), ECCV 2022

Home Page: https://arxiv.org/abs/2111.15664

Request: Dataset and pretrained model for language detection

turian opened this issue

MOTIVATION

Language detection from images is relatively difficult. Adobe and ABBYY OCR require that you already know the language of the document before you start OCR.

REQUEST

  • Please use your document generator to generate documents in different languages.
  • Ideally, you would even mix different languages within a single document.
  • Release a pretrained model that estimates the percentage of each language in a particular document image.
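
To make the last point concrete, here is a minimal sketch of how the training target could be derived when generating mixed-language documents: given the text segments placed on a page and their language tags, compute each language's share of the visible characters. The function name `language_proportions` and the `(language_code, text)` segment format are assumptions for illustration only; they are not part of SynthDoG or Donut.

```python
from collections import Counter


def language_proportions(segments):
    """Return each language's share of non-whitespace characters.

    `segments` is a hypothetical list of (language_code, text) pairs
    describing the text rendered onto one synthetic document image.
    """
    counts = Counter()
    for lang, text in segments:
        # Count only visible characters so spacing does not skew the label.
        counts[lang] += sum(1 for ch in text if not ch.isspace())
    total = sum(counts.values())
    if total == 0:
        # No visible text on the page: no language label to report.
        return {}
    return {lang: n / total for lang, n in counts.items()}


# Example page mixing English and Korean text (made-up content).
segments = [
    ("en", "Invoice total"),
    ("ko", "합계 금액"),
    ("en", "Thank you"),
]
print(language_proportions(segments))
```

Character count is only one possible measure; the share of rendered text area per language would be an equally reasonable label, depending on what "percentage of each language" should mean for the model.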