Colab notebook: https://colab.research.google.com/drive/1hKu8q2SH80baCj-0IRBb9rLDSgBaU1w7#scrollTo=C9v0iNYVJO6Y
Follow these steps to set up the environment and install the required dependencies using conda.
- Python 3.9
- PyTorch (GPU version)
- PaddleOCR
- Clone the repository:
git clone git@github.com:LAION-AI/OCR-ensemble.git
cd OCR-ensemble
- Create a conda virtual environment (optional, but recommended):
conda create -n your-env-name python=3.9
conda activate your-env-name
- Install PyTorch (GPU version) by following the instructions on the official website, choosing the command that matches your system and CUDA version. For example, with pip and CUDA 11.7:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117
- Install PaddlePaddle by following the instructions in the official GitHub repository. To install the GPU version, one of the following may help (the second is a Windows wheel built against CUDA 11.7):
python -m pip install paddlepaddle-gpu -i https://pypi.tuna.tsinghua.edu.cn/simple
python -m pip install paddlepaddle-gpu==2.4.2.post117 -f https://www.paddlepaddle.org.cn/whl/windows/mkl/avx/stable.html
- Install the remaining required packages from the requirements.txt file:
pip install -r requirements.txt
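
Once everything is installed, a quick sanity check (a minimal sketch, not part of the repository) can confirm that the GPU builds of PyTorch and PaddlePaddle are both visible:

```python
# check_gpu.py -- illustrative sanity check; adjust to your setup
import torch
import paddle

# PyTorch: should report True and the name of your CUDA device
print("torch CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("torch device:", torch.cuda.get_device_name(0))

# PaddlePaddle: reports CUDA support and runs Paddle's built-in install check
print("paddle compiled with CUDA:", paddle.device.is_compiled_with_cuda())
paddle.utils.run_check()
```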
- Classify each document image by the type of text it contains
- Use an expert from the ensemble of existing OCR + layout-parsing models to get the text and bounding boxes, then concatenate that text to the image's caption (see the PaddleOCR sketch after this list)
- If there is no original caption, as for screenshots of websites and books, generate a caption and concatenate it with the OCR results
- Use this dataset to train CLIP with character-level tokenization
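
A minimal sketch of the second step above, assuming PaddleOCR as the expert and a plain string caption coming from the dataset (the file name and caption are placeholders, not the repository's actual interface):

```python
from paddleocr import PaddleOCR

# Hypothetical inputs: an image and its original caption from the dataset
image_path = "example.jpg"
caption = "a scanned page from a cookbook"

# Run the printed-text expert; returns bounding boxes + recognized text
ocr = PaddleOCR(use_angle_cls=True, lang="en")
result = ocr.ocr(image_path, cls=True)

# result[0] is a list of [bounding_box, (text, confidence)] entries
lines = [text for _, (text, _) in (result[0] or [])]

# Concatenate the recognized text to the caption for CLIP training data
augmented_caption = caption + " | " + " ".join(lines)
print(augmented_caption)
```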
Now we are working on Step 2.
- Classify images to determine text types
- Expert models process the filtered images (a rough routing sketch follows the list of experts below)
- [Printed Document] Machine-printed text: https://huggingface.co/naver-clova-ocr/bros-large-uncased , https://huggingface.co/microsoft/layoutlmv3-large ; [multilingual] https://github.com/PaddlePaddle/PaddleOCR
- [Handwritten] Handwritten text [implemented]: https://huggingface.co/microsoft/trocr-large-handwritten
- [Handwritten] Handwritten math [implemented]: https://huggingface.co/Azu/trocr-handwritten-math
- [Printed Document, LaTeX formula] LaTeX expert [implemented]: https://colab.research.google.com/drive/1TO10E5fa9KeVyHQBhQQP3VESeigRTcsG?usp=sharing
- CLIP language detector, limited functionality [implemented]: https://colab.research.google.com/drive/16XU0v8JEeolQ4uK8XL0EbmZhLWLpF0ti?usp=sharing
- CLIP text detector, simply detects text or no text in images [implemented]: https://colab.research.google.com/drive/1M66t-lnd0QT-opdGS4Bc_zkdzdWAjYCa?usp=sharing
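
A rough sketch of how the classifier and an expert could be chained, using an off-the-shelf Hugging Face CLIP checkpoint for zero-shot text-type routing and the TrOCR handwritten expert listed above (the CLIP checkpoint, prompt strings, and routing rule are illustrative assumptions, not the project's tuned setup):

```python
from PIL import Image
import torch
from transformers import (CLIPModel, CLIPProcessor,
                          TrOCRProcessor, VisionEncoderDecoderModel)

image = Image.open("example.jpg").convert("RGB")

# Step 1: zero-shot classification of the text type with CLIP
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
labels = [
    "a photo of machine-printed text",
    "a photo of handwritten text",
    "a photo with no text",
]
inputs = clip_processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    probs = clip_model(**inputs).logits_per_image.softmax(dim=-1)[0]
text_type = labels[int(probs.argmax())]
print("predicted text type:", text_type)

# Step 2: route handwritten images to the TrOCR handwritten expert
if text_type == "a photo of handwritten text":
    trocr_processor = TrOCRProcessor.from_pretrained("microsoft/trocr-large-handwritten")
    trocr = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-large-handwritten")
    pixel_values = trocr_processor(images=image, return_tensors="pt").pixel_values
    generated_ids = trocr.generate(pixel_values)
    text = trocr_processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
    print("handwritten text:", text)
```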