crawlers low-resource-machine-translation ocr

OCR for Low-Resource Indian Languages

This repository was contributed as part of Samanantar. Please note that it is not actively maintained.

Directory Structure

The pipeline has four parts:

Crawler (crawling, article matching, metadata extraction, and yield estimation for sources)
Preprocessing (cleaning, language detection, punctuation normalization, tokenization, and sentence splitting)
OCR (Tesseract-based OCR, plus a direct pipeline for sourcing URLs and extracting data)
Alignment (alignment with BleuAlign, HunAlign, and LaBSE)
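To make the preprocessing stage more concrete, here is a minimal sketch of punctuation normalization followed by sentence splitting. It is not taken from the notebooks in this repo: the names `PUNCT_MAP`, `normalize`, and `split_sentences` are hypothetical, and the rule of splitting after a danda (।), period, question mark, or exclamation mark is an assumption that a real splitter would refine.

```python
import re
import unicodedata
from typing import List

# Hypothetical mapping of typographic quotes to plain ASCII quotes.
PUNCT_MAP = {
    "\u201c": '"', "\u201d": '"',
    "\u2018": "'", "\u2019": "'",
}

# Split on whitespace that follows a danda (U+0964), '.', '?' or '!'.
_SENT_END = re.compile(r"(?<=[\u0964.?!])\s+")


def normalize(text: str) -> str:
    """Apply NFC normalization, map quote variants, collapse whitespace."""
    text = unicodedata.normalize("NFC", text)
    text = "".join(PUNCT_MAP.get(ch, ch) for ch in text)
    return re.sub(r"\s+", " ", text).strip()


def split_sentences(text: str) -> List[str]:
    """Return the sentences of `text`, keeping terminal punctuation."""
    return [s for s in _SENT_END.split(normalize(text)) if s]


if __name__ == "__main__":
    for sentence in split_sentences("यह पहला वाक्य है।   यह दूसरा है। Is this third?"):
        print(sentence)
```

A rule this simple will also split after abbreviations such as "Dr.", so it should be read as a starting point rather than a drop-in splitter.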

├── OCR
│   ├── pdf_ocr_reader.py
│   ├── url_pdf_ocr.ipynb
│   └── url_to_ocr.ipynb
├── README.md
├── aligner
│   ├── LaBSEAligner.ipynb
│   └── LaBSE_PDF_aligner.py
├── crawler
│   ├── PDFCrawler.ipynb
│   ├── PDFSourceNameScraper.ipynb
│   ├── PDFSourceNameScraper_Interleaved.ipynb
│   ├── PDFSourceNameScraper_Parallel.ipynb
│   ├── act_aligner.py
│   ├── act_matcher.py
│   ├── crawler.py
│   ├── sm_concatenate_files.py
│   ├── url_crawler.py
│   ├── visionocr.py
│   ├── visionocr_jsontotxt.py
│   └── yield_comparison.ipynb
└── preprocessing
    ├── SentenceSplitter.ipynb
    ├── SentenceSplittingPreprocessedDocuments.ipynb
    ├── indicpostprocessing.py
    ├── json_to_text.ipynb
    ├── postprocessing.py
    └── summary_generator.ipynb
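
The aligner directory pairs sentences using LaBSE embeddings. The sketch below illustrates one possible way to turn such embeddings into sentence pairs: greedy one-to-one matching by cosine similarity with a cutoff. It is an assumption for illustration, not the method implemented in LaBSE_PDF_aligner.py. The function names and the 0.8 threshold are hypothetical, and the input vectors stand in for embeddings that a LaBSE model would produce.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def align(
    src_vecs: Sequence[Sequence[float]],
    tgt_vecs: Sequence[Sequence[float]],
    threshold: float = 0.8,  # hypothetical cutoff, tune per language pair
) -> List[Tuple[int, int, float]]:
    """Greedily pair source and target sentences by descending similarity.

    Each sentence is used at most once; pairs scoring below `threshold`
    are dropped. Returns (src_index, tgt_index, score) sorted by src_index.
    """
    scored = sorted(
        (
            (cosine(u, v), i, j)
            for i, u in enumerate(src_vecs)
            for j, v in enumerate(tgt_vecs)
        ),
        reverse=True,
    )
    used_src, used_tgt, pairs = set(), set(), []
    for score, i, j in scored:
        if score < threshold:
            break  # remaining candidates score even lower
        if i in used_src or j in used_tgt:
            continue
        used_src.add(i)
        used_tgt.add(j)
        pairs.append((i, j, score))
    return sorted(pairs)


if __name__ == "__main__":
    # Toy 2-D vectors in place of real sentence embeddings.
    src = [[1.0, 0.0], [0.0, 1.0]]
    tgt = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
    for i, j, score in align(src, tgt):
        print(i, j, round(score, 3))
```

Scoring every source sentence against every target sentence is quadratic in document length, which is usually acceptable within a single document pair but would need an approximate nearest-neighbour index for corpus-level mining.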
