project-anuvaad / anuvaad-ocr-corpus

Anuvaad OCR Corpus

This repository contains links to corpora for popular Indian languages, developed as part of the Anuvaad project.

Please reach out to nlp-nmt@tarento.com for any clarification/interpretation/usage of the linked datasets.

This work is licensed under a Creative Commons Attribution 4.0 International License.

Goal

The goal is to build a high-quality corpus, extracted from PDFs, for Indian languages across various domains (General, Legal, Education, Healthcare, Automobile, News, etc.). The corpus can eventually be used to train ML models for specific use cases.

Read more about Anuvaad at http://anuvaad.org/

Links

English

Domain | Source | Sentence count | Corpus Download Link
Educational | NCERT | 2,03,000 | Class-1, Class-2, Class-3, Class-4, Class-5, Class-6, Class-7, Class-8, Class-9, Class-10, Class-11, Class-12
Educational | ebalbharti | 90,900 | Class-1, Class-2, Class-3, Class-4, Class-5, Class-6, Class-7, Class-8, Class-9, Class-10, Class-11, Class-12
Educational | NIOS-Diploma | 32,100 | All
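
The sentence counts in these tables use Indian digit grouping (for example, 2,03,000 is 203,000). As a minimal sketch, assuming you only want to total the counts listed in the English table above, the following Python snippet strips the separators and sums them. The helper name `parse_count` and the dictionary are illustrative and are not part of this repository.

```python
def parse_count(text: str) -> int:
    """Convert a comma-grouped count such as '2,03,000' to an int."""
    # Removing the commas works for both Indian and Western grouping.
    return int(text.replace(",", ""))


# Sentence counts copied from the English table (source -> count as written).
english_counts = {
    "NCERT": "2,03,000",
    "ebalbharti": "90,900",
    "NIOS-Diploma": "32,100",
}

english_total = sum(parse_count(count) for count in english_counts.values())
print(english_total)  # prints 326000
```

The same helper can be reused for the other language tables by swapping in their counts.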

Hindi

Domain | Source | Sentence count | Corpus Download Link
Educational | NCERT | 2,19,000 | Class-1, Class-2, Class-3, Class-4, Class-5, Class-6, Class-7, Class-8, Class-9, Class-10, Class-11, Class-12
Educational | ebalbharti | 61,500 | Class-1, Class-2, Class-3, Class-4, Class-5, Class-6, Class-7, Class-8, Class-9, Class-10, Class-11, Class-12
Educational | NIOS-Diploma | 31,500 | All

Bengali

Domain | Source | Sentence count | Corpus Download Link
Educational | ebalbharti | 17,200 | Class-1, Class-2, Class-3, Class-4, Class-5, Class-11, Class-12
Educational | NIOS-Diploma | 29,800 | All

Tamil

Domain | Source | Sentence count | Corpus Download Link
Educational | ebalbharti | 10,800 | Class-1, Class-2, Class-3, Class-4
Educational | NIOS-Diploma | 31,700 | All

Malayalam

Domain | Source | Sentence count | Corpus Download Link

Telugu

Domain | Source | Sentence count | Corpus Download Link
Educational | ebalbharti | 69,600 | Class-1, Class-2, Class-3, Class-4, Class-5, Class-6, Class-7, Class-8, Class-9, Class-10, Class-11, Class-12
Educational | NIOS-Diploma | 29,800 | All

Kannada

Domain | Source | Sentence count | Corpus Download Link
Educational | ebalbharti | 61,600 | Class-1, Class-2, Class-3, Class-4, Class-5, Class-6, Class-7, Class-8, Class-9, Class-10, Class-11, Class-12
Educational | NIOS-Diploma | 27,200 | All

Marathi

Domain | Source | Sentence count | Corpus Download Link
Educational | ebalbharti | 68,900 | Class-1, Class-2, Class-3, Class-4, Class-5, Class-6, Class-7, Class-8, Class-9, Class-10, Class-11, Class-12
Educational | NIOS-Diploma | 26,100 | All

Punjabi

Domain | Source | Sentence count | Corpus Download Link
Educational | NIOS-Diploma | 18,900 | All

Gujarati

Domain | Source | Sentence count | Corpus Download Link
Educational | ebalbharti | 63,600 | Class-1, Class-2, Class-3, Class-4, Class-5, Class-6, Class-7, Class-8, Class-9, Class-10, Class-11, Class-12
Educational | NIOS-Diploma | 36,400 | All

Assamese

Domain | Source | Sentence count | Corpus Download Link
Educational | NIOS-Diploma | 27,400 | All

Urdu

Domain | Source | Sentence count | Corpus Download Link

Odia

Domain | Source | Sentence count | Corpus Download Link
Educational | NIOS-Diploma | 27,400 | All
