indic-languages machine-translation multilingual-corpus

Boli-corpus-stats-website

Although speakers of Indian languages such as Hindi and Bengali make up a large percentage of the world's population, these languages are considered low-resource: only the IITB Hi-En corpus has more than 1 million parallel aligned sentences. In the PIB corpus, the largest publicly available multilingual training corpus for Indian languages (as of March 2021), most other language pairs do not even reach one lakh (100,000) parallel segments. Such a small amount of data is not enough for data-hungry NMT models. We therefore aimed to fill this gap and improve results for Indic machine translation by following the steps of the IITB corpus collection, surveying the publicly available datasets, and creating the Boli corpus.

The website for the corpus is hosted here. The scripts for the creation of the corpus can be found here.

By Kaivalya and Vedant, supervised by Prof. Parag Singla.

About

Scripts used to create an interactive website displaying the stats for Boli, the Indic multilingual training corpus developed by us.


MIT License

Languages

HTML 55.0%, Python 41.1%, CSS 3.4%, Shell 0.5%