LSMeetei / MnMultimodal

Multimodal Dataset of low-resource languages for Machine Translation and other NLP Applications

This repo houses a collection of Manipuri multimodal datasets for Natural Language Processing (NLP) applications.

The dataset consists of the following tuples:

  • Image
  • Caption in English
  • Translation of the English text into Manipuri, Bengali, Hindi and German.
  • Audio recording of the Manipuri text by native speakers.
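
As a rough illustration of the tuple structure above, one entry could be modeled in Python as follows. The class name, field names, language codes, and file paths are hypothetical and are not taken from the repository; the actual file layout may differ.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical record type: the field names are illustrative assumptions,
# not the repository's actual schema.
@dataclass
class MultimodalEntry:
    image_path: str                      # path to the image file
    caption_en: str                      # English caption
    # Translations keyed by language code (mni = Manipuri, bn = Bengali,
    # hi = Hindi, de = German).
    translations: Dict[str, str] = field(default_factory=dict)
    audio_path_mni: str = ""             # recording of the Manipuri text

    def missing_languages(self, expected=("mni", "bn", "hi", "de")) -> List[str]:
        """Return the expected language codes that have no translation yet."""
        return [code for code in expected if not self.translations.get(code)]


# Placeholder values only; no real dataset content is shown here.
entry = MultimodalEntry(
    image_path="images/0001.jpg",
    caption_en="A placeholder English caption.",
    translations={"mni": "<Manipuri text>", "bn": "<Bengali text>"},
    audio_path_mni="audio/0001.wav",
)
print(entry.missing_languages())
```

A check like `missing_languages` is one simple way to spot incomplete tuples before using the data for training or evaluation.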

Version 1:

The process begins by collecting English text and images from the local newspaper Imphal Free Press using our in-house web scraper. Subsequently, the English text undergoes manual translation into Manipuri, followed by machine translation into Bengali, Hindi, and German.

Version 2:

Version 1, plus Manipuri text and images collected from the local newspaper Huiyen Lanpao. The Manipuri text is manually translated into English, and the translated English text is then machine-translated into Bengali, Hindi, and German.

This comprehensive approach allows us to leverage both human expertise and automated translation technologies to facilitate multilingual access to the content.

Translation approach:

English to Manipuri and Manipuri to English: Manual Translation + Manual Post-editing

English to Bengali and Hindi: Indic-Trans

English to German: DeepL
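
The language-pair mapping above can be written down as a small lookup table, for example to tag each translation with its provenance. This is only a sketch: the dictionary, function names, and language codes are assumptions made for illustration and are not part of the repository.

```python
# Translation method per (source, target) language pair, as listed above.
# Language codes are an assumed convention: mni = Manipuri, bn = Bengali,
# hi = Hindi, de = German.
TRANSLATION_METHOD = {
    ("en", "mni"): "manual translation + manual post-editing",
    ("mni", "en"): "manual translation + manual post-editing",
    ("en", "bn"): "Indic-Trans",
    ("en", "hi"): "Indic-Trans",
    ("en", "de"): "DeepL",
}


def method_for(source: str, target: str) -> str:
    """Look up how a language pair was translated; raises KeyError if unlisted."""
    return TRANSLATION_METHOD[(source, target)]


def is_machine_translated(source: str, target: str) -> bool:
    """True when the pair was produced by a machine translation system."""
    return not method_for(source, target).startswith("manual")


print(method_for("en", "de"))
print(is_machine_translated("en", "mni"))
```

Keeping human and machine translations distinguishable in this way can matter downstream, since machine-translated text is usually treated differently from human references in evaluation.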

A sample dataset is provided for reference.

Please fill out this form to request access to the data and other supplementary files.