anujgupta82 / hinglishNorm

A Hindi-English Dataset for Text Normalization

hinglishNorm - A Corpus of Hindi-English Code Mixed Sentences for Normalization

A Hindi-English Code-Mixed Dataset for Text Normalization

License

This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/4.0/.

Dataset Description

We are releasing our dataset for the normalization of Hindi-English code-mixed text in JSON format.

Each object in the released dataset has the following fields:

| Field | Description | Example |
| --- | --- | --- |
| id | Unique identifier for each datapoint | 30 |
| inputText | Filtered and cleaned input text | whtas ur name |
| tags | Tags describing the transformations that turn inputText into normalizedText | ['Short Form', 'Short Form', 'Looks Good'] |
| normalizedText | Manually annotated normalized form of inputText | what is your name |
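
As a rough illustration of how these fields might be consumed, here is a minimal Python sketch. It is not taken from the repository. It assumes the data is a JSON array of objects with the four fields above, and that `tags` holds one tag per whitespace-separated token of `inputText`, which is consistent with the example row but not confirmed by this description. The inline `SAMPLE` string and the helper names `load_records` and `token_tag_pairs` are hypothetical.

```python
import json

# Hypothetical inline sample mirroring the example row in the table above.
# Assumption: the released file is a JSON array of objects; the real file
# name and top-level layout may differ.
SAMPLE = """
[
  {
    "id": 30,
    "inputText": "whtas ur name",
    "tags": ["Short Form", "Short Form", "Looks Good"],
    "normalizedText": "what is your name"
  }
]
"""

REQUIRED_FIELDS = {"id", "inputText", "tags", "normalizedText"}


def load_records(raw_json):
    """Parse a JSON array of datapoints and check each has the expected fields."""
    records = json.loads(raw_json)
    for record in records:
        missing = REQUIRED_FIELDS - record.keys()
        if missing:
            raise ValueError(f"record is missing fields: {sorted(missing)}")
    return records


def token_tag_pairs(record):
    """Pair each input token with its tag.

    Assumption: exactly one tag per whitespace-separated token.
    """
    tokens = record["inputText"].split()
    tags = record["tags"]
    if len(tokens) != len(tags):
        raise ValueError("token and tag counts differ")
    return list(zip(tokens, tags))


records = load_records(SAMPLE)
for record in records:
    print(record["id"], token_tag_pairs(record), "->", record["normalizedText"])
```

To try this on the released data, replace `SAMPLE` with the contents of the dataset file, adjusting the parsing if its layout differs from the one assumed here.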
