hinglishNorm - A Corpus of Hindi-English Code Mixed Sentences for Normalization

A Hindi-English Code-Mixed Dataset for Text Normalization

License

This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/4.0/.

Dataset Description

We are releasing our dataset for the normalization of Hindi-English code-mixed text in JSON format.

Each object in the released dataset has the fields shown in the following table:

Field	Description	Example
id	Unique identifier for each datapoint	30
inputText	Filtered & cleaned input text	whtas ur name
tags	Tags describing the transformations applied to inputText to obtain normalizedText	['Short Form', 'Short Form', 'Looks Good']
normalizedText	Manually annotated normalized inputText	what is your name
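
As a rough illustration of how the fields in the table might be read, here is a minimal Python sketch. It makes several assumptions that the description above does not confirm: that the release is a JSON array of objects, and that tags may be stored either as a JSON list or as a string holding a Python-style list (as the example row's quoting suggests). The inline SAMPLE string and the helper names parse_tags and load_records are hypothetical, and the sample simply mirrors the example row rather than quoting the actual file.

```python
import ast
import json

# Hypothetical sample mirroring the example row in the table above.
# The real file name and top-level layout are assumptions.
SAMPLE = """[
  {"id": 30,
   "inputText": "whtas ur name",
   "tags": "['Short Form', 'Short Form', 'Looks Good']",
   "normalizedText": "what is your name"}
]"""


def parse_tags(raw):
    """Return tags as a list, whether stored as a list or as a string."""
    if isinstance(raw, list):
        return raw
    # literal_eval safely parses a Python-style list literal such as
    # "['Short Form', 'Looks Good']" without executing arbitrary code.
    return ast.literal_eval(raw)


def load_records(text):
    """Parse JSON text into a list of dicts with tags normalized to lists."""
    records = json.loads(text)
    for record in records:
        record["tags"] = parse_tags(record["tags"])
    return records


records = load_records(SAMPLE)
for record in records:
    print(record["id"], record["inputText"], "->", record["normalizedText"])
    print("tags:", record["tags"])
```

To use this with the released data, the text passed to load_records would come from reading the dataset file instead of SAMPLE. If the file turns out to be laid out differently (for example, one JSON object per line), the loading step would need to change accordingly.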
