piyushmakhija5/hinglishNorm

hinglishNorm - A Corpus of Hindi-English Code Mixed Sentences for Normalization

License

by-nc-sa

This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/4.0/.

Dataset Description

We are releasing our dataset for the normalization of Hindi-English code-mixed text in JSON format.

Each object in the released dataset has the fields shown in the following table:

| Field | Description | Example |
| --- | --- | --- |
| id | Unique identifier for each datapoint | 30 |
| inputText | Filtered and cleaned input text | whtas ur name |
| tags | Tags describing the transformations that are applied to inputText to obtain normalizedText | ['Short Form', 'Short Form', 'Looks Good'] |
| normalizedText | Manually annotated normalized form of inputText | what is your name |
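
As a rough illustration of how these fields relate, the sketch below parses one record and pairs each input token with its tag. The record is built from the example column of the table above. Two things are assumptions and are not stated in this README: that the released file is a JSON array of such objects, and that `tags` holds exactly one tag per whitespace-separated token of `inputText` (this holds for the example shown). The helper `pair_tokens_with_tags` is a hypothetical name, not part of the dataset release.

```python
import json

# Sample record built from the example column of the table above.
# Assumption: the released file is a JSON array of such objects; the
# actual top-level layout and the file name are not stated in this README.
SAMPLE_JSON = """
[
  {
    "id": 30,
    "inputText": "whtas ur name",
    "tags": ["Short Form", "Short Form", "Looks Good"],
    "normalizedText": "what is your name"
  }
]
"""


def pair_tokens_with_tags(record):
    """Pair each whitespace-separated input token with its tag.

    Assumes one tag per input token, which holds for the README example
    but is not guaranteed by the README for every record.
    """
    tokens = record["inputText"].split()
    tags = record["tags"]
    if len(tokens) != len(tags):
        raise ValueError(
            f"record {record['id']}: {len(tokens)} tokens but {len(tags)} tags"
        )
    return list(zip(tokens, tags))


records = json.loads(SAMPLE_JSON)
for record in records:
    for token, tag in pair_tokens_with_tags(record):
        print(f"{token!r:10} {tag}")
    print("normalized:", record["normalizedText"])
```

To run this against the released data, replace `SAMPLE_JSON` with the contents of the dataset file. Records where the token and tag counts differ raise a `ValueError`, which makes it easy to see whether the one-tag-per-token assumption holds across the corpus.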
