sumeet-iitg / CS-TextNormalization

We build a pipeline that does spelling normalization over Code-Switched text


CS-TextNormalization

We build a pipeline to clean noisy code-switched text online.

Getting the repo

git clone --recursive https://github.com/sumeet-iitg/CS-TextNormalization.git

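-- Don't miss the '--recursive' flag; it pulls the required sub-modules. If you already cloned without it, running git submodule update --init --recursive inside the repo fetches them.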

Components of the Normalization Pipeline

  • DataManagement: This folder contains the abstractions that make up the pipeline. When you add a new implementation of a tool to the pipeline, make sure it follows one of the abstractions in this folder. Feel free to add new abstractions to this folder. Some of the abstractions are:
    languageUtils.py: Classes for language-specific identifiers, lexicons and spell checkers.
    dataloader.py: Classes for loading a corpus, mono-lingual or multi-lingual.
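
To make the idea of "following an abstraction" concrete, here is a minimal sketch of what a lexicon, a spell checker base class, and one implementation of it could look like. None of these names (Lexicon, SpellChecker, LexiconSpellChecker) are taken from languageUtils.py; they are hypothetical and only illustrate the pattern. The toy checker uses the standard library's difflib for fuzzy matching, which is an assumption made for the example, not necessarily what the pipeline uses.

```python
import difflib
from abc import ABC, abstractmethod
from typing import Iterable


class Lexicon:
    """Hypothetical word list for a single language."""

    def __init__(self, language: str, words: Iterable[str]):
        self.language = language
        self.words = {w.lower() for w in words}

    def __contains__(self, token: str) -> bool:
        return token.lower() in self.words


class SpellChecker(ABC):
    """Hypothetical abstraction: every spell checker exposes correct()."""

    @abstractmethod
    def correct(self, token: str) -> str:
        """Return the normalized form of token."""


class LexiconSpellChecker(SpellChecker):
    """Toy implementation: map a token to its closest lexicon entry."""

    def __init__(self, lexicon: Lexicon, cutoff: float = 0.8):
        self.lexicon = lexicon
        self.cutoff = cutoff

    def correct(self, token: str) -> str:
        if token in self.lexicon:
            return token
        # Sort the candidates so ties are broken deterministically.
        matches = difflib.get_close_matches(
            token.lower(), sorted(self.lexicon.words), n=1, cutoff=self.cutoff
        )
        # Leave the token unchanged when nothing is close enough.
        return matches[0] if matches else token


english = Lexicon("english", ["hello", "world"])
checker = LexiconSpellChecker(english)
print(checker.correct("helo"))
```

A new tool written this way only has to subclass the base class and implement correct(); the rest of the pipeline can then call it without knowing which implementation it received.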

Requirements

Usage

You can use this pipeline end to end, or run the individual components within it. To run it end to end:

python main.py "source_tanglish.txt" "english,telugu"
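
The command above passes two positional arguments: a path to the input text and a comma-separated list of languages. The sketch below shows one way such arguments could be parsed. It is an assumption for illustration only; the function name parse_pipeline_args and the argument names corpus_path and languages are hypothetical and do not come from main.py.

```python
import argparse
from typing import List, Sequence, Tuple


def parse_pipeline_args(argv: Sequence[str]) -> Tuple[str, List[str]]:
    """Hypothetical parser for: python main.py <corpus_path> <lang1,lang2,...>"""
    parser = argparse.ArgumentParser(
        description="Spelling normalization over code-switched text (sketch)."
    )
    parser.add_argument("corpus_path", help="path to the noisy input text file")
    parser.add_argument("languages", help="comma-separated languages in the text")
    args = parser.parse_args(argv)
    # Split the language list and drop empty entries such as a trailing comma.
    languages = [
        name.strip().lower() for name in args.languages.split(",") if name.strip()
    ]
    return args.corpus_path, languages


corpus_path, languages = parse_pipeline_args(["source_tanglish.txt", "english,telugu"])
print(corpus_path, languages)
```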


