tarun2001sharma / CMKT


CMKT: Code-Mixed toolKiT

CMKT is a wrapper library for processing code-mixed text. It bundles dataset access, preprocessing, common NLP tasks, and evaluation metrics in a single toolkit.

Installation

git clone https://github.com/lingo-iitgn/CMKT.git
cd CMKT
pip install -r "requirements.txt"

Getting Started

Documentation. This page will be updated with more details soon.

How to use this library:
Refer to the demo files for toolkit usage, or to the detailed Google Colab notebook.

Modules:

Four modules are available:

  • Data Acquisition Module
  • Preprocessing Module
  • Tasks Module
  • Metrics Module

Data Acquisition Module

This module loads and downloads datasets in various formats from external and local resources. It also offers a curated collection of 15 datasets covering different NLP tasks for Hindi-English code-mixed text. Supported dataset file formats: pickle, json, txt, csv, and conll.
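To make the format list concrete, here is a minimal sketch of loading a local file in each supported format using only the Python standard library. The names `load_local_dataset` and `read_conll` are hypothetical and are not part of the cmkt API; the CoNLL reader assumes one whitespace-separated token line per row with blank lines between sentences.

```python
import csv
import json
import pickle
from pathlib import Path


def read_conll(path):
    """Parse a CoNLL-style file: one token per line, blank line between sentences."""
    sentences, current = [], []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                # A blank line closes the current sentence, if any.
                if current:
                    sentences.append(current)
                    current = []
                continue
            # Each row becomes a tuple such as (token, tag).
            current.append(tuple(line.split()))
    if current:
        sentences.append(current)
    return sentences


def load_local_dataset(path):
    """Hypothetical helper: choose a reader from the file extension."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix in {".pkl", ".pickle"}:
        with open(path, "rb") as handle:
            return pickle.load(handle)
    if suffix == ".json":
        with open(path, encoding="utf-8") as handle:
            return json.load(handle)
    if suffix == ".csv":
        with open(path, encoding="utf-8", newline="") as handle:
            return list(csv.DictReader(handle))
    if suffix == ".txt":
        with open(path, encoding="utf-8") as handle:
            return [line.rstrip("\n") for line in handle]
    if suffix == ".conll":
        return read_conll(path)
    raise ValueError(f"Unsupported dataset format: {suffix}")
```

Dispatching on the extension keeps the call site uniform across formats. Only unpickle files from sources you trust, since `pickle.load` can execute arbitrary code.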

Datasets are available in the cmkt datahub for the following tasks. Use the specified task names to search for datasets in the cmkt datahub:

["lid", "ner", "pos", "machine translation", "sentiment analysis", "hate speech detection", "irony detection", "humor detection", "sarcasm detection"]

Preprocessing Module

The text preprocessing module offers tokenization and stemming designed specifically for code-mixed text, so that code-mixed data can be prepared efficiently for downstream tasks such as NLP analysis and model training.

Tokenization

Tokenization in cmkt: Breaking Text into Meaningful Units.
The text preprocessing module includes tokenization techniques at the sentence, word, and subword levels, along with stemming methods for English, Hindi, and Hindi-English mixed-script text. The following tokenizers are available in cmkt:

  • Word Tokenizer
  • Sentence Tokenizer
  • SentencePiece Tokenizer

The tokenizers are currently available for English, Hindi, and English-Hindi mixed-script text.
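As an illustration of word-level and sentence-level tokenization on mixed-script text, the following is a minimal regex-based sketch. It is not the cmkt implementation: the function names `word_tokenize` and `sentence_tokenize` are hypothetical, and the rules are an assumption (Devanagari runs, ASCII Latin words, digit runs, and single punctuation marks as tokens; sentences end at `.`, `!`, `?`, or the danda `।`).

```python
import re

# Token alternatives, tried in order:
#   1. a run of Devanagari characters (U+0900 to U+097F, excluding the
#      danda and double danda at U+0964/U+0965, which are punctuation)
#   2. an ASCII Latin word, optionally with one internal apostrophe
#   3. a run of digits
#   4. any single character that is neither a word character nor whitespace
TOKEN_RE = re.compile(
    r"[\u0900-\u0963\u0966-\u097F]+|[A-Za-z]+(?:'[A-Za-z]+)?|\d+|[^\w\s]"
)

# A sentence boundary is whitespace preceded by ., !, ? or the danda.
SENTENCE_BOUNDARY_RE = re.compile(r"(?<=[.!?\u0964])\s+")


def word_tokenize(text):
    """Split mixed Devanagari/Latin text into word and punctuation tokens."""
    return TOKEN_RE.findall(text)


def sentence_tokenize(text):
    """Split text into sentences at terminal punctuation followed by whitespace."""
    text = text.strip()
    return SENTENCE_BOUNDARY_RE.split(text) if text else []
```

This sketch silently skips characters outside the listed classes (for example accented Latin letters or underscores), and it does not handle abbreviations such as "Dr." at sentence boundaries. A production tokenizer would need broader coverage.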

Stemming

Stemming in CMKT: Reducing Words to their Base Form
Stemming is an essential part of code-mixed text processing: it reduces words to their base or root form. CMKT provides a range of stemmers designed for different languages and language combinations.
The following stemmers are available in cmkt:

  • English Stemmer
  • Hindi Stemmer
  • Hindi-English mixed Stemmer

Tasks Module

This module provides elementary NLP tasks such as NER, POS tagging, and LID for code-mixed text. It also provides functions to search for the tasks and models available in cmkt. The hierarchy of the task module is defined below.
Task types available in cmkt: "syntactic", "semantic", and "generational"
TaskToolkit (language-specific)

syntactic tasks

  • lid
  • ner
  • pos

semantic tasks

  • sentiment analysis
  • hate speech detection
  • humor detection

generational tasks

  • machine translation
Metrics Module

The metrics module provides a range of evaluation metrics, both for quantifying the degree of code-mixing in text and for assessing the performance of NLP tasks such as classification and machine translation.

Available code-mixed metrics:

  • cmi (Code-Mixing Index)
  • m-index (Multilingual Index)
  • i-index (Integration Index)
  • burstiness

Other common metrics available: accuracy, precision, recall, F-measure, BLEU score, ROUGE score, BERTScore, Pearson correlation, Spearman correlation
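As a reminder of how the basic classification metrics relate to each other, here is a small pure-Python sketch for a single positive class. The function names are hypothetical and the code is independent of cmkt; how cmkt averages over classes, for example, is not covered here.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold labels."""
    if not y_true:
        return 0.0
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall, and F1 for one class treated as the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Returning 0.0 when a denominator is zero is one common convention; other libraries may warn or raise instead.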

License

MIT License

