Logahn / language-classification-system

A model that utilizes text data preprocessing, feature engineering, correlation reduction using PCA, and training a Decision Tree Classifier to classify languages based on input text data.

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Language Classification System

This Jupyter notebook contains a system for classifying languages based on text data. It includes data preprocessing, feature engineering, correlation reduction, and machine learning model training.

Modules and Packages Used

The following modules and packages were used in this notebook:

  • Matplotlib
  • Pickle
  • Scikit-learn
  • Numpy
  • Pandas
  • Math
  • Collections
  • Itertools
  • Seaborn

Dataset

The dataset used in this notebook is stored in a CSV file named 'language.csv'. The dataset was preprocessed to remove any missing data and convert the 'text' and 'language' columns to strings.

Creating Set of Features

A set of features was created based on the text data. It includes word count, character count, word density, punctuation count, vowel and consonant character count, exclamation and question mark count, unique words count, repeat words count, and more.

Correlation Reduction

Principal Component Analysis (PCA) was applied to reduce the correlation between the features.

Machine Learning Model

A Decision Tree Classifier was trained on the dataset and used to predict the language of text data. The trained model was saved using Pickle. The accuracy score of the model is displayed in a confusion matrix.

Usage

To use this system, you can run the code in the Jupyter notebook and provide your own text data to predict its language.

Link

Google colab file is located at https://colab.research.google.com/drive/1M_zRJwISxTOL4SU2Yo9V4ZcQHzYqpozU

About

A model that utilizes text data preprocessing, feature engineering, correlation reduction using PCA, and training a Decision Tree Classifier to classify languages based on input text data.


Languages

Language:Python 100.0%