anilsenay / CSE4095S22_Grp4

CSE4095 Introduction to NLP - Project

Natural Language Processing Project

Delivery-1

Preprocessing

Extracting collocations takes three steps:

  • Preprocessing
    • Removing stopwords
    • Stemming and Lemmatization
  • Extracting the Bigrams and Trigrams
  • Generating Collocations using related methods
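The first two steps can be sketched as follows. This is a minimal illustration rather than the project's notebook code: the stopword list and the sample sentence are assumptions, and stemming and lemmatization are left out because they depend on a language-specific tool.

```python
from collections import Counter

# Hypothetical, tiny stopword list; a real run would use a full list.
STOPWORDS = {"ve", "ile", "bu", "bir"}

def remove_stopwords(tokens):
    """Drop tokens that appear in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]

def ngrams(tokens, n):
    """Return every contiguous window of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Made-up sample text, split on whitespace.
tokens = remove_stopwords("mahkeme kararı ve mahkeme kararı ile bu dava".split())

bigram_counts = Counter(ngrams(tokens, 2))
trigram_counts = Counter(ngrams(tokens, 3))

print(bigram_counts.most_common(3))
print(trigram_counts.most_common(3))
```

The resulting counts are the input to the collocation measures: each measure scores an n-gram from its own frequency and the frequencies of its parts.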

N-GRAM Frequencies

image

Methods

  • Raw Frequency
  • PMI
  • T-Test
  • Chi-Square
  • Likelihood Ratio
  • Poisson Stirling
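As a rough sketch of how these measures score a single bigram, the function below computes each of them from four counts: the frequency of the first word (`c1`), of the second word (`c2`), of the bigram (`c12`), and the total number of bigrams (`n`). The function name, the formulations chosen (for example the t-score approximation that uses the observed bigram count as the variance, and the base-2 Poisson-Stirling form), and the counts in the usage line are assumptions for illustration, not the project's code or results.

```python
import math

def association_scores(c1, c2, c12, n):
    """Score one bigram from unigram counts c1, c2, bigram count c12, total n."""
    expected = c1 * c2 / n  # expected bigram count if the words were independent

    # Pointwise mutual information (base 2).
    pmi = math.log2(c12 / expected)

    # t-score, approximating the sample variance by the observed count.
    t_score = (c12 - expected) / math.sqrt(c12)

    # 2x2 contingency table: rows = first word is w1 / is not w1,
    # columns = second word is w2 / is not w2.
    o11, o12 = c12, c1 - c12
    o21, o22 = c2 - c12, n - c1 - c2 + c12
    chi_square = (n * (o11 * o22 - o12 * o21) ** 2
                  / ((o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)))

    # Log-likelihood ratio in its G^2 form: 2 * sum(O * ln(O / E)).
    observed = [[o11, o12], [o21, o22]]
    rows = [o11 + o12, o21 + o22]
    cols = [o11 + o21, o12 + o22]
    likelihood_ratio = 2 * sum(
        observed[i][j] * math.log(observed[i][j] / (rows[i] * cols[j] / n))
        for i in range(2) for j in range(2)
        if observed[i][j] > 0  # a zero cell contributes nothing
    )

    # Poisson-Stirling, one common formulation.
    poisson_stirling = c12 * (math.log2(c12 / expected) - 1)

    return {
        "raw_frequency": c12,
        "pmi": pmi,
        "t_test": t_score,
        "chi_square": chi_square,
        "likelihood_ratio": likelihood_ratio,
        "poisson_stirling": poisson_stirling,
    }

# Illustrative counts only.
scores = association_scores(c1=15828, c2=4675, c12=8, n=14307668)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

Ranking all bigrams by each score in turn gives one result list per method, which is what makes the methods comparable side by side.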

Raw Frequency Results

image

PMI Results

image

T-Test Results

image

Chi-Square Results

image

Likelihood Ratio Results

image

Poisson Stirling Results

image

Comparison of Methods

image

image

Delivery-2

Development Process

Building the classifier takes three steps:

  • Preprocessing
    • Removing stopwords
    • Tokenization
    • Stemming
  • Classifier
  • Evaluation

Preprocessing

  • Dataset words are converted to lowercase
  • Punctuation marks are removed from the dataset words
  • The dataset words are tokenized with filtering options such as removing stopwords and applying regex patterns
  • Stemming is applied to the dataset words
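The four steps above could be chained roughly as in the sketch below. The stopword list, the regex, and especially `naive_stem` are assumptions: the suffix stripper is only a stand-in for whatever stemmer the notebooks actually use.

```python
import re
import string

# Hypothetical, tiny stopword list for illustration.
STOPWORDS = {"ve", "ile", "bu", "bir", "için"}

# Naive stand-in for a real stemmer: strips a few plural suffixes.
SUFFIXES = ("ları", "leri", "lar", "ler")

def naive_stem(token):
    """Remove the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[:-len(suffix)]
    return token

def preprocess(text):
    # 1. Lowercase. Note: str.lower() does not apply Turkish casing rules
    #    for "I" and "İ"; a real pipeline may need to handle those first.
    text = text.lower()
    # 2. Remove ASCII punctuation.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # 3. Tokenize with a regex that keeps runs of letters, then drop stopwords.
    tokens = re.findall(r"[^\W\d_]+", text)
    tokens = [t for t in tokens if t not in STOPWORDS]
    # 4. Stem each remaining token.
    return [naive_stem(t) for t in tokens]

print(preprocess("Mahkemeler, bu davaları ve kararları inceledi!"))
```

Keeping the steps in one function makes it easy to apply the same preprocessing to both the training and the test documents.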

General Structure

  • Files from the dataset are read
  • Labels are created according to the dataset
  • A Data.json file is created, which holds the labels we decided on

image

Generating Training Datasets

  • Write data into a CSV file
    • If no previously created train set is available, the CSV file is created
  • Read the train set from CSV
    • If a previously created CSV file is available, it is read. “Suç” (crime) and “İçtihat” (case law) are prepared as lists
  • Split the dataset
    • The dataset is split with the common 80/20 approach: 80% for the training set and 20% for the test set
  • Vectorize
    • A TF-IDF matrix is created from the data in our test set
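A standard-library sketch of this flow is shown below. It assumes that “Suç” and “İçtihat” are the CSV column names (label and text respectively), uses a made-up file name and made-up rows, and replaces the library vectorizer with a hand-written TF-IDF so that the example stays self-contained.

```python
import csv
import math
import os
import random
import tempfile
from collections import Counter

def load_or_create_csv(path, rows):
    """Create the CSV only if it does not exist yet, then read it back."""
    if not os.path.exists(path):
        with open(path, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(["İçtihat", "Suç"])  # assumed column names
            writer.writerows(rows)
    with open(path, newline="", encoding="utf-8") as f:
        data = list(csv.DictReader(f))
    texts = [row["İçtihat"] for row in data]
    labels = [row["Suç"] for row in data]
    return texts, labels

def split_80_20(texts, labels, seed=42):
    """Shuffle the indices and cut them 80% train / 20% test."""
    indices = list(range(len(texts)))
    random.Random(seed).shuffle(indices)
    cut = int(len(indices) * 0.8)
    train, test = indices[:cut], indices[cut:]
    return ([texts[i] for i in train], [labels[i] for i in train],
            [texts[i] for i in test], [labels[i] for i in test])

def tfidf(docs):
    """Return one {term: tf * log(N / df)} dict per document."""
    tokenized = [doc.split() for doc in docs]
    df = Counter(term for doc in tokenized for term in set(doc))
    n = len(docs)
    return [
        {term: count / len(doc) * math.log(n / df[term])
         for term, count in Counter(doc).items()}
        for doc in tokenized
    ]

# Made-up rows: (text, label).
rows = [(f"belge {i} metin", "suç_a" if i % 2 else "suç_b") for i in range(10)]

with tempfile.TemporaryDirectory() as tmp:
    texts, labels = load_or_create_csv(os.path.join(tmp, "train.csv"), rows)

train_x, train_y, test_x, test_y = split_80_20(texts, labels)

# Mirrors the description above; in practice the IDF weights are usually
# fit on the training split and then reused for the test split.
weights = tfidf(test_x)
print(len(train_x), len(test_x))
```

Fixing the shuffle seed keeps the split reproducible across runs, which matters when the classifiers are compared on the same test set.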

Classifiers

  • Support Vector Machines (specifically, linear SVM)
  • Multinomial Naive Bayes
  • Logistic Regression
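To give a feel for what one of these classifiers does, here is a minimal Multinomial Naive Bayes written from scratch. It is a teaching sketch under stated assumptions, not the project's implementation: the class name, the add-one smoothing default, and the toy documents and labels are all made up, and it works on raw term counts rather than a TF-IDF matrix.

```python
import math
from collections import Counter, defaultdict

class TinyMultinomialNB:
    """Multinomial Naive Bayes over whitespace tokens with additive smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # smoothing strength (1.0 = add-one smoothing)

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)       # documents per class
        self.term_counts = defaultdict(Counter)   # term counts per class
        for doc, label in zip(docs, labels):
            self.term_counts[label].update(doc.split())
        self.vocab = {t for counts in self.term_counts.values() for t in counts}
        return self

    def predict(self, doc):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.term_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            # log prior + sum of log likelihoods of the known terms
            score = math.log(n_docs / total_docs)
            for term in doc.split():
                if term in self.vocab:  # unseen terms are ignored
                    score += math.log((counts[term] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Toy training data with hypothetical labels.
docs = ["hırsızlık mal çalma", "mal çalma suçu", "yaralama darp", "darp kavga yaralama"]
labels = ["hırsızlık", "hırsızlık", "yaralama", "yaralama"]

model = TinyMultinomialNB().fit(docs, labels)
print(model.predict("mal çalma"))
print(model.predict("darp kavga"))
```

Linear SVM and Logistic Regression take the same vectorized input but learn a weight per term instead of per-class term probabilities.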

Support Vector Machine Results

image

Multinomial Naive Bayes Results

image

Logistic Regression Results

image

Delivery-3

Classifiers

  • FastText
  • LSTM

FastText

  • The dataset is taken from the previous iteration
  • Labels are created according to the dataset
  • Label names are concatenated with underscores to prevent ambiguity, for example:

image

  • The __label__ prefix is added to each label for model creation
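The label formatting described above might look like the sketch below. fastText's supervised mode reads one example per line and, by default, recognizes labels by the `__label__` prefix, so a label containing spaces would otherwise be split into a label and ordinary words. The helper name and the sample label/text pairs are assumptions for illustration.

```python
def to_fasttext_line(label, text):
    """Format one example as '__label__<label> <text>' on a single line."""
    safe_label = "_".join(label.split())   # "kasten yaralama" -> "kasten_yaralama"
    single_line = " ".join(text.split())   # collapse newlines and extra spaces
    return f"__label__{safe_label} {single_line}"

# Made-up (label, text) pairs.
examples = [
    ("kasten yaralama", "sanık mağduru yaraladı"),
    ("hırsızlık", "sanık malı\nçaldı"),
]

lines = [to_fasttext_line(label, text) for label, text in examples]
print("\n".join(lines))
```

Writing these lines to a text file gives the training input in the format fastText expects for supervised classification.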

FastText Results

image

LSTM Results

image

Contributors ✨

Thanks goes to these wonderful people (emoji key):


Anıl Şenay

⚠️ 💻

Bilgehan Geçici

⚠️ 💻

Kürşat Açıkgöz

⚠️ 💻

Beyza

⚠️ 💻

Ahmet Önkol

⚠️ 💻

Ahmet Elburuz Gürbüz

⚠️ 💻

This project follows the all-contributors specification. Contributions of any kind welcome!


Languages

Language:Jupyter Notebook 100.0%