anilsenay / CSE4095S22_Grp4

CSE4095 Introduction to NLP - Project

cse4095 nlp nlp-machine-learning

Natural Language Processing Project

Table of Contents

Delivery-1
Delivery-2
Delivery-3
Contributors

Delivery-1

Preprocessing

Collocation extraction takes three steps:

Preprocessing
- Removing stopwords
- Stemming and lemmatization
Extracting the bigrams and trigrams
Generating collocations using the association methods listed under Methods
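
The first two steps can be sketched in a few lines of plain Python. This is a minimal illustration rather than the project's notebook code: the stopword set, the helper names `extract_ngrams` and `ngram_counts`, and the sample sentence are all assumptions, and stemming is left out for brevity.

```python
from collections import Counter

# Hypothetical, tiny stopword set; a real run would use a full Turkish list.
STOPWORDS = {"ve", "bir", "bu"}

def extract_ngrams(tokens, n):
    """Return every contiguous n-token tuple from a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_counts(text, n):
    """Lowercase, split on whitespace, drop stopwords, then count n-grams."""
    tokens = [t for t in text.lower().split() if t not in STOPWORDS]
    return Counter(extract_ngrams(tokens, n))

# Made-up sample text, used only to show the shape of the output.
sample = "yargıtay ceza dairesi ve yargıtay ceza genel kurulu"
bigrams = ngram_counts(sample, 2)
trigrams = ngram_counts(sample, 3)
print(bigrams.most_common(3))
```

Note that removing stopwords before building n-grams makes words adjacent that were not adjacent in the original text, which affects which collocations can be found.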

N-GRAM Frequencies

Methods

Raw Frequency
PMI
T-Test
Chi-Square
Likelihood Ratio
Poisson Stirling
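
As a rough illustration of how four of these measures score a bigram, the sketch below uses the standard textbook formulas computed from raw counts. The function and parameter names are assumptions for this example, not names from the project: `c12` is the bigram count, `c1` and `c2` are the counts of the first and second word, and `n` is the total number of bigrams. Likelihood Ratio and Poisson Stirling are omitted to keep the example short.

```python
import math

def raw_frequency(c12, n):
    """Share of all bigrams that are this bigram."""
    return c12 / n

def pmi(c12, c1, c2, n):
    """Pointwise mutual information in bits; requires c12 > 0."""
    return math.log2(c12 * n / (c1 * c2))

def t_score(c12, c1, c2, n):
    """t-test score: (observed - expected) / sqrt(observed); requires c12 > 0."""
    expected = c1 * c2 / n
    return (c12 - expected) / math.sqrt(c12)

def chi_square(c12, c1, c2, n):
    """Pearson chi-square statistic for the 2x2 contingency table."""
    o11 = c12                # w1 followed by w2
    o12 = c1 - c12           # w1 followed by another word
    o21 = c2 - c12           # another word followed by w2
    o22 = n - c1 - c2 + c12  # neither w1 first nor w2 second
    num = n * (o11 * o22 - o12 * o21) ** 2
    den = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    return num / den

# Made-up counts: the bigram occurs 10 times, each word 20 times, in 1000 bigrams.
for name, value in [
    ("raw frequency", raw_frequency(10, 1000)),
    ("PMI", pmi(10, 20, 20, 1000)),
    ("t-score", t_score(10, 20, 20, 1000)),
    ("chi-square", chi_square(10, 20, 20, 1000)),
]:
    print(f"{name}: {value:.4f}")
```

When the two words are statistically independent (for example `c12=10, c1=100, c2=100, n=1000`), PMI, the t-score, and chi-square all come out as zero.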

Raw Frequency Results

PMI Results

T-Test Results

Chi-Square Results

Likelihood Ratio Results

Poisson Stirling Results

Comparison of Methods

Delivery-2

Development Process

Building the text classifier takes three steps:

Preprocessing
- Removing stopwords
- Tokenization
- Stemming
Classifier
Evaluation

Preprocessing

All words in the dataset are converted to lowercase
Punctuation marks are removed
The text is tokenized, with filtering such as stopword removal and regex patterns applied
Stemming is applied to the remaining words
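
The four steps above might look roughly like the following. This is a sketch under stated assumptions: the stopword set, the regex, and the suffix-stripping `toy_stem` helper are placeholders invented for illustration, and the project may well have used a proper Turkish stemmer instead.

```python
import re
import string

# Hypothetical stopword set and suffix list, for illustration only.
STOPWORDS = {"ve", "ile", "bir", "bu"}
PLURAL_SUFFIXES = ("ler", "lar")

def toy_stem(word):
    """Strip a plural suffix; a stand-in for a real Turkish stemmer."""
    for suffix in PLURAL_SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    # 1. Lowercase. Python's str.lower() is not Turkish-aware
    #    ("I" becomes "i", not "ı"), so real code may need special handling.
    text = text.lower()
    # 2. Remove ASCII punctuation.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # 3. Tokenize: keep runs of letters only (drops digits), then filter stopwords.
    tokens = re.findall(r"[^\W\d_]+", text)
    tokens = [t for t in tokens if t not in STOPWORDS]
    # 4. Stem each remaining token.
    return [toy_stem(t) for t in tokens]

print(preprocess("Sanık ve Mağdurlar, 2020 yılında mahkemeye başvurdu."))
```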

General Structure

The dataset files are read
Labels are created according to the dataset
A Data.json file is created to hold the labels we decided on

Generating Training Datasets

Write data into a CSV file
- If no train set has been created before, the CSV file is created
Read the train set from CSV
- If a previously created CSV file is available, it is read. “Suç” and “İçtihat” are prepared as lists
Split the dataset
- The dataset is split using the common ratio of 80% for the training set and 20% for the test set
Vectorize
- A TF-IDF matrix is created using the data from our test set
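
The split and vectorize steps can be sketched with the standard library alone. The helper names `split_dataset` and `tfidf_matrix`, the fixed seed, and the whitespace tokenization are assumptions for this example. The weighting is the plain textbook `tf * log(N / df)`, which differs from the smoothed variant that libraries such as scikit-learn apply by default.

```python
import math
import random
from collections import Counter

def split_dataset(texts, labels, test_ratio=0.2, seed=42):
    """Shuffle indices reproducibly and split into train and test parts."""
    indices = list(range(len(texts)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(indices) * test_ratio))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_train = [texts[i] for i in train_idx]
    x_test = [texts[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return x_train, x_test, y_train, y_test

def tfidf_matrix(docs):
    """Return (vocabulary, rows), one TF-IDF row per document."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    # Document frequency: number of documents containing each term.
    df = Counter(t for doc in tokenized for t in set(doc))
    idf = {t: math.log(len(tokenized) / df[t]) for t in vocab}
    rows = []
    for doc in tokenized:
        tf = Counter(doc)
        rows.append([tf[t] / len(doc) * idf[t] if doc else 0.0 for t in vocab])
    return vocab, rows

# Made-up documents and labels, only to exercise the functions.
texts = [f"belge {i}" for i in range(10)]
labels = ["a", "b"] * 5
x_train, x_test, y_train, y_test = split_dataset(texts, labels)
vocab, rows = tfidf_matrix(["a b", "a c"])
print(len(x_train), len(x_test), vocab)
```

A term that appears in every document, such as `a` in the last call, gets an IDF of zero and therefore contributes nothing to any row.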

Classifiers

Support Vector Machines (specifically, linear SVM)
Multinomial Naive Bayes
Logistic Regression
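
To show how one of these classifiers works internally, here is a from-scratch Multinomial Naive Bayes with Laplace smoothing. It is a teaching sketch, not the project's implementation, which more likely relied on a library. The class name `TinyMultinomialNB`, the token lists, and the two label names are all made up for the example.

```python
import math
from collections import Counter, defaultdict

class TinyMultinomialNB:
    """Minimal Multinomial Naive Bayes over token lists (illustrative only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing strength

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {t for counts in self.word_counts.values() for t in counts}
        return self

    def predict(self, tokens):
        n_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_label in self.class_counts.items():
            total = sum(self.word_counts[label].values())
            denom = total + self.alpha * len(self.vocab)
            score = math.log(n_label / n_docs)  # log prior
            for t in tokens:
                if t in self.vocab:  # ignore words never seen in training
                    score += math.log((self.word_counts[label][t] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Hypothetical training data: token lists with made-up labels.
train_docs = [["hırsızlık", "mal"], ["hırsızlık", "ev"],
              ["yaralama", "bıçak"], ["yaralama", "kavga"]]
train_labels = ["theft", "theft", "injury", "injury"]
model = TinyMultinomialNB().fit(train_docs, train_labels)
print(model.predict(["bıçak", "kavga"]))
```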

Support Vector Machine Results

Multinomial Naive Bayes Results

Logistic Regression Results

Delivery-3

Classifiers

FastText
LSTM

FastText

The dataset is taken from the previous iteration
Labels are created according to the dataset
Words within a label name are joined with underscores to prevent ambiguity
The __label__ prefix is added to each label for model creation
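
The label formatting described above can be sketched as follows. The helper names and the example label "kasten yaralama" are hypothetical, chosen only to show the transformation; they are not taken from the project's label set.

```python
def to_fasttext_label(name):
    """Join the words of a label with underscores and add the __label__ prefix."""
    return "__label__" + "_".join(name.split())

def to_fasttext_line(label_name, text):
    """Build one training line: the label, a space, then the text on a single line."""
    return f"{to_fasttext_label(label_name)} {' '.join(text.split())}"

# Hypothetical label and text, for illustration only.
print(to_fasttext_line("kasten yaralama", "sanık mağdur"))
```

Without the underscore, FastText would read only the first word as the label and treat the rest as part of the text. A file of such lines is the input format expected by the `fasttext` Python package's `train_supervised(input=...)` function.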

FastText Results

LSTM Results

Contributors ✨

Thanks go to these wonderful people:

Anıl Şenay

Bilgehan Geçici

Kürşat Açıkgöz

Ahmet Önkol

Ahmet Elburuz Gürbüz

This project follows the all-contributors specification. Contributions of any kind welcome!

