slmttndrk / Turkish_Sentiment_Analysis_With_Multinomial_Naive_Bayes

THIS PROJECT IS ABOUT TURKISH SENTIMENT ANALYSIS

1. INTRODUCTION

  • In this project, I trained a Sentiment Analyzer/Text Classifier for Beyazperde movie reviews. There are many resources for English Sentiment Analysis, but resources for Turkish are limited. I started this project to add to the resources available for Turkish Sentiment Analysis.

  • Sentiment Analysis is a branch of Natural Language Processing. Projects in this field usually start with some unlabeled data, and the goal is to predict which class each item belongs to. This process is carried out in a series of Sentiment Analysis steps.


2. SENTIMENT ANALYSIS STEPS


2.1. DATA FETCHING

  • The first requirement is an adequate dataset to train your model effectively. Here, I use sample movie reviews from Beyazperde. You can find the dataset at this link.


2.2. DATA PREPROCESSING

  • This is the crucial step for any kind of Machine Learning model training. Real-life data is not always clean, so you must clean your dataset as much as possible. A common rule of thumb in Machine Learning is that data preprocessing/cleaning takes about 80% of the overall work and modelling about 20%. I therefore split data preprocessing into sub-steps.


2.2.1. LOAD DATASET

  • The dataset is a CSV file. For more data, please contact me.
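
Loading a CSV file with Pandas might look like the sketch below. The column names `comment` and `label` and the two sample rows are assumptions made for illustration; the real Beyazperde file may be laid out differently.

```python
import io

import pandas as pd

# Hypothetical stand-in for the Beyazperde CSV file (assumed column names).
csv_text = "comment,label\nHarika bir film,1\nÇok sıkıcıydı,0\n"

# For a real file, pass its path to read_csv instead of the StringIO object.
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)
print(df.head())
```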

2.2.2. ELIMINATE NAN VALUES

  • NaN values are not useful for training the model
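
One possible way to drop rows with missing values, assuming a Pandas DataFrame with the hypothetical columns `comment` and `label`:

```python
import numpy as np
import pandas as pd

# Toy data with missing values (column names are assumptions).
df = pd.DataFrame({
    "comment": ["Harika bir film", np.nan, "Çok sıkıcıydı"],
    "label": [1, 0, np.nan],
})

# Keep only the rows where both the text and the label are present.
clean = df.dropna(subset=["comment", "label"]).reset_index(drop=True)

print(len(df), "rows before,", len(clean), "rows after")
```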

2.2.3. ARRANGE DATASET TO AVOID OVERFITTING

  • If the class sizes are unbalanced, the model tends to overfit to the larger class when predicting, so the classes are balanced first
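
The sketch below balances the classes by downsampling every class to the size of the smallest one. Downsampling is only one option and is an assumption here; the project may balance the data differently. Column names are again hypothetical.

```python
import pandas as pd

# Toy unbalanced data: 7 positive and 3 negative comments (assumed columns).
df = pd.DataFrame({
    "comment": [f"yorum {i}" for i in range(10)],
    "label": [1] * 7 + [0] * 3,
})

# Size of the smallest class.
n_min = df["label"].value_counts().min()

# Sample n_min rows from each class, then shuffle the combined result.
parts = [group.sample(n=n_min, random_state=42) for _, group in df.groupby("label")]
balanced = pd.concat(parts).sample(frac=1, random_state=42).reset_index(drop=True)

print(balanced["label"].value_counts())
```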

2.2.4. ELIMINATE TURKISH STOPWORDS AND PUNCTUATIONS

  • Stopwords and punctuation are unnecessary for training the model. For more stopwords, please contact me
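
A minimal sketch of this step using only the standard library. The stopword set is a tiny hypothetical sample, not the list used in the project, and the function name is made up for illustration.

```python
import string

# Tiny hypothetical sample of Turkish stopwords; a real list is much longer.
TURKISH_STOPWORDS = {"ve", "bir", "bu", "çok", "ama", "için"}


def remove_stopwords_and_punctuation(text):
    """Lowercase the text, strip ASCII punctuation, and drop stopwords."""
    # Note: str.lower() maps "I" to "i", not to the Turkish dotless "ı",
    # so real Turkish text may need a locale-aware lowercasing step.
    table = str.maketrans("", "", string.punctuation)
    tokens = text.lower().translate(table).split()
    return " ".join(t for t in tokens if t not in TURKISH_STOPWORDS)


cleaned = remove_stopwords_and_punctuation("Bu film çok güzel, ama biraz uzun!")
print(cleaned)
```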

2.2.5. NORMALIZATION

  • This corrects misspelled words and discards meaningless ones

2.2.6. STEMMING/LEMMATIZATION

  • This removes the suffixes and gives us the root of each word

2.3. DATA CLASSIFICATION

  • In this step, you choose a Machine Learning algorithm for Sentiment Analysis/Text Classification. Any classification algorithm can be used, but I chose the Multinomial Naive Bayes algorithm because it gives better accuracy scores on Sentiment Analysis/Text Classification. This algorithm assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. I also split data classification into sub-steps.
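
The independence assumption can be written out as follows. This is the standard textbook form of Multinomial Naive Bayes; the symbols are introduced here for illustration and do not come from the project: $c$ is a class, $d$ a document, $w_i$ the $i$-th vocabulary term, $f_i$ its count (or weight) in $d$, $N_{ic}$ the total count of $w_i$ in the training documents of class $c$, $N_c = \sum_i N_{ic}$, $|V|$ the vocabulary size, and $\alpha$ the smoothing parameter.

```latex
P(c \mid d) \;\propto\; P(c) \prod_{i=1}^{|V|} P(w_i \mid c)^{f_i},
\qquad
\hat{P}(w_i \mid c) = \frac{N_{ic} + \alpha}{N_c + \alpha\,|V|}
```

The predicted class is the $c$ that maximizes the left-hand expression; because the features are treated as independent given the class, the per-term probabilities simply multiply.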


2.3.1. SPLITTING TRAIN AND TEST DATA

  • Usually, we partition the dataset into 80% as training and 20% as testing data
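
An 80/20 split with Sklearn could be sketched like this. The toy texts and labels are placeholders, and `stratify` and `random_state` are choices made for the example rather than settings taken from the project.

```python
from sklearn.model_selection import train_test_split

# Placeholder data: 10 comments with alternating labels.
texts = [f"yorum {i}" for i in range(10)]
labels = [1, 0] * 5

# test_size=0.2 keeps 80% for training; stratify preserves the class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

print(len(X_train), "training samples,", len(X_test), "test samples")
```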

2.3.2. VECTORIZATION

  • There are some methods such as Bag Of Words, Count Vectorizer and Tfidf Vectorizer. I chose Tfidf Vectorizer.

2.3.3. GRIDSEARCHCV

  • This method enables us to find the best hyperparameters for the model
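
As a sketch, a grid search over the smoothing parameter `alpha` of `MultinomialNB` could look like the following. The candidate values, the 2-fold setting, and the toy data are assumptions for the example, not the grid used in the project.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

# Made-up comments: four positive (1) and four negative (0).
texts = ["film güzel", "harika film", "güzel oyuncu", "harika sahne",
         "film kötü", "berbat film", "kötü oyuncu", "berbat sahne"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

X = TfidfVectorizer().fit_transform(texts)

# Try each alpha with 2-fold cross-validation and keep the best one.
param_grid = {"alpha": [0.1, 0.5, 1.0]}
search = GridSearchCV(MultinomialNB(), param_grid, cv=2, scoring="accuracy")
search.fit(X, labels)

print(search.best_params_, search.best_score_)
```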

2.3.4. FIT AND PREDICT

  • The model learns by fitting and analyzes the sentiment by predicting
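
Fitting and predicting can be sketched as below, on made-up data where "harika" appears only in positive comments and "berbat" only in negative ones.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up comments: four positive (1) and four negative (0).
texts = ["film güzel", "harika film", "güzel oyuncu", "harika sahne",
         "film kötü", "berbat film", "kötü oyuncu", "berbat sahne"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(texts)

# fit() estimates the class priors and per-class term probabilities.
model = MultinomialNB()
model.fit(X_train, labels)

# predict() assigns the most probable class to each new comment.
new_comments = ["harika film", "berbat film"]
predictions = model.predict(vectorizer.transform(new_comments))
print(list(predictions))
```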

2.3.5. OBSERVING ACCURACY, F1, PRECISION AND RECALL SCORES

  • These scores are useful for comparing the success of models
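
The four scores can be computed with `sklearn.metrics`. The label lists below are invented to keep the arithmetic easy to follow: 4 of 6 predictions are correct, and for the positive class there are 2 true positives, 1 false positive, and 1 false negative.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Invented true labels and predictions.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]

accuracy = accuracy_score(y_true, y_pred)    # correct / total = 4 / 6
precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 2 / 3
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 2 / 3
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall

print(accuracy, precision, recall, f1)
```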

2.3.6. OBSERVING CONFUSION MATRIX AND PREDICTION PROBABILITIES

  • The confusion matrix shows which classes the model mixes up, and the prediction probabilities give us an intuition of how confidently the model makes its predictions

  • This enables us to train our model with different samples of the same dataset, so that we can check whether it learned correctly
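
The sketch below shows a confusion matrix and prediction probabilities on made-up data. It also includes `cross_val_score`, on the assumption that training "with different samples of the same dataset" refers to cross-validation; that reading is a guess, not something the project states.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up comments: four positive (1) and four negative (0).
texts = ["film güzel", "harika film", "güzel oyuncu", "harika sahne",
         "film kötü", "berbat film", "kötü oyuncu", "berbat sahne"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels)

# Rows are true classes, columns are predicted classes (order: 0, 1).
cm = confusion_matrix(labels, model.predict(texts), labels=[0, 1])

# One row per comment, one column per class; each row sums to 1.
proba = model.predict_proba(texts)

# Assumed reading: 2-fold cross-validation refits the model on each split.
scores = cross_val_score(make_pipeline(TfidfVectorizer(), MultinomialNB()),
                         texts, labels, cv=2)

print(cm)
print(proba.round(2))
print(scores)
```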


2.4. MODEL PIPELINING AND PICKLING

  • In this step, I create a pipeline for the model. Pipelining saves us from repeating all the steps again and again. With a pipeline, when I give the model any raw unlabeled data, it first preprocesses the data and then makes a prediction. This makes our model reusable.

  • Pickling a model means serializing it into binary form, which makes the model portable. When you want to use the model in a different project, you can just load the pickled file and get predictions wherever you want.
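
A minimal sketch of a pipeline and of pickling it, assuming Sklearn's `Pipeline` with a TF-IDF step and a Multinomial Naive Bayes step. The step names and toy data are made up, and the project's own pipeline may contain additional preprocessing steps.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up comments: four positive (1) and four negative (0).
texts = ["film güzel", "harika film", "güzel oyuncu", "harika sahne",
         "film kötü", "berbat film", "kötü oyuncu", "berbat sahne"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Vectorization and classification are chained into one reusable object.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultinomialNB()),
])
pipeline.fit(texts, labels)

# Serialize to bytes; pickle.dump(pipeline, file) would write to disk instead.
blob = pickle.dumps(pipeline)

# Later, or in another project: restore the model and predict on raw text.
restored = pickle.loads(blob)
print(list(restored.predict(["harika film"])))
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code. A text-cleaning function could be attached through the `preprocessor` argument of `TfidfVectorizer` so that raw input is cleaned inside the pipeline.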


3. IMPROVEMENTS

  • The model score can be improved by enlarging the Turkish stopword list or the Beyazperde dataset. In both cases, the model will be trained more effectively.


4. CONCLUSION

  • In this project, I learned the concepts of Text Classification/Sentiment Analysis. It also gave me a knowledge base for Natural Language Processing, since getting and preprocessing the dataset is the crucial part of any Machine Learning model training.


5. RESOURCES/THANKS

and data preprocessing tools (normalization, stemming) were provided to me by them. I also used Python libraries such as Sklearn, Pandas, Numpy, and Nltk.


License:GNU General Public License v3.0


Languages

Language:Jupyter Notebook 100.0%