prateek22sri / Sentiment-analysis

Implementation of opinion mining on Kaggle IMDB dataset and Stanford Large Movie Review Dataset

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Authors: Jeshuran Thangaraj, Vaishnavi Mukundhan, Prateek Srivastava

Sentiment Analysis:

Sentiment analysis or Opinion Mining is a process of extracting the opinions in a text rather than the topic of the document. The text would have sentences that are either facts or opinions. We classify the opinions into three categories: Positive, Negative and Neutral.

Datasets:

The existing implementation[1] uses the Kaggle Dataset. In our implementation we have used the below two datasets:

  1. Kaggle IMDB
  2. Large Movie Review Dataset (Stanford Movie Review Dataset)

Existing Implementations:

  • SentiWordNet[2]:
  • SentiWordNet is a resource used to perform Opinion Mining.
  • This resource considers the synset instead of the term as each of the sense would have different opinions.
  • It gives a positive, negative and objective score to each word which totals to 1.
  • SentiWordNet uses Eight ternary classifiers to find the opinion of each word.
  • Naives Bayes to Classify the Opinions[3]:
  • In this approach, they extract a feature vector from the text and directly classify them using the NLTK's Naive Bayes Classifier.
  • This can also be done with other classifiers like SVM.

Our Approaches:

After reviewing multiple approaches, we finally narrowed down on the following two implementations which could be reviewed in the time available.

  • BiGram Model:

    • This is an add on to the pre existing unigram model implementation[1]. In this we first pre process the text by cleansing, Tokenizing, Lemmatizing and POS tagging.
    • Each word is given a score between 0-1 by passing it to the SENTIWORDNET.
    • We changed the scoring mechanism from a unigram model to a bigram model by taking the MAX absolute score of either model.
    • Accuracy :
      • Bigram 61%
      • Unigram 66%
  • Building a WordNet for our Datasets:

    • We started under the assumption that training on similar data will yield better results.
    • Trained a unigram model using 25000 movie reviews.
    • We extracted the ADJ and ADV POS-tags from the training corpus and built a frequency distribution for each word based on its occurrence in positive and negative reviews.
    • It was then used on our test set to predict opinions.
    • Accuracy:
      • Negative Test set 75.4%
      • Positive Test set 67%

Future Approaches:

  • We would like to figure out why the Bigram model is giving us a lesser accuracy than the unigram model.
  • For the second approach we would use Bigram
  • We would also like to try using LSTM.

Code Link

References:

About

Implementation of opinion mining on Kaggle IMDB dataset and Stanford Large Movie Review Dataset


Languages

Language:Python 100.0%