jokame / Kaggle---Bag-of-Words-Meets-Bag-of-Popcorns-using-Word2vec-in-R

An entry to the Kaggle competition Bag of Words Meets Bag of Popcorns, using word2vec in R

Kaggle Bag of Words Meets Bag of Popcorns using Word2vec in R

To get the competition data, click here

####Packages needed:

  • rword2vec
  • Rcpp and RcppArmadillo
  • rpart and randomForest
  • tm

####Code Explanation:

  • Word vectors are obtained using the rword2vec package.
  • The binary output file is converted into a text file for further processing.
  • To create a training dataset for sentiment classification of the reviews from these word vectors, two popular methods can be used:
  1. Vector Averaging
  2. Clustering
  • In the first method, the word vectors are averaged for each row of the labeled and test datasets. There are many ways to do this, but I have done this part using Rcpp and RcppArmadillo (R interface to C++) to speed up these compute-intensive operations.
  • In clustering, we use a bag of centroids instead of a bag of words. This part is also done using Rcpp and RcppArmadillo to optimize speed.
  • Finally, classification is done using random forest.
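
The two feature-building methods above can be sketched roughly as follows. This is a minimal illustration in plain standard C++, not the repository's actual Rcpp/RcppArmadillo code: the names `Embedding`, `ClusterMap`, `average_vector` and `bag_of_centroids` are hypothetical, reviews are assumed to be already cleaned and whitespace-separated, and the word-to-cluster assignment is assumed to have been computed beforehand (for example by k-means on the word vectors).

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical type names for illustration; the repository's code may differ.
using Embedding = std::unordered_map<std::string, std::vector<double>>;
using ClusterMap = std::unordered_map<std::string, std::size_t>;

// Method 1: average the vectors of all in-vocabulary words in a review.
// Words missing from the vocabulary are skipped; a review with no known
// words yields an all-zero vector.
std::vector<double> average_vector(const std::string& review,
                                   const Embedding& vectors,
                                   std::size_t dim) {
    std::vector<double> mean(dim, 0.0);
    std::size_t known = 0;
    std::istringstream words(review);
    std::string word;
    while (words >> word) {
        auto it = vectors.find(word);
        if (it == vectors.end()) continue;   // out-of-vocabulary word
        assert(it->second.size() == dim);    // all vectors share one dimension
        for (std::size_t i = 0; i < dim; ++i) mean[i] += it->second[i];
        ++known;
    }
    if (known > 0) {
        for (double& x : mean) x /= static_cast<double>(known);
    }
    return mean;
}

// Method 2: bag of centroids. Each word already has a cluster index;
// the review is represented by how many of its words fall in each cluster.
std::vector<int> bag_of_centroids(const std::string& review,
                                  const ClusterMap& cluster_of,
                                  std::size_t num_clusters) {
    std::vector<int> counts(num_clusters, 0);
    std::istringstream words(review);
    std::string word;
    while (words >> word) {
        auto it = cluster_of.find(word);
        if (it == cluster_of.end()) continue;  // word has no cluster
        assert(it->second < num_clusters);     // cluster index must be valid
        ++counts[it->second];
    }
    return counts;
}
```

Either function produces one fixed-length feature row per review, which is the shape a classifier such as a random forest expects as input.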

####Note: I'd recommend reading this Python tutorial series first for a better understanding of vector averaging and clustering.

####Test dataset results:

image

Classification using Vector Averaging

image2

Classification using Clustering

####Results:

  • The accuracy obtained with vector averaging and with bag of centroids exceeds their respective thresholds, but it is still quite low.
  • Accuracy can be improved by using different machine learning algorithms, such as GBM, xgboost, or neural networks, and by using techniques such as stacking, blending, and bagging.

Languages

R 84.7%, C++ 15.3%