The goal of this project is to train a classifier to distinguish real from fake reviews on TripAdvisor.
## Data
#### All data consists of reviews of a subset of Chicago hotels, drawn from several sources
- 4153 real reviews
- scraped from Expedia, where reviews are inherently real
    - in order to post a review you must have booked the hotel through Expedia
- 800 fake reviews
- written by paid workers on Amazon Mechanical Turk, for the same subset of Chicago hotels analyzed
- thank you to M. Ott, Y. Choi, C. Cardie, and J. T. Hancock for providing me with this data! [Finding Deceptive Opinion Spam by Any Stretch of the Imagination. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 2011] You guys saved me a ton of money and time!
- 2614 'unlabeled' reviews
- scraped from TripAdvisor where a review can be real or fake
- anyone with an account can review a hotel, so the veracity of each review is inherently uncertain
## Methodology
- Problem:
- It is very hard to train a classifier for this task because obtaining reviews that are known to be fake is very difficult
- First step of the solution was acquiring fake reviews that people were paid to write, via Mechanical Turk
- But I didn't think 800 reviews would be enough to train a classifier well
- Second step was designing a Positive-Unlabeled (PU) learning algorithm
- which uses the SPY method to infer which reviews from the unlabeled set are fake
- Third and final step was to train a Support Vector Machine classifier on the newly partitioned data
  - aggregated the reliable negatives from the SPY method with the 800 fake reviews from Mechanical Turk [prediction = 0] and the real reviews from Expedia [prediction = 1]
  - trained and tested the classifier on this data
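The final aggregate-and-train step can be sketched roughly as below. The feature matrices here are synthetic stand-ins for the real review vectors (shapes and sizes are illustrative, not the project's actual data):

```python
# Sketch of the aggregation + SVM training step. Synthetic Gaussian features
# stand in for the real vectorized reviews; labels follow the scheme above.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
rn_reviews    = rng.normal(loc=-1, size=(50, 10))   # reliable negatives from SPY
mturk_fakes   = rng.normal(loc=-1, size=(80, 10))   # 800 in the real project
expedia_reals = rng.normal(loc=+1, size=(150, 10))

# fake -> 0, real -> 1
X = np.vstack([rn_reviews, mturk_fakes, expedia_reals])
y = np.concatenate([np.zeros(130), np.ones(150)])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = SVC(kernel="rbf").fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
```

On real data the features would come from a text vectorizer (e.g. TF-IDF) rather than random draws.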
## Positive-Unlabeled (PU) learning algorithm
- Step 1:
- Infer a set of fake reviews from the unlabeled set using the spy method
- SPY method:
- determines a set of 'reliable negative' (RN) documents (here, reviews inferred to be fake)
  - using only unlabeled (U) and positive (P) data
- pseudo-code:

  ```
  RN = ∅
  S  = 15% of P, randomly selected        # the 'spies'
  U' = U ∪ S    -> label: 0
  P' = P - S    -> label: 1
  run I-EM on (U', P')                    # produces a Naive Bayes classifier
  classify each document in U' using that NB classifier
  determine a probability threshold th using S
    # this part is somewhat arbitrary: look at the probabilities
    # the spies in S are assigned, and choose th from there
  loop and update:
  for each document d ∈ U':
      if Pr(1|d) < th:
          RN = RN ∪ {d}
  ```
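A minimal Python sketch of the SPY step is below. It uses a single Naive Bayes fit where the full algorithm would run I-EM (several EM rounds of this fit/predict cycle), the threshold is taken as a low quantile of the spies' scores (one common heuristic, not the only choice), and all data is synthetic; `spy_reliable_negatives` is a name invented for illustration:

```python
# Sketch of the SPY method: plant spies from P into U, fit NB, and take as
# reliable negatives the unlabeled docs scored below most spies.
import numpy as np
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)

def spy_reliable_negatives(P, U, spy_frac=0.15, spy_quantile=0.05):
    """Return indices into U of inferred reliable-negative documents."""
    n_spies = max(1, int(spy_frac * len(P)))
    spy_idx = rng.choice(len(P), size=n_spies, replace=False)
    spies = P[spy_idx]
    P_prime = np.delete(P, spy_idx, axis=0)     # P' = P - S
    U_prime = np.vstack([U, spies])             # U' = U ∪ S
    X = np.vstack([P_prime, U_prime])
    y = np.concatenate([np.ones(len(P_prime)), np.zeros(len(U_prime))])
    clf = MultinomialNB().fit(X, y)             # stand-in for I-EM
    # threshold th: the probability below which almost no spy falls
    spy_probs = clf.predict_proba(spies)[:, 1]
    th = np.quantile(spy_probs, spy_quantile)
    u_probs = clf.predict_proba(U)[:, 1]
    return np.where(u_probs < th)[0]            # RN = {d in U : Pr(1|d) < th}

# synthetic word-count data with different distributions for P and U
P = rng.poisson(lam=[3.0, 0.5, 0.5], size=(100, 3))
U = rng.poisson(lam=[0.5, 3.0, 3.0], size=(60, 3))
rn = spy_reliable_negatives(P, U)
```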
##Results
- SPY method:
  - ran it over 10 runs (with 20% spies); on average it misclassified only 2.5% of the real reviews as fake
- already much better results than I was expecting
- Final classification
- Ran sklearn's grid search to tune the SVC parameters
- Best performing: SVC(kernel='rbf', C=1, gamma=1, probability=True)
- cross-validated 10 times with a mean accuracy of 88.2%
- NOTE: untuned SVC was returning around 72% accuracy, so huge improvement!!
- example iteration:
- accuracy: 0.909
- precision: 0.907
- recall: 0.976
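The tuning step above might look like the following, using sklearn's `GridSearchCV` on synthetic data; the grid values are illustrative, not the project's exact search space:

```python
# Sketch of SVC hyperparameter tuning with GridSearchCV (5-fold CV here).
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # synthetic stand-in labels

grid = {"kernel": ["rbf", "linear"], "C": [0.1, 1, 10], "gamma": [0.1, 1, "scale"]}
search = GridSearchCV(SVC(probability=True), grid, cv=5)
search.fit(X, y)

best = search.best_estimator_   # the winning parameter combination
```

`search.best_params_` and `search.best_score_` report the winning settings and their mean CV score, which is how a configuration like `SVC(kernel='rbf', C=1, gamma=1, probability=True)` would be selected.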
These guys did a similar project using the dataset from the M. Ott, Y. Choi, C. Cardie, and J. T. Hancock paper,
and these are their results:
So I actually did a little better than they did! Especially considering they did not add external, novel data after running the SPY step.