A neural model that extracts the essence out of movie reviews. The model first looks through a large number of movie reviews, forms different n-grams (e.g. unigrams / bigrams / trigrams) and uses POS tagging to filter out the meaningful n-grams for movie reviews. Once those n-grams are filtered, GloVe embeddings and AffinityPropagation Clustering is used for grouping the remaining n-grams into meaningful clusters. Meaningful clusters are later on labeled by hand (600-1000 clusters) for their relevance to a domain. During prediction time the clusters are used with K-Nearest-Neighbours algorithm to predict which of the new phrases are meaningful. Additional filtering, compression and recommendations are applied before presenting the final result.
This model is inspired by the GooglePlay model for review summarization:
Input: 80 Midnight Cowboy User Reviews
Output:
[{'count': 2, 'n_size': 3, 'phrase': 'high quality film', 'score': 2.8},
{'count': 8, 'n_size': 2, 'phrase': 'beautiful film', 'score': 11.2},
{'count': 7, 'n_size': 2, 'phrase': 'fine film', 'score': 9.8},
{'count': 6, 'n_size': 2, 'phrase': 'academy award', 'score': 6.48},
{'count': 4, 'n_size': 2, 'phrase': 'sad film', 'score': 5.6},
{'count': 4, 'n_size': 2, 'phrase': 'fabulous movie', 'score': 5.6},
{'count': 4, 'n_size': 2, 'phrase': 'sexual abuse', 'score': 5.6},
{'count': 3, 'n_size': 2, 'phrase': 'sexual feelings', 'score': 4.2},
{'count': 3, 'n_size': 2, 'phrase': 'powerful film', 'score': 4.2},
{'count': 3, 'n_size': 2, 'phrase': 'insightful character', 'score': 4.2},
{'count': 3, 'n_size': 2, 'phrase': 'only way', 'score': 3.24},
{'count': 2, 'n_size': 2, 'phrase': 'outstanding performances', 'score': 2.8},
{'count': 2, 'n_size': 2, 'phrase': 'sexual revolution', 'score': 2.8},
{'count': 2, 'n_size': 2, 'phrase': 'sexual theatrics', 'score': 2.8},
{'count': 2, 'n_size': 2, 'phrase': 'stellar performances', 'score': 2.8},
{'count': 2, 'n_size': 2, 'phrase': 'sympathetic handsome', 'score': 2.8},
{'count': 54, 'n_size': 1, 'phrase': 'good', 'score': 58.33},
{'count': 53, 'n_size': 1, 'phrase': 'great', 'score': 57.24},
{'count': 43, 'n_size': 1, 'phrase': 'big', 'score': 53.32},
{'count': 23, 'n_size': 1, 'phrase': 'bad', 'score': 24.85}]
Current version : 0.0.0.1
To Be Added
To Be Added
to use jupyter notebooks in the created virtual environment, follow these instructions: https://help.pythonanywhere.com/pages/IPythonNotebookVirtualenvs/
The model is trained in mostly unsupervised manor with a small dataset that needs to be labeled by hand (600-1000 entries). During training n-grams (phrases) are extracted from the 25,000 IMDB reviews and then they are preprocessed and clustered. The cluster centers and labels of each phrase are the output of the model - they are saved in a csv file for later labeling and usage.
For more information about the labeling strategy check model/saved
directory
Predictions go through several steps:
- Pre-processing - makes sure that movies' reviews are broken down into n-grams and all clutter is removed.
- Affinity Model Predictions - predicts the cluster ids of all the remaining n-grams. N-grams that do not fall into the good clusters group are filtered out.
- Summary Recommendations - applies additional steps to filter, compress and recommend the best summaries
The IMDBPreprocessor
has the following public APIs:
load_data
- loads a csv / tsv file with reviews related to a movieprepare_data
- splits all the data into n-grams, POS tagging and POS template filtering
Additionally unigrams
, bigrams
and trigrams
are also public and
ready to use after prepare_data
has been called.
The AffinityClusterModel
has the following important public APIs:
save_model
andload_model
predict
- given n-gram phrases - vectorizes them and uses KNN to predict the most appropriate cluster labelget_phrases_in_good_clusters
- filters any predictions that do not fall into the manually labeled good clusters
The SummaryRecommender
model has the following important APIs:
generate_phrases
- loads reviews about a movie from a file, internally calls theIMDBPreprocessor
APIsrecommend_phrases
- suggests recommended summaries. Internally callsAffinityClusterModel
and then applies Blacklist Filtering, Compression and Summary Recommendations on top
To Be Added
George Stoyanov (georgi.val.stoyan0v@gmail.com)