clairempr / spooky-classify

Text classification with scikit-learn, used to make predictions for Kaggle Spooky Author Identification competition

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Spooky-Classify

Claire Pritchard
January 2018

Text classification with scikit-learn and spaCy. This was used to identify the author and generate predictions for the Kaggle Spooky Author Identification Competition, December, 2017. The dataset consists of text written by Edgar Allan Poe, HP Lovecraft, and Mary Shelley.

The data files can be downloaded from Kaggle. Training data is in train.csv, and the test data set for generating predictions is in test.csv.

As you can see when plotting the distribution of author labels in the training dataset with matplotlib, there are quite a few more samples from Poe than from Lovecraft or Shelley. Rather than trying to find more Lovecraft and Shelley samples, I chose to resample using imbalanced-learn. Author distribution

The model I finally arrived at is a VotingClassifier using as estimators the three classifiers with predict_proba support that had the highest accuracy. The VotingClassifier performed slightly better than the individual models, which were MultinomialNB, BernoulliNB, and MLPClassifier.

Accuracy was also improved slightly by the addition of a few new features: sentence length and standard deviation of the lengths of the words in the sentence. The sentences were tokenized using spaCy.

After fitting the model, I got a score of 0.9988 on the training data and 0.8652 on the data held out for testing. Making predictions for the held out test data resulted in the following classification report and confusion matrix:

precision recall f1-score support
EAP 0.83 0.92 0.87 1999
HPL 0.92 0.80 0.86 1388
MWS 0.86 0.86 0.86 1508
avg / total 0.87 0.87 0.86 4895

Confusion matrix

About

Text classification with scikit-learn, used to make predictions for Kaggle Spooky Author Identification competition

License:Apache License 2.0


Languages

Language:Python 100.0%