Authorship Detection

Detect the author, author gender, and literary period of a corpus using deep learning and shallow learning techniques. Data Source: http://www.gutenberg.org/

14 American and British authors (7 male, 7 female), 2 books each, from 7 different literary periods:

Preprocessing

Stanford NER Tagger was used to eliminate proper nouns.

Sample list of words eliminated for Mark Twain:
3040,(Bors, PERSON),(de, PERSON),(Ganis, PERSON),(Sir, PERSON),(Launcelot, PERSON),(Lake, LOCATION),(Sir, LOCATION),(Galahad, LOCATION),(Arthur, PERSON),(Round, ORGANIZATION)
Percent loss by each authors due to the elimination of proper nouns:
{'CharlesDickens': '2.579%', 'EdithWharton': '3.844%', 'FScottFitzgerald': '3.493%', 'HenryDavidThoreau': '2.162%', 'JackLondon': '2.417%', 'JaneAustin': '3.548%', 'JohnLocke': '0.234%', 'KateChopin': '3.171%', 'MargaretFuller': '1.402%', 'MarkTwain': '1.627%', 'MaryShelley': '1.355%', 'MaryWollstonecraft': '0.538%', 'NathanielHawthorne': '1.965%', 'VirginiaWoolf': '3.345%'}

Doc2Vec

For this project, gensim's Doc2Vec was used to vectorize the corpuses.
Following are the hyperparameters chosen for this model:

vec_size = 20
min_count = 2
epochs = 20
alpha = 0.025

Initially, when vectorizing corpuses with Doc2Vec, labels (author, sex, literary period) were assigned to each of the corpuses. Cosine similarity was used to find the most similar label vectors for each corpus:
Top 10 most similar vectors to the sample corpus by Nathaniel Hawthorne (male, gothic/romantic):

Female: 0.6568350791931152
Gothic/Romantic: 0.4545186161994934
Male: 0.43780285120010376
Nathaniel Hawthorne: 0.4020436406135559
Jane Austin: 0.35304591059684753
John Locke: 0.3185476064682007
Enlightenment: 0.3154699206352234
Edith Wharton: 0.28648972511291504
Victorian: 0.22939543426036835
Naturalism: 0.20019471645355225

Then Principal Component Analysis (PCA) was applied to the corpus/label vectors to reduce the dimensionality from 20 to 3 in order to visualize the vectors.
Explained variance ratio was 41.47%; 58.53% was lost by reducing the dimensionality.
The above 3D plot shows that corpus vectors are quite distinct for each author, gender and literary period. However, it was suspected that using authors, genders, and literary periods as corpus labels caused certain data leakage/bias to occur. Hence, instead of those labels, unique IDs (integers from 0 to the number of corpuses) were used as labels for the rest of the project so that each corpus would be vectorized in such a way that it was unaware of its label.
After vectorizing the corpuses in this way, again PCA was applied to them to reduce the dimensionality from 20 to 3. This time, explained variance ratio was 36.81% and 63.19% was lost by reducing the dimensionality.
Interestingly, corpus vectors seem to be pretty distinct for each gender. Also John Locke's writing appears to be especially unique from other authors.
Once the corpus has been vectorized, multiple deep learning and shallow learning methods were used to detect the authors, genders, and literary periods of each corpus.

Multilayer Perceptron (MLP)
As there were not enough rows for multi-class classification, MLP was used only for gender classification (binary classification).
MLP Architecture

2 Layer Multilayer Perceptron

Early Stopping

L1, L2, Dropout

Shallow Learning Models
We also ran a series of shallow learning models for comparison, particularly given that we had a smaller dataset (we did not have 1,000 instances of author or period classes). We ran Naive Bayes, Random Forest, AdaBoost, and K-Neighbors. We tried bag of words vectorization with unigrams and bigrams and Tfidf vectorization with unigrams and bigrams.

Our highest score for the author's gender detection was a Bernoulli Naive Bayes using bag of words with bigram, which accurately predicted the author's gender on the test with 90.4% accuracy.

For a corpus' literary period, our best model on cross-validation was: Random guess chance - 14%.

And for predicting on individual author, our best model on cross-validation was: Random guess chance - 7%

Conclusion
Overall, shallow learning models performed better than deep learning model.

NaokoSuga / authorship_detection

Authorship Detection

14 American and British authors (7 male, 7 female), 2 books each, from 7 different literary periods:

Preprocessing

Doc2Vec

Multilayer Perceptron (MLP)

MLP Architecture

2 Layer Multilayer Perceptron

Early Stopping

L1, L2, Dropout

Shallow Learning Models

Conclusion

About

Languages