computational-literary-studies corpus-data supervised-learning support-vector-machines

Identifying Crosswriters’ Altering Style in Books for Children and Adults Using Supervised Machine Learning

This repo contains the code (not the data!) written as part of the Computational Literary Studies (CLS) final project at the University of Antwerp.

The objective was to identify the differences (if any) in the writing style of authors who write books for both children and adults ("crosswriters"), focusing only on content words.

The paper is available in PDF format.

Development

Dependencies

Standard library:

__future__
os
re
glob
typing
pprint

Third-party:

NumPy
pandas
Matplotlib
seaborn
scikit-learn
Transformers

Environment

Windows 11 + WSL
Python 3.9.12 (virtualenv)

Abstract

Stylometry is the quantitative study of literary style through computational distant-reading methods. It rests on the observation that authors tend to write in relatively consistent, recognisable, and unique ways (Laramée, 2018). Similarities and differences in style, content, and genre between literature for children and literature for adults have long been on the radar of researchers in Computational Literary Studies. Only recently, however, have the implications of cross-writing (i.e., writing works for different reader age groups) received attention. In this study, supervised machine learning methods were applied to better understand whether and how such authors (“crosswriters”) alter their style when targeting a different age group, based entirely on content words. The study was conducted on five English authors. The SVM models reach a macro F1 score of .73 when predicting the age group using all texts, and .93 on average for each of the authors individually. To achieve these results, it was essential to overcome overfitting on the characters of the stories. This was addressed by (a) adding a Named Entity Recognition (NER) step to the preprocessing pipeline and (b) leaving at least one book by each author out of the training set entirely in each cross-validation fold.
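The book-level hold-out described in the abstract can be sketched in plain Python. This is an illustration under stated assumptions, not the project's actual splitting code: the `Segment` record layout, the `book_level_folds` helper, and the toy data are all hypothetical. scikit-learn's group-aware splitters (such as `GroupKFold`) cover related ground, but the sketch below stays dependency-free.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple

# Hypothetical segment record: (author, book_title, segment_text).
Segment = Tuple[str, str, str]


def book_level_folds(
    segments: List[Segment], n_folds: int
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, test_idx) pairs in which whole books are held out.

    In fold k, each author contributes exactly one book to the test set:
    the book at position k modulo that author's number of books. No segment
    of a held-out book ever appears in the training indices of that fold.
    Note: an author with a single book would never appear in training.
    """
    # Collect each author's books in first-seen order.
    books_by_author: Dict[str, List[str]] = defaultdict(list)
    for author, book, _ in segments:
        if book not in books_by_author[author]:
            books_by_author[author].append(book)

    for k in range(n_folds):
        held_out = {
            (author, books[k % len(books)])
            for author, books in books_by_author.items()
        }
        test_idx = [
            i for i, (a, b, _) in enumerate(segments) if (a, b) in held_out
        ]
        train_idx = [
            i for i, (a, b, _) in enumerate(segments) if (a, b) not in held_out
        ]
        yield train_idx, test_idx


# Toy data: two authors, two books each, several segments per book.
toy_segments: List[Segment] = [
    ("Author A", "Book 1", "segment text"),
    ("Author A", "Book 1", "segment text"),
    ("Author A", "Book 2", "segment text"),
    ("Author B", "Book 3", "segment text"),
    ("Author B", "Book 4", "segment text"),
    ("Author B", "Book 4", "segment text"),
]

toy_folds = list(book_level_folds(toy_segments, n_folds=2))
```

Splitting by book rather than by segment matters because segments of the same book share character names and setting; a random segment-level split would let a classifier score well by memorising those cues.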

Exploratory Data Analysis

The authors whose texts were examined are:

David Almond
Anne Fine
Neil Gaiman
Philip Pullman
J.K. Rowling

The images are light- and dark-mode aware! Try it out through your appearance settings.

Corpus

Number of books per gender of authors:

Number of books per reader age group:

Number of segments per author and reader age group:

Distribution of total words:

Type-token ratio:
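
For reference, the type-token ratio (TTR) is the number of distinct word forms (types) divided by the total number of words (tokens). A minimal sketch follows; the naive regex tokeniser and the sample sentence are assumptions made for illustration, and the notebooks may tokenise differently.

```python
import re
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and extract runs of letters and apostrophes.

    A deliberately naive tokeniser, used here only for illustration.
    """
    return re.findall(r"[a-z']+", text.lower())


def type_token_ratio(tokens: List[str]) -> float:
    """Return distinct tokens divided by total tokens (0.0 for empty input)."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


sample = "The cat sat on the mat and the dog sat too"
sample_tokens = tokenize(sample)              # 11 tokens, 8 distinct types
sample_ttr = type_token_ratio(sample_tokens)  # 8 / 11
```

TTR tends to fall as a text gets longer, so it is usually compared across samples of similar length.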

Authors

Publications per author over time

Results

Author	Pre-NER Acc.	Pre-NER F1	Post-NER Acc.	Post-NER F1
David Almond	.914	.795	.933	.857
Anne Fine	.843	.800	.979	.976
Neil Gaiman	.939	.928	.931	.916
Philip Pullman	.918	.771	.962	.906
J.K. Rowling	.991	.991	.999	.999
All authors	.764	.680	.788	.734
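
The table reports both accuracy and F1 (macro-averaged, per the abstract). The two can diverge noticeably when one age group has far more segments than the other, which is one plausible reason for gaps such as .918 vs .771. The sketch below computes both metrics from scratch on made-up labels; the label names `"child"` and `"adult"` and the toy predictions are assumptions, not project data.

```python
from typing import Sequence


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_for_class(y_true: Sequence[str], y_pred: Sequence[str], label: str) -> float:
    """F1 for one class: 2*TP / (2*TP + FP + FN), or 0.0 if undefined."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, lab) for lab in labels) / len(labels)


# Imbalanced toy example: 8 "child" segments, 2 "adult" segments,
# and a classifier that always predicts the majority class.
toy_true = ["child"] * 8 + ["adult"] * 2
toy_pred = ["child"] * 10

toy_accuracy = accuracy(toy_true, toy_pred)  # 8 of 10 correct
toy_macro_f1 = macro_f1(toy_true, toy_pred)  # mean of 16/18 and 0
```

Because macro F1 weights each class equally, a model that ignores the minority class is penalised heavily even when its accuracy looks respectable.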

Citation

@article{boumparis2022crosswriters,
    title = {Identifying Crosswriters' Altering Style in Books for Children and Adults Using Supervised Machine Learning},
    author = {{Dimitris Boumparis}},
    organization = {{University of Antwerp}},
    year = {2022},
    url = {https://github.com/dimboump/crosswriters}
}

About

Code for final assignment for CLS course at the University of Antwerp (SoSe 2022)
