QTA2023

This is the repository for the 2023/24 quantitative text analysis module of the Methods course in the Research Master's programme in History at Utrecht University (see: https://www.uu.nl/en/masters/history). The course manual is in the folder of the same name.

The Notebooks folder contains a series of Jupyter Notebooks that provide an introduction to quantitative text analysis (text mining). The notebooks are structured as listed below. The Sample data folder contains some example text files (see below). Most notebooks take .txt files as input, but can be adapted to import .csv files. Ideally, the text files form a chronological series and each file is named for the year it represents (for example '1981.txt', '1982.txt', etc.).
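As an illustration of the file naming convention above, here is a minimal sketch of how year-named text files could be read in chronological order. The function name `load_corpus` and the choice to skip files whose name is not a number are assumptions made for this example; the notebooks may load files differently.

```python
from pathlib import Path


def load_corpus(folder):
    """Return {year: text} for every file named like '1981.txt' in folder.

    Hypothetical helper for illustration, not taken from the notebooks.
    """
    corpus = {}
    for path in Path(folder).glob("*.txt"):
        # Skip files whose name is not a plain number such as '1981'.
        if not path.stem.isdigit():
            continue
        corpus[int(path.stem)] = path.read_text(encoding="utf-8")
    # Sort by year so the corpus is in chronological order.
    return dict(sorted(corpus.items()))
```

Calling `load_corpus("Sample data")` would then return a dictionary keyed by year, assuming that folder holds files named in this way.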

Most of the code is my own; where I copied code from other projects, the notebooks link to the source. Doing things with text 1 is based on code that Berit Jansen (Research Software Lab, Utrecht University) wrote. Brecht Nijman contributed to Doing things with text 4. Thanks to both!

Notebooks

Doing things with text 1 - Preprocessing of a single text file (.txt):

remove html, punctuation, numbers, short words, stopwords; lowercase
save cleaned text to file
basic statistics of text
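The preprocessing steps listed above can be sketched with the Python standard library alone. This is a simplified illustration, not the notebook's code: the stopword list is a tiny made-up sample, and the names `preprocess`, `basic_stats` and `min_length` are assumptions for this example.

```python
import html
import re

# Tiny illustrative stopword list (an assumption; a real list is much longer).
STOPWORDS = {"the", "and", "for", "that", "with", "this", "are", "was"}


def preprocess(raw, min_length=3, stopwords=STOPWORDS):
    """Return a list of cleaned, lowercased tokens."""
    text = re.sub(r"<[^>]+>", " ", raw)       # remove html tags
    text = html.unescape(text)                # decode entities such as &amp;
    text = text.lower()
    text = re.sub(r"[\W\d_]+", " ", text)     # remove punctuation and numbers
    # Drop short words and stopwords.
    return [t for t in text.split() if len(t) >= min_length and t not in stopwords]


def basic_stats(tokens):
    """Token count, number of distinct words, and their ratio."""
    types = len(set(tokens))
    ratio = types / len(tokens) if tokens else 0.0
    return {"tokens": len(tokens), "types": types, "type_token_ratio": ratio}
```

To save the cleaned text, the tokens can be joined with spaces and written to a new .txt file.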

Doing things with text 2 - Word counts on a single, preprocessed text file (.txt):

most common words as bar chart
most common words as word cloud
most common words by word length
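The counting behind these charts could look roughly like the sketch below. It covers only the counting, not the bar chart or word cloud, and the function name `most_common_words` and its `length` parameter are assumptions for this example.

```python
from collections import Counter


def most_common_words(tokens, n=10, length=None):
    """Top n (word, count) pairs, optionally only for words of a given length."""
    if length is not None:
        tokens = [t for t in tokens if len(t) == length]
    return Counter(tokens).most_common(n)
```

The resulting list of pairs can be passed on to a plotting library of choice.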

Doing things with text 3a - Preprocessing and word counts on multiple text files (.txt, raw or preprocessed):

same as Doing things with text 1 and 2 but for multiple text files

Doing things with text 3b - Preprocessing and word counts on multiple .csv files (raw text):

same as Doing things with text 1 and 2 but for one or more .csv files

Doing things with text 4 - Text analysis (for multiple .txt files, preprocessed):

plot word / n-gram frequency per file in a scatter plot
print and save collocations (log likelihood, PMI, raw frequency) of one or more keywords per file
print and save top n-grams per file
print and save top n-grams per file starting or ending with a given keyword
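To make the n-gram and collocation steps above more concrete, here is a minimal standard-library sketch. It covers raw frequency and pointwise mutual information (PMI) for adjacent word pairs only; log likelihood is left out. All function and parameter names are assumptions for this example, and the notebook may rely on a library and a wider collocation window instead.

```python
import math
from collections import Counter


def ngrams(tokens, n=2):
    """All runs of n consecutive tokens, as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))


def top_ngrams(tokens, n=2, k=10, keyword=None, position="start"):
    """Top k n-grams by raw frequency, optionally starting or ending with keyword."""
    grams = ngrams(tokens, n)
    if keyword is not None:
        index = 0 if position == "start" else -1
        grams = [g for g in grams if g[index] == keyword]
    return Counter(grams).most_common(k)


def pmi_collocations(tokens, keyword):
    """PMI of each bigram containing keyword: log2(p(x, y) / (p(x) * p(y)))."""
    word_counts = Counter(tokens)
    bigram_counts = Counter(ngrams(tokens, 2))
    n_words = len(tokens)
    n_bigrams = max(len(tokens) - 1, 1)
    scores = {}
    for (x, y), count in bigram_counts.items():
        if keyword not in (x, y):
            continue
        p_xy = count / n_bigrams
        p_x = word_counts[x] / n_words
        p_y = word_counts[y] / n_words
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    # Highest scoring pairs first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

PMI tends to favour rare pairs, so in practice a minimum frequency threshold is often applied before ranking.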

Doing things with text 5 - tf-idf with gensim (for multiple .txt files, preprocessed):

plot top distinct words (tf-idf) per file in a bar chart
create heatmap for cosine similarity
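The idea behind both steps can be shown without gensim. The sketch below is a plain-Python stand-in, not the notebook's gensim code: it weights each word by its count times log(N / document frequency), and gensim's own weighting and normalisation may differ. The names `tfidf` and `cosine` are assumptions for this example.

```python
import math
from collections import Counter


def tfidf(docs):
    """docs maps a file name to its tokens; returns {name: {word: weight}}."""
    n_docs = len(docs)
    # Document frequency: in how many files does each word occur?
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))
    weights = {}
    for name, tokens in docs.items():
        tf = Counter(tokens)
        weights[name] = {w: c * math.log(n_docs / df[w]) for w, c in tf.items()}
    return weights


def cosine(a, b):
    """Cosine similarity between two {word: weight} dictionaries."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

A word that occurs in every file gets weight 0, which is why tf-idf brings out the words that are distinctive for a file. Computing `cosine` for every pair of files gives the matrix that a heatmap would display.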

Doing things with text 6 - part-of-speech tagging and named entities with spaCy (for multiple .txt files, preprocessed):

print most common words of a particular type (adjective, verb, (proper) noun) per file
print most common named entities per file

Doing things with text 7 - word embeddings with gensim's word2vec (for multiple .txt files, raw or preprocessed):

train word2vec model on dataset
search most similar terms for one or more keywords
plot most similar terms as clusters in a t-SNE plot

Sample data

screenplays for Star Wars I - VII as .txt
screenplays for a series of movies about science/scientists as .csv

PimHuijnen / QTA2023
