This repository contains the 2023/24 quantitative text analysis module of the Methods course in the Research Master's programme in History at Utrecht University (see: https://www.uu.nl/en/masters/history). The course manual is in the folder of the same name.
The Notebooks folder contains a series of Jupyter Notebooks that introduce quantitative text analysis (text mining); they are structured as listed below. The Sample data folder contains some example text files (see below). Most notebooks take .txt files as input, but can easily be adapted to import .csv files. Ideally, the text files form a chronological series, with each file named for the year it represents (for example '1981.txt', '1982.txt', etc.).
Most of the code is my own; where I copied code from other projects, the notebooks link to the source. Doing things with text 1 is based on code written by Berit Jansen (Research Software Lab, Utrecht University). Brecht Nijman contributed to Doing things with text 4. Thanks to both!
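
To illustrate the naming convention, the sketch below reads every .txt file in a folder and orders the files by the year in their names. The function name `load_yearly_texts` is hypothetical and not taken from the notebooks, and the sketch assumes UTF-8 files whose names consist of a year only.

```python
from pathlib import Path


def load_yearly_texts(folder):
    """Return {year: text} for files named like '1981.txt', sorted by year.

    Hypothetical helper for illustration; the notebooks may load files differently.
    """
    texts = {}
    for path in Path(folder).glob("*.txt"):
        if path.stem.isdigit():  # skip files that are not named for a year
            texts[int(path.stem)] = path.read_text(encoding="utf-8")
    # Sort by year so that later steps see the files in chronological order
    return dict(sorted(texts.items()))
```

Keying the result by an integer year rather than by file name keeps the chronological order explicit, which is convenient when plotting frequencies over time.
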
Doing things with text 1 - Preprocessing of a single text file (.txt):
- lowercase the text and remove HTML, punctuation, numbers, short words and stopwords
- save cleaned text to file
- basic statistics of text
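
The preprocessing steps above can be sketched roughly as follows, using only the standard library. This is not the notebook's code: the stopword set is a tiny illustrative sample, the tag-stripping regular expression is deliberately crude, the minimum word length of 3 is an assumption, and the function names are hypothetical. Saving the cleaned text to a file is left out.

```python
import html
import re

# Illustrative sample only; a real stopword list would be much longer
STOPWORDS = {"the", "and", "for", "with", "that", "this"}


def clean_text(raw, stopwords=STOPWORDS, min_length=3):
    """Return a list of cleaned tokens from a raw string."""
    text = re.sub(r"<[^>]+>", " ", raw)      # strip HTML tags (crude)
    text = html.unescape(text)               # turn entities such as &amp; into characters
    text = text.lower()
    text = re.sub(r"[\W\d_]+", " ", text)    # replace punctuation and numbers with spaces
    return [
        token
        for token in text.split()
        if len(token) >= min_length and token not in stopwords
    ]


def basic_stats(tokens):
    """Return token count, type count and type-token ratio."""
    n_tokens = len(tokens)
    n_types = len(set(tokens))
    ratio = n_types / n_tokens if n_tokens else 0.0
    return {"tokens": n_tokens, "types": n_types, "type_token_ratio": ratio}
```

The `[\W\d_]+` pattern keeps accented letters, which matters for non-English sources.
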
Doing things with text 2 - Word counts on a single, preprocessed text file (.txt):
- most common words as bar chart
- most common words as word cloud
- most common words by word length
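
The counting behind these charts can be sketched with `collections.Counter`; plotting is omitted. The function name is hypothetical, and the sketch reads "by word length" as counting only words of one given length, which is an assumption about what the notebook does.

```python
from collections import Counter


def most_common_words(tokens, n=10, length=None):
    """Return the n most frequent tokens as (word, count) pairs.

    If length is given, only tokens of exactly that many characters are counted.
    """
    if length is not None:
        tokens = [token for token in tokens if len(token) == length]
    return Counter(tokens).most_common(n)
```

The resulting list of (word, count) pairs can be passed to a plotting library of choice to draw a bar chart.
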
Doing things with text 3a - Preprocessing and word counts on multiple text files (.txt, raw or preprocessed):
- same as Doing things with text 1 and 2 but for multiple text files
Doing things with text 3b - Preprocessing and word counts on multiple .csv files (raw text):
- same as Doing things with text 1 and 2 but for one or more .csv files
Doing things with text 4 - Text analysis (for multiple .txt files, preprocessed):
- plot word / n-gram frequency per file in a scatter plot
- print and save collocations (log-likelihood, PMI, raw frequency) of one or more keywords per file
- print and save top n-grams per file
- print and save top n-grams per file starting or ending with a given keyword
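
A standard-library sketch of the n-gram and collocation steps above, for one tokenised file. It is not the notebook's implementation: the function names are hypothetical, only bigram collocations are covered, and PMI is computed as log2(P(x, y) / (P(x) P(y))) with probabilities estimated from raw counts. A collocation library may normalise slightly differently, and log-likelihood is not shown.

```python
import math
from collections import Counter


def ngrams(tokens, n=2):
    """Return all n-grams in a token list as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))


def top_ngrams(tokens, n=2, k=10, keyword=None, position="start"):
    """Return the k most frequent n-grams, optionally starting or ending with keyword."""
    grams = ngrams(tokens, n)
    if keyword is not None:
        index = 0 if position == "start" else -1
        grams = [gram for gram in grams if gram[index] == keyword]
    return Counter(grams).most_common(k)


def pmi_collocations(tokens, keyword, min_count=1):
    """Return bigrams containing keyword, sorted by pointwise mutual information."""
    unigrams = Counter(tokens)
    bigrams = Counter(ngrams(tokens, 2))
    n_tokens = len(tokens)
    n_bigrams = sum(bigrams.values())
    scores = {}
    for (first, second), count in bigrams.items():
        if keyword not in (first, second) or count < min_count:
            continue
        p_xy = count / n_bigrams
        p_x = unigrams[first] / n_tokens
        p_y = unigrams[second] / n_tokens
        scores[(first, second)] = math.log2(p_xy / (p_x * p_y))
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

PMI tends to favour rare word pairs, so raising `min_count` is usually needed on real data.
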
Doing things with text 5 - tf-idf with gensim (for multiple .txt files, preprocessed):
- plot top distinct words (tf-idf) per file in a bar chart
- create heatmap for cosine similarity
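
To show what is being computed here, the sketch below implements tf-idf and cosine similarity in plain Python rather than with gensim. The function names are hypothetical, the weighting (raw term frequency times log2(N / document frequency)) is one common variant, and gensim's own weighting and normalisation may differ. Plotting is omitted.

```python
import math
from collections import Counter


def tfidf(docs):
    """Compute tf-idf weights.

    docs maps a file name to its list of tokens; the result maps each
    file name to a {word: weight} dictionary.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))  # count each word once per file
    weights = {}
    for name, tokens in docs.items():
        term_freq = Counter(tokens)
        weights[name] = {
            word: count * math.log2(n_docs / doc_freq[word])
            for word, count in term_freq.items()
        }
    return weights


def cosine(a, b):
    """Cosine similarity between two sparse {word: weight} vectors."""
    dot = sum(value * b.get(word, 0.0) for word, value in a.items())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

A word that occurs in every file gets weight 0, which is why tf-idf highlights the words that are distinctive for a single file. Computing `cosine` for every pair of files yields the matrix shown in the heatmap.
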
Doing things with text 6 - Part-of-speech tagging with spaCy (for multiple .txt files, preprocessed):
- print most common words of a particular type (adjective, verb, (proper) noun) per file
- print most common named entities per file
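
The counting step for words of a particular type can be sketched independently of the tagger. The function name and the hand-tagged toy input below are hypothetical; with spaCy, such (word, tag) pairs could be built from `token.text` and `token.pos_` for each token in a processed document. Named entities are not covered.

```python
from collections import Counter


def most_common_by_pos(tagged_tokens, pos, n=10):
    """Return the n most frequent words carrying the given part-of-speech tag.

    tagged_tokens is an iterable of (word, tag) pairs using Universal POS
    tags such as "ADJ", "VERB", "NOUN" or "PROPN".
    """
    counts = Counter(word for word, tag in tagged_tokens if tag == pos)
    return counts.most_common(n)


# Hand-tagged toy input for illustration, not real tagger output
tagged = [
    ("Luke", "PROPN"), ("sees", "VERB"), ("dark", "ADJ"), ("ship", "NOUN"),
    ("Luke", "PROPN"), ("runs", "VERB"), ("Leia", "PROPN"),
]
proper_nouns = most_common_by_pos(tagged, "PROPN", n=2)
```
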
Doing things with text 7 - Word embeddings with gensim's word2vec (for multiple .txt files, raw or preprocessed):
- train word2vec model on dataset
- search most similar terms for one or more keywords
- plot most similar terms as clusters in a t-SNE plot
Sample data:
- screenplays for Star Wars I - VII as .txt
- screenplays for a series of movies about science/scientists as .csv