PimHuijnen / QTA2023

Repository for the quantitative text analysis module of the Utrecht University RMA History's Methods course

QTA2023

This is the repository for the 2023/24 quantitative text analysis module of the Methods course in the research master's programme in History at Utrecht University (see: https://www.uu.nl/en/masters/history). The course manual is in the folder of the same name.

The Notebooks folder contains a series of Jupyter Notebooks that provide an introduction to quantitative text analysis (text mining); they are structured as listed below. The Sample data folder contains some example text files (see below). Most notebooks take .txt files as input, but can easily be adapted to import .csv files. Ideally, the text files form a chronological series and are named for the year they represent (for example '1981.txt', '1982.txt', etc.).
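As an illustration of the naming convention, here is a minimal sketch of how year-named .txt files could be loaded in chronological order. The function name `load_year_files` is hypothetical and not taken from the notebooks, which may load files differently.

```python
from pathlib import Path


def load_year_files(folder):
    """Return {year: text} for files named like '1981.txt', oldest first.

    Hypothetical helper for illustration; not part of the notebooks.
    """
    texts = {}
    for path in Path(folder).glob("*.txt"):
        # Skip files whose name is not a plain year such as '1981'.
        if not path.stem.isdigit():
            continue
        texts[int(path.stem)] = path.read_text(encoding="utf-8")
    # Sorting on the integer year keeps the corpus in chronological order.
    return dict(sorted(texts.items()))
```

Calling `load_year_files("Sample data")` would then give one text per year, assuming that folder holds year-named files.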

Most of the code is my own; where I copied code from other projects, the notebooks link to them. Doing things with text 1 is based on code written by Berit Jansen (Research Software Lab, Utrecht University). Brecht Nijman contributed to Doing things with text 4. Thanks to both!

Notebooks

Doing things with text 1 - Preprocessing of a single text file (.txt):

  • remove HTML, punctuation, numbers, short words and stopwords; convert to lowercase
  • save cleaned text to file
  • basic statistics of text
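
The preprocessing steps above can be sketched with the standard library alone. This is a rough approximation under stated assumptions, not the notebook's code: the stopword list is a tiny hypothetical placeholder, and the names `preprocess` and `basic_stats` are made up for this example.

```python
import html
import re

# Hypothetical, deliberately tiny stopword list; a real one would be much longer.
STOPWORDS = {"the", "and", "was", "that", "with", "for"}


def preprocess(raw, min_length=3, stopwords=STOPWORDS):
    """Clean a raw text and return a list of lowercase tokens."""
    text = re.sub(r"<[^>]+>", " ", raw)  # strip HTML tags
    text = html.unescape(text)           # decode entities such as &amp;
    text = text.lower()
    # Keep runs of letters only, which drops punctuation and numbers.
    tokens = re.findall(r"[^\W\d_]+", text)
    # Remove short words and stopwords.
    return [t for t in tokens if len(t) >= min_length and t not in stopwords]


def basic_stats(tokens):
    """Return token count, type count and type/token ratio."""
    n_tokens = len(tokens)
    n_types = len(set(tokens))
    ratio = n_types / n_tokens if n_tokens else 0.0
    return {"tokens": n_tokens, "types": n_types, "type_token_ratio": ratio}
```

The cleaned text can then be saved by writing `" ".join(tokens)` to a new .txt file.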

Doing things with text 2 - Word counts on a single, preprocessed text file (.txt):

  • most common words as bar chart
  • most common words as word cloud
  • most common words by word length
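
A minimal sketch of the counting behind these charts, leaving out the plotting itself. The function names are hypothetical, and "by word length" is read here as the top words within each length, which is only one possible interpretation of what the notebook does.

```python
from collections import Counter


def most_common_words(tokens, n=10):
    """Return the n most frequent words with their counts."""
    return Counter(tokens).most_common(n)


def most_common_by_length(tokens, n=3):
    """Return {word_length: [(word, count), ...]} with the top n words per length."""
    by_length = {}
    # most_common() yields words from most to least frequent.
    for word, count in Counter(tokens).most_common():
        bucket = by_length.setdefault(len(word), [])
        if len(bucket) < n:
            bucket.append((word, count))
    return dict(sorted(by_length.items()))
```

The resulting (word, count) pairs are the kind of input a bar chart or word cloud would be drawn from.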

Doing things with text 3a - Preprocessing and word counts on multiple text files (.txt, raw or preprocessed):

  • same as Doing things with text 1 and 2 but for multiple text files

Doing things with text 3b - Preprocessing and word counts on multiple .csv files (raw text):

  • same as Doing things with text 1 and 2 but for one or more .csv files
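
For .csv input, the raw text first has to be pulled out of a column. A minimal sketch with the standard `csv` module follows; the column name `text` and the function name `read_csv_texts` are assumptions, since the actual layout of the .csv files is not described here.

```python
import csv


def read_csv_texts(path, text_column="text"):
    """Return the contents of one column of a .csv file as a list of strings.

    'text' is an assumed column name; adjust it to the actual file.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        return [row[text_column] for row in csv.DictReader(handle)]
```

Each returned string can then be preprocessed like a single .txt file. Very long fields, such as full screenplays, may require raising the limit with `csv.field_size_limit` first.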

Doing things with text 4 - Text analysis (for multiple .txt files, preprocessed):

  • plot word / n-gram frequency per file in a scatter plot
  • print and save collocations (log likelihood, PMI, raw frequency) of one or more keywords per file
  • print and save top n-grams per file
  • print and save top n-grams per file starting or ending with a given keyword
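
The n-gram and collocation steps can be sketched in plain Python. This is an illustration under assumptions, not the notebook's implementation: all function names are hypothetical, only PMI is shown (not log likelihood), and collocations are limited to directly adjacent word pairs.

```python
import math
from collections import Counter


def ngrams(tokens, n=2):
    """Return all n-grams of a token list as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))


def top_ngrams(tokens, n=2, k=10, keyword=None, position="start"):
    """Return the k most frequent n-grams, optionally starting or ending with keyword."""
    grams = ngrams(tokens, n)
    if keyword is not None:
        index = 0 if position == "start" else -1
        grams = [g for g in grams if g[index] == keyword]
    return Counter(grams).most_common(k)


def pmi_collocations(tokens, keyword, min_count=1):
    """Score adjacent word pairs containing keyword by pointwise mutual information."""
    word_counts = Counter(tokens)
    bigram_counts = Counter(ngrams(tokens, 2))
    n_words = len(tokens)
    n_bigrams = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigram_counts.items():
        if keyword not in (w1, w2) or count < min_count:
            continue
        p_pair = count / n_bigrams
        p_w1 = word_counts[w1] / n_words
        p_w2 = word_counts[w2] / n_words
        # PMI compares the observed pair probability with chance co-occurrence.
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

PMI tends to favour rare pairs, so raising `min_count` is a common way to filter out pairs that occur only once.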

Doing things with text 5 - tf-idf with gensim (for multiple .txt files, preprocessed):

  • plot top distinct words (tf-idf) per file in a bar chart
  • create heatmap for cosine similarity
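
To show what tf-idf and cosine similarity compute, here is a plain-Python sketch. It deliberately does not use gensim, unlike the notebook, and it uses one common weighting variant (relative term frequency times the natural log of inverse document frequency), so the numbers will differ from gensim's output. The function names are hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """docs: {name: list of tokens}. Return {name: {word: tf-idf weight}}."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))  # count each word once per document
    weights = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)
        weights[name] = {
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        }
    return weights


def cosine(vec_a, vec_b):
    """Cosine similarity between two sparse {word: weight} vectors."""
    dot = sum(value * vec_b.get(word, 0.0) for word, value in vec_a.items())
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Words that occur in every file get a weight of zero, which is why tf-idf highlights the words that are distinctive for one file. Computing `cosine` for every pair of files gives the matrix a heatmap would display.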

Doing things with text 6 - part-of-speech with spaCy (for multiple .txt files, preprocessed):

  • print most common words of a particular type (adjective, verb, (proper) noun) per file
  • print most common named entities per file

Doing things with text 7 - word embeddings with gensim's word2vec (for multiple .txt files, raw or preprocessed):

  • train word2vec model on dataset
  • search most similar terms for one or more keywords
  • plot most similar terms as clusters in a t-SNE plot

Sample data

  • screenplays for Star Wars I - VII as .txt
  • screenplays for a series of movies about science/scientists as .csv
