Fork of D-ETM repo for working with Old Bailey data.

D-ETM

This repository is a fork of code accompanying the paper titled "The Dynamic Embedded Topic Model" by Adji B. Dieng, Francisco J. R. Ruiz, and David M. Blei. (Arxiv link: https://arxiv.org/abs/1907.05545). The code is adapted to work with the Proceedings of Old Bailey in this project.

Steps to run

All the following steps are intended to be run on the TSV data representation of the Proceedings of Old Bailey. Go here to see information about obtaining this data.

Data

The first step is to divide the data into time slices. This is done by rewriting the second column of the TSV file, which indicates the year. The data can be split into time slices of either 10 or 100 years. Run the following:

cd scripts
./make-slices.py BASE_DIR YEAR_SPLIT

This will collect all files in BASE_DIR with the extension .tsv and write a new file to the same directory with the extension .tsv-decades or .tsv-centuries, depending on the value of YEAR_SPLIT, for each TSV file found.
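For reference, the slicing amounts to truncating each year to the start of its decade or century. The following is a rough Python sketch of that logic, not the actual make-slices.py implementation; the output naming and column handling here are assumptions based on the description above.

import csv

def slice_year(year, year_split):
    # Truncate a year to the start of its decade or century,
    # e.g. 1764 -> 1760 when year_split is 10, or 1700 when it is 100.
    return (year // year_split) * year_split

def make_slices(tsv_path, year_split):
    # Assumed output naming: corpus.tsv -> corpus.tsv-decades or corpus.tsv-centuries.
    out_path = tsv_path + ("-decades" if year_split == 10 else "-centuries")
    with open(tsv_path, newline="") as fin, open(out_path, "w", newline="") as fout:
        reader = csv.reader(fin, delimiter="\t")
        writer = csv.writer(fout, delimiter="\t")
        for row in reader:
            row[1] = str(slice_year(int(row[1]), year_split))  # second column holds the year
            writer.writerow(row)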

Next, use the scripts/data_ob.py to process the data and write several necessary files to be used by D-ETM.

./data_ob.py TSV_CORPUS [MAX_DF] [MIN_DF]

The TSV_CORPUS argument should be a file output by scripts/make-slices.py. The other two options are thresholds for filtering the corpus vocabulary. MAX_DF is a proportion between 0 and 1: the maximum fraction of documents a word may appear in and still be kept in the vocabulary. MIN_DF is an integer between 0 and the number of documents in the corpus: the minimum number of documents a word must appear in to be kept in the vocabulary. This code outputs several files to a directory named scripts/TSV_CORPUS/min_df_MIN_DF_max_df_MAX_DF. See scripts/make_data.sh for more examples of how this is run.
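The MAX_DF/MIN_DF thresholds follow the usual document-frequency convention. As a minimal illustration of the filtering idea (not the exact logic in data_ob.py), using scikit-learn's CountVectorizer on toy documents:

from sklearn.feature_extraction.text import CountVectorizer

# Toy documents standing in for the Old Bailey corpus.
docs = ["the prisoner was found guilty",
        "the prisoner was acquitted",
        "a witness described the theft"]

# max_df=0.7: drop words that appear in more than 70% of documents;
# min_df=2:   drop words that appear in fewer than 2 documents.
vectorizer = CountVectorizer(max_df=0.7, min_df=2)
counts = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())  # the filtered vocabulary, e.g. ['prisoner', 'was']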

Run D-ETM

To train the model and generate the embeddings, use ./run_detm.sh. The data file to use can be passed as the first argument, but to change the model parameters you must edit the variables within the file itself.

./run_detm.sh TSV_CORPUS

First, this script runs skipgram.py to generate word embeddings and saves them to a file in data/TSV_CORPUS-embed. Use the dump_w2v.py script to generate files you can use to visualize these embeddings with projector.tensorflow.org. Next, the script creates a directory called ./results/TSV_CORPUS which will contain the output of the model.
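For intuition, the embedding step is essentially skip-gram word2vec training followed by dumping vectors in a format the projector accepts. A rough sketch using gensim (the actual skipgram.py and dump_w2v.py may use different tokenization, hyperparameters, and file formats):

from gensim.models import Word2Vec

# Toy tokenized documents standing in for the Old Bailey corpus.
sentences = [["the", "prisoner", "was", "found", "guilty"],
             ["the", "prisoner", "was", "acquitted"]]

# sg=1 selects the skip-gram objective; vector_size is the embedding dimension.
model = Word2Vec(sentences, sg=1, vector_size=100, window=5, min_count=1, workers=4)

# Write vectors and labels as TSV files that projector.tensorflow.org can load.
with open("vectors.tsv", "w") as vf, open("metadata.tsv", "w") as mf:
    for word in model.wv.index_to_key:
        vf.write("\t".join(f"{x:.5f}" for x in model.wv[word]) + "\n")
        mf.write(word + "\n")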

Visualize Results

To visualize the evolution of topics over the course of the data, run ./plot_word_evolution.py. Input the data directory generated from the scripts/data_ob.py step and the beta file generated by the run_detm.sh step. This is the file in the results/TSV_CORPUS directory ending in beta.mat. The --words_per_slice argument indicates how many top words from the topic being visualized you wish to plot from each of three distinct time slices.

./plot_word_evolution.py --data_dir=scripts/TSV_DATA/min_df_MIN_DF_max_df_MAX_DF/ --beta_file=results/TSV_CORPUS/detm_PARAMETERS_beta.mat --words_per_slice=3

This creates a subdirectory results/TSV_CORPUS/word_evolutions/BETA_FILE and saves all the topic plots within that directory.
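Conceptually, each plot tracks a word's probability under one topic across the time slices stored in the beta file. A minimal sketch of that idea follows; the .mat key name and the (topics, times, vocabulary) array layout are assumptions here, so check the actual output of run_detm.sh before relying on them.

import scipy.io
import matplotlib.pyplot as plt

mat = scipy.io.loadmat("results/TSV_CORPUS/detm_PARAMETERS_beta.mat")
beta = mat["values"]        # assumed key; assumed shape (num_topics, num_times, vocab_size)

topic, word_idx = 0, 42     # hypothetical topic and vocabulary index to inspect
trajectory = beta[topic, :, word_idx]

plt.plot(range(len(trajectory)), trajectory, marker="o")
plt.xlabel("time slice")
plt.ylabel("P(word | topic)")
plt.savefig("word_evolution_example.png")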

Original Documentation

All of the information below is included in the documentation of the original D-ETM repository.

Explanation

The DETM is an extension of the Embedded Topic Model (https://arxiv.org/abs/1907.04907) to corpora with temporal dependencies. The DETM models each word with a categorical distribution whose parameter is given by the inner product between the word embedding and an embedding representation of its assigned topic at a particular time step. The word embeddings allow the DETM to generalize to rare words. The DETM learns smooth topic trajectories by defining a random walk prior over the embeddings of the topics. The DETM is fit using structured amortized variational inference with LSTMs.
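Schematically, the per-time-step topic-word distributions are softmaxes over inner products of topic and word embeddings. A notational PyTorch sketch of that construction (illustrative shapes, not the repository's exact implementation):

import torch
import torch.nn.functional as F

V, K, T, L = 5000, 50, 10, 300   # vocab size, topics, time steps, embedding dim (illustrative)

rho = torch.randn(V, L)          # word embeddings
alpha = torch.randn(T, K, L)     # topic embeddings, one per topic per time step

# beta[t, k, v] = softmax over v of <alpha[t, k], rho[v]>: the categorical
# parameter for words assigned to topic k at time step t.
beta = F.softmax(alpha @ rho.t(), dim=-1)    # shape (T, K, V)

# The random-walk prior ties consecutive time steps,
# alpha[t] ~ Normal(alpha[t-1], delta^2 * I), which yields smooth topic trajectories.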

Dependencies

  • python 3.6.7
  • pytorch 1.1.0

Datasets

The pre-processed UN and ACL datasets can be found below:

The pre-fitted embeddings can be found below:

All the scripts to pre-process a dataset can be found in the folder 'scripts'.

Example

To run the DETM on the ACL dataset, you can run the command below. To use different values for the other arguments, see the argument list in main.py.

python main.py --dataset acl --data_path PATH_TO_DATA --emb_path PATH_TO_EMBEDDINGS --min_df 10 --num_topics 50 --lr 0.0001 --epochs 1000 --mode train

Citation

@article{dieng2019dynamic,
  title={The Dynamic Embedded Topic Model},
  author={Dieng, Adji B and Ruiz, Francisco JR and Blei, David M},
  journal={arXiv preprint arXiv:1907.05545},
  year={2019}
}
