Replication data for "Automated Coding of Political Campaign Advertisement Videos: An Empirical Validation Study" by Alexander Tarr, June Hwang, and Kosuke Imai.
- Overview
- Repository Structure
- Data
- Installation
- Preprocessing the WMP Data
- Figure and Table Replication
- Additional Notes
Full replication of the results in the paper is a laborious process, involving significant setup and computation time on the part of the user. To simplify the procedure, we have split replication into three steps:

1. Feature extraction
2. Prediction
3. Validation
Each step may also be executed separately using pre-computed results provided in this repository. For those seeking only to validate the results in the paper, we highly recommend skipping the first two steps, feature extraction and prediction, and focusing on validation. The first two steps require the acquisition of YouTube videos, which cannot be publicly shared, the use of Google Cloud Platform (GCP), which costs money, and a laborious process of installing various software packages. Because both YouTube and GCP change constantly, exact replication of these two steps may not be possible. We therefore provide the intermediate results from both the feature extraction and prediction steps so that users can replicate the validation results easily.
We provide instructions for replicating the Validation step in this document, while instructions for replicating feature extraction and prediction are found in README-FE.md and README-PR.md, respectively.
This repository is split into five folders: `data`, `figs`, `results`, `scripts`, and `tables`.
- `data`: All data needed to perform both feature extraction and validation.
- `figs`: PDFs for figures generated by the code that are displayed in the paper.
- `results`: CSV files containing predicted labels for the tasks studied in the paper, along with raw text files showing general statistics about the performance of our methods discussed in the main text of the paper.
- `scripts`: All code needed to generate data, extract features, validate results, and create figures and tables.
- `tables`: Raw text files showing confusion matrices and coverage tables corresponding to tables in the paper.
Replication in the Validation step requires the human-coded labels provided by WMP, which cannot be shared publicly. This data can be purchased here. Our study used the 2012 Presidential, 2012 Non-Presidential, and 2014 data. The data is distributed across 7 Stata files, one for each year and race type (House, Senate, Governor, President). These files should be placed in the `data/wmp` folder.
To replicate the validation step and create all figures and tables, follow the instructions below in the order they appear.
Recreating all figures, tables and results requires working installations of
- Python, version 3.9 or greater. We recommend the Anaconda distribution for users unfamiliar with Python.
- R, version 4.0 or greater.
All code in this repo was tested under Python version 3.9.7 and R version 4.0.5 on a Windows 10 machine.
All Python code in the validation step uses the following packages: matplotlib, numpy, pandas, scikit-learn, and seaborn, all of which can be installed via

```
pip install <PACKAGE_NAME>
```
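Before running any scripts, you can check which of these packages are missing from the current environment with a small helper like the one below (our own convenience sketch, not part of the repository). Note that scikit-learn is imported under the name `sklearn`.

```python
import importlib.util

# Import names of the validation-step dependencies (scikit-learn -> sklearn).
REQUIRED = ["matplotlib", "numpy", "pandas", "sklearn", "seaborn"]

def missing_packages(required=REQUIRED):
    """Return the subset of `required` that cannot be imported."""
    return [pkg for pkg in required if importlib.util.find_spec(pkg) is None]

if __name__ == "__main__":
    missing = missing_packages()
    print("missing packages:", ", ".join(missing) if missing else "none")
```

Any package reported as missing can then be installed with the `pip` command above.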
All R code uses the following packages: dplyr, here, lme4, quanteda, quanteda.sentiment, readstata13, readtext, stargazer, and xtable, most of which can be installed from within the R environment via

```
install.packages("<PACKAGE_NAME>")
```
`quanteda.sentiment` is not available on CRAN and must be installed (using the devtools package) via

```
devtools::install_github("quanteda/quanteda.sentiment")
```
Before any results can be produced, the WMP data must be cleaned. After placing the Stata files into `data/wmp`, clean the data via

```
Rscript scripts/preprocess_CMAG.R
```

This file may also be sourced from within an IDE, such as RStudio. Be sure to set the working directory to the repo folder, `campvideo-data`. After running, a file called `wmp_final.csv` should be created in `data/wmp`.
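To confirm that preprocessing succeeded before moving on, a check like the following can help (a hypothetical helper of our own, not part of the repository; `repo_root` should point at the `campvideo-data` folder):

```python
from pathlib import Path

def preprocessing_done(repo_root="."):
    """Check whether preprocess_CMAG.R has produced data/wmp/wmp_final.csv."""
    return (Path(repo_root) / "data" / "wmp" / "wmp_final.csv").is_file()

if __name__ == "__main__":
    print("wmp_final.csv present:", preprocessing_done())
```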
All figure and table replication scripts are in the `scripts` folder. The files are named after the figures and tables they replicate. For example, `figure5.R` recreates Figure 5, and `tableS14-6.py` recreates Appendix Table S14.6. Note that some scripts create multiple tables or figures.
The full list of figures and tables and associated replication code is given below.
Result | Description | Language | Script |
---|---|---|---|
Figure 5 | MTurk results for issue mentions | R | figure5.R |
Figure 8 | MTurk results for ominous/tense mood classification | R | figure8_S14-9_S14-10.R |
Figure S7.4 | Video summarization validation study results | Python | figureS7-4.py |
Figure S13.8 | ROC plots for face recognition | Python | figureS13-8.py |
Figure S14.9 | MTurk results for uplifting mood classification | R | figure8_S14-9_S14-10.R |
Figure S14.10 | MTurk results for sad/sorrowful mood classification | R | figure8_S14-9_S14-10.R |
Table 1 | Matched video coverage table | R | table1.R |
Table 2 | Confusion matrices for issue mentions | Python | table2.py |
Table 3 | Confusion matrices for opponent mentions | Python | table3.py |
Table 4 | Confusion matrices for face recognition | Python | table4.py |
Table 5 | Confusion matrices for mood classification | Python | table5.py |
Table 6 | Confusion matrices for ad negativity classification (NSVM) | Python | table6.py |
Table S1.1 | YouTube channel coverage table | R | tableS1-1.R |
Table S14.5 | Confusion matrix for mood MTurk results | Python | tableS14-5.py |
Table S14.6 | Confusion matrices for ad negativity classification (All) | Python | tableS14-6.py |
Table S14.7 | Confusion matrix for LSD results | R | tableS14-7.R |
Table S14.8 | Regression coefficients for issue convergence study | R | tableS14-8.R |
Python scripts can be executed via

```
python scripts/<SCRIPT>
```

and R scripts can be executed via

```
Rscript scripts/<SCRIPT>
```

where `<SCRIPT>` is the name given in the "Script" column of the table above.
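Since each script is standalone, the whole table can also be driven by a small batch runner like the sketch below (our own convenience code, not part of the repository; it assumes `python` and `Rscript` are on the PATH and that `campvideo-data` is the working directory):

```python
import subprocess
from pathlib import Path

# Map script extensions to the interpreter that runs them.
RUNNERS = {".py": "python", ".R": "Rscript"}

def build_command(script):
    """Build the command for one replication script, e.g. 'figure5.R'."""
    interpreter = RUNNERS[Path(script).suffix]
    return [interpreter, f"scripts/{script}"]

def run_all(scripts):
    """Run each replication script in turn, stopping on the first failure."""
    for script in scripts:
        subprocess.run(build_command(script), check=True)

# Example usage (a subset of the table above):
# run_all(["figure5.R", "table2.py"])
```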
- Preprocessing the data can take up to 10-15 minutes. Recreating the figures and tables using pre-computed results only takes a few minutes.
- Some confusion matrices in `tables` will differ slightly from what is displayed in the paper. This is due to rounding: after entries are truncated to significant figures, a confusion matrix is not guaranteed to sum to 100%. The values in the paper have been adjusted to sum to 100%.
- 'File not found' errors are likely due to an incorrect working directory. All code assumes this repo, `campvideo-data`, is the working directory.
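The rounding discrepancy mentioned above can be illustrated with a toy example (made-up numbers, not the paper's data):

```python
# Three confusion-matrix cells whose true percentages sum to exactly 100.
cells = [33.333, 33.333, 33.334]

# Truncate each cell to one decimal place for display.
rounded = [round(c, 1) for c in cells]

print(rounded)                # [33.3, 33.3, 33.3]
print(f"{sum(rounded):.1f}")  # 99.9 -- the displayed cells no longer sum to 100
```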