Replication data for "Automated Coding of Political Campaign Advertisement Videos: An Empirical Validation Study" by Alexander Tarr, June Hwang, and Kosuke Imai.
- Overview
- Repository Structure
- Data
- Installation
- Preprocessing the WMP Data
- Figure and Table Replication
- Additional Notes
Full replication of the results in the paper is a laborious process, involving significant setup and computation time on the part of the user. To simplify the procedure, we have split replication into three steps:

1. Feature extraction
2. Prediction
3. Validation
Each step may also be executed separately using pre-computed results provided in this repository. For those seeking only to validate the results in the paper, we highly recommend skipping the first two steps, feature extraction and prediction, and focusing on validation. The first two steps require the acquisition of YouTube videos, which cannot be publicly shared, the use of Google Cloud Platform (GCP), which costs money, and a laborious process of installing various software packages. Because both YouTube and GCP change constantly, exact replication of these two steps may not be possible. We therefore provide the intermediate results from both the feature extraction and prediction steps so that users can replicate the validation results easily.
We provide instructions for replicating the Validation step in this document, while instructions for replicating feature extraction and prediction are found in README-FE.md and README-PR.md, respectively.
This repository is split into five folders: `data`, `figs`, `results`, `scripts`, and `tables`.
- `data`: All data needed to perform both feature extraction and validation.
- `figs`: PDFs for figures generated by the code that are displayed in the paper.
- `results`: CSV files containing predicted labels for the tasks studied in the paper, along with raw text files showing general statistics about the performance of our methods discussed in the main text of the paper.
- `scripts`: All code needed to generate data, extract features, validate results, and create figures and tables.
- `tables`: Raw text files showing confusion matrices and coverage tables corresponding to tables in the paper.
Replication in the Validation step requires the human-coded labels provided by WMP, which cannot be shared publicly. This data can be purchased here. Our study used the 2012 Presidential, 2012 Non-Presidential, and 2014 data. The data is distributed across 7 Stata files, one for each year and race type (House, Senate, Governor, President). These files should be placed in the `data/wmp` folder.
To replicate the validation step and create all figures and tables, follow the instructions below in the order they appear.
Recreating all figures, tables and results requires working installations of
- Python, version 3.9 or greater. We recommend the Anaconda distribution for users unfamiliar with Python.
- R, version 4.0 or greater.
All code in this repo was tested under Python version 3.9.7 and R version 4.0.5 on a Windows 10 machine.
All Python code in the validation step uses the following packages: matplotlib, numpy, pandas, scikit-learn, and seaborn, all of which can be installed via

```
pip install <PACKAGE_NAME>
```
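Before running any scripts, you can check which of these packages are missing from the current environment with a small helper like the one below (our own convenience sketch, not part of the repository). Note that scikit-learn is imported under the name `sklearn`.

```python
import importlib.util

# Import names of the validation-step dependencies (scikit-learn -> sklearn).
REQUIRED = ["matplotlib", "numpy", "pandas", "sklearn", "seaborn"]

def missing_packages(required=REQUIRED):
    """Return the subset of `required` that cannot be imported."""
    return [pkg for pkg in required if importlib.util.find_spec(pkg) is None]

if __name__ == "__main__":
    missing = missing_packages()
    print("missing packages:", ", ".join(missing) if missing else "none")
```

Any package reported as missing can then be installed with the `pip` command above.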
All R code uses the following packages: dplyr, here, lme4, quanteda, quanteda.sentiment, readstata13, readtext, stargazer, and xtable, most of which can be installed from within the R environment via

```
install.packages("<PACKAGE_NAME>")
```
`quanteda.sentiment` is not available on CRAN and must be installed (using the devtools package) via

```
devtools::install_github("quanteda/quanteda.sentiment")
```
Before any results can be produced, the WMP data must be cleaned. After placing the Stata files into `data/wmp`, clean the data via

```
Rscript scripts/preprocess_CMAG.R
```

This file may also be sourced from within an IDE, such as RStudio. Be sure to set the working directory to the repo folder, `campvideo-data`. After running, a file called `wmp_final.csv` should be created in `data/wmp`.
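To confirm that preprocessing succeeded before moving on, a check like the following can help (a hypothetical helper of our own, not part of the repository; `repo_root` should point at the `campvideo-data` folder):

```python
from pathlib import Path

def preprocessing_done(repo_root="."):
    """Check whether preprocess_CMAG.R has produced data/wmp/wmp_final.csv."""
    return (Path(repo_root) / "data" / "wmp" / "wmp_final.csv").is_file()

if __name__ == "__main__":
    print("wmp_final.csv present:", preprocessing_done())
```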
All figure and table replication scripts are in the `scripts` folder. The files are named after the figures and tables they replicate. For example, `figure5.R` recreates Figure 5, and `tableS14-6.py` recreates Appendix Table S14.6. Note that some scripts create multiple tables or figures.
The full list of figures and tables and associated replication code is given below.
Result | Description | Language | Script |
---|---|---|---|
Figure 5 | MTurk results for issue mentions | R | figure5.R |
Figure 8 | MTurk results for ominous/tense mood classification | R | figure8_S14-9_S14-10.R |
Figure S7.4 | Video summarization validation study results | Python | figureS7-4.py |
Figure S13.8 | ROC plots for face recognition | Python | figureS13-8.py |
Figure S14.9 | MTurk results for uplifting mood classification | R | figure8_S14-9_S14-10.R |
Figure S14.10 | MTurk results for sad/sorrowful mood classification | R | figure8_S14-9_S14-10.R |
Table 1 | Matched video coverage table | R | table1.R |
Table 2 | Confusion matrices for issue mentions | Python | table2.py |
Table 3 | Confusion matrices for opponent mentions | Python | table3.py |
Table 4 | Confusion matrices for face recognition | Python | table4.py |
Table 5 | Confusion matrices for mood classification | Python | table5.py |
Table 6 | Confusion matrices for ad negativity classification (NSVM) | Python | table6.py |
Table S1.1 | YouTube channel coverage table | R | tableS1-1.R |
Table S14.5 | Confusion matrix for mood MTurk results | Python | tableS14-5.py |
Table S14.6 | Confusion matrices for ad negativity classification (All) | Python | tableS14-6.py |
Table S14.7 | Confusion matrix for LSD results | R | tableS14-7.R |
Table S14.8 | Regression coefficients for issue convergence study | R | tableS14-8.R |
Python scripts can be executed via

```
python scripts/<SCRIPT>
```

and R scripts can be executed via

```
Rscript scripts/<SCRIPT>
```

where `<SCRIPT>` is the name given in the "Script" column of the table above.
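Since each script is standalone, the whole table can also be driven by a small batch runner like the sketch below (our own convenience code, not part of the repository; it assumes `python` and `Rscript` are on the PATH and that `campvideo-data` is the working directory):

```python
import subprocess
from pathlib import Path

# Map script extensions to the interpreter that runs them.
RUNNERS = {".py": "python", ".R": "Rscript"}

def build_command(script):
    """Build the command for one replication script, e.g. 'figure5.R'."""
    interpreter = RUNNERS[Path(script).suffix]
    return [interpreter, f"scripts/{script}"]

def run_all(scripts):
    """Run each replication script in turn, stopping on the first failure."""
    for script in scripts:
        subprocess.run(build_command(script), check=True)

# Example usage (a subset of the table above):
# run_all(["figure5.R", "table2.py"])
```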
- Preprocessing the data can take up to 10-15 minutes. Recreating the figures and tables using pre-computed results only takes a few minutes.
- Some confusion matrices in `tables` will differ slightly from what is displayed in the paper. This is due to rounding: after entries are truncated to significant figures, a confusion matrix is not guaranteed to sum to 100%. The values in the paper have been adjusted to sum to 100%.
- 'File not found' errors are likely due to an incorrect working directory. All code assumes this repo, `campvideo-data`, is the working directory.
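The rounding discrepancy mentioned above can be illustrated with a toy example (made-up numbers, not the paper's data):

```python
# Three confusion-matrix cells whose true percentages sum to exactly 100.
cells = [33.333, 33.333, 33.334]

# Truncate each cell to one decimal place for display.
rounded = [round(c, 1) for c in cells]

print(rounded)                # [33.3, 33.3, 33.3]
print(f"{sum(rounded):.1f}")  # 99.9 -- the displayed cells no longer sum to 100
```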