This repository contains the code and data needed to replicate the analyses presented in the paper "Who supports Bernie? analyzing identity and ideological variation on Twitter during the 2020 Democratic primaries" by shuster, Madani, Campos-Castillo, and Joseph. If you use this code base or ideas from the paper, please consider citing us!
CITATION FORTHCOMING
Due to limitations on what can be shared from the Twitter API, we do not share the raw tweets. However:
- These data are available upon request from the last author
- We provide data that is sufficient to replicate all analyses in the present work aside from data collection and the data used to construct the trimmed retweet network
- We provide code that shows how raw data was collected and how this trimmed retweet network was constructed.
We therefore believe this repository contains the data needed to replicate and extend the results presented in the paper. Please open an issue if you have questions!
The replication codebase consists of five code files and several data files used by them. Each code file is described below:
- `01_collect_data.py`
  - This `python` script uses `tweepy` to listen to the Twitter v1.1 Streaming API for a list of keywords and/or mentions of specific handles. The keywords and handles we listened for are described in the article text; the final list of keywords used is contained in the `api_query.json` file here. To connect to the (now deprecated) streaming API, we called this script at the command line using the statement `python 01_collect_data.py [PATH_TO_API_CREDENTIALS] api_query.json [OUTPUT_DIRECTORY_LOCATION]`.
- `02_preprocess_data.ipynb`
  - This `python` notebook takes the raw output from `01_collect_data.py` and transforms it into a series of `parquet` files used for downstream analysis. Of particular note is the creation of the `trimmed_rtnet` directory, which contains the data that we run VSP on to generate the clusters we use to identify Bernie-supporting accounts (i.e., our "who retweets whom" analysis). A tarred and gzipped version of this dataset is available in the `data` directory; please make sure to untar it before attempting replication. Additionally, because of the size of the file, we have used git-lfs to upload it to this repository, so you will need to enable git-lfs to download it.
- `03_run_vsp.ipynb`
  - This `R` notebook provides the code needed to replicate our use of VSP in the paper, both for identifying Bernie-supporting accounts and for identifying different sets of framed values. Results for the "who retweets whom" clustering (and the experiments using different values of `k`) are available here as a (16GB!) `.rdata` file. Our results for the "who retweets what" clustering are available at `data/rt_tweet_clustering.rdata` in this repository, but note that we have again used git-lfs for this file.
- `04_post_vsp_dataprocessing.ipynb`
  - This `python` notebook generates various intermediary datasets used in our results. All of these datasets are included in this repository in the `data` directory as `.tgz` files; please make sure to untar them before replicating our work.
- `05_plot_results.ipynb`
  - This `R` notebook provides code to generate all of the main figures in the paper, using data provided in this repository.
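Several of the steps above ask you to untar archives in the `data` directory, some of which are stored with git-lfs. The following is a minimal sketch of one way to do that from Python; it is not part of the replication code. The helper names `is_lfs_pointer` and `extract_archives` are hypothetical, and the script assumes it is run from the repository root so that `data` resolves correctly. It relies on the fact that a file that has not yet been fetched through git-lfs is a small text pointer whose first line begins with `version https://git-lfs.github.com/spec/v1`.

```python
import tarfile
from pathlib import Path

# First bytes of a git-lfs pointer file, i.e. what is checked out when the
# real LFS object has not been downloaded yet.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path: Path) -> bool:
    """Return True if `path` looks like an un-fetched git-lfs pointer file."""
    with path.open("rb") as handle:
        return handle.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX


def extract_archives(data_dir: Path) -> list[Path]:
    """Extract every .tgz / .tar.gz archive found directly inside `data_dir`.

    Returns the archives that were extracted. Raises RuntimeError if an
    archive is still a git-lfs pointer rather than the real file.
    """
    extracted = []
    archives = sorted(data_dir.glob("*.tgz")) + sorted(data_dir.glob("*.tar.gz"))
    for archive in archives:
        if is_lfs_pointer(archive):
            raise RuntimeError(
                f"{archive} is a git-lfs pointer; run `git lfs pull` first."
            )
        # Only extract archives you trust: members are unpacked into data_dir.
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(path=data_dir)
        extracted.append(archive)
    return extracted


if __name__ == "__main__":
    # Assumption: the script is run from the repository root.
    for archive in extract_archives(Path("data")):
        print(f"extracted {archive}")
```

If the script reports a pointer file, installing git-lfs and running `git lfs pull` in the repository should replace the pointer with the real archive, after which the script can be re-run.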
Please let us know if you have any issues!