ProjectX-2021

This is the GitHub repository for Cornell Data Science's 2021 ProjectX team. Members: Alexander Wang, Jerry Sun, Kevin Zhou, Kaitlyn Chen, Edward Gu, Melinda Fang.

arXiv Paper

https://arxiv.org/abs/2207.01483

Instructions for how to replicate our results:

Under the following headings, we describe how to replicate the experiments for each of our subprojects.

Data Collection

The USC data source can be found here: https://github.com/echen102/COVID-19-TweetIDs

The CMU data source can be downloaded here: https://zenodo.org/record/4024154#.YVTHH5rMJPZ

The CMU data source can be hydrated manually by running the data/MiscovData.ipynb notebook. The only feature added is the tweet text. Note that only tweets that still exist at the time of hydration are returned, so a tweet that has since been deleted will not appear in the hydrated data.
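Because deleted tweets silently drop out during hydration, it can help to record which ids came back. The following is a minimal sketch, not code from the notebook: the function name `split_hydrated` and the `id` and `text` keys are assumptions made for illustration.

```python
def split_hydrated(requested_ids, hydrated_rows):
    """Split the requested tweet ids into those that were hydrated and those that were not.

    `hydrated_rows` is assumed to be a list of dicts with an "id" key
    (a hypothetical layout; adapt it to however your hydrated data is stored).
    """
    hydrated_ids = {str(row["id"]) for row in hydrated_rows}
    found = [tid for tid in requested_ids if str(tid) in hydrated_ids]
    missing = [tid for tid in requested_ids if str(tid) not in hydrated_ids]
    return found, missing


# Made-up example ids and rows, for illustration only.
requested = ["1001", "1002", "1003"]
hydrated = [
    {"id": "1001", "text": "example tweet"},
    {"id": "1003", "text": "another example"},
]

found, missing = split_hydrated(requested, hydrated)
print(f"{len(found)} hydrated, {len(missing)} unavailable: {missing}")
```

Keeping the list of missing ids makes it easier to compare dataset sizes between hydration runs made at different times.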

The USC data source is very large and contains over 2 billion tweets. The data/CovMasterSet.ipynb notebook samples tweet ids from the collection and compiles them into a CSV file. This file can then be hydrated with the Hydrator application, which can be downloaded and run locally to populate the tweet features according to its documentation. Some further data population and pre-processing takes place in the pre-processing notebook described under the Virality Analysis section.
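The sampling step can be sketched as follows. This is an illustration under stated assumptions rather than the code in data/CovMasterSet.ipynb: it assumes the ids are stored in text files with one id per line, uses reservoir sampling so the full collection never has to fit in memory, and the function names and the `id` column header are hypothetical.

```python
import csv
import random
import tempfile
from pathlib import Path


def sample_tweet_ids(id_files, k, seed=0):
    """Reservoir-sample up to k tweet ids from files holding one id per line."""
    rng = random.Random(seed)
    sample, seen = [], 0
    for path in id_files:
        with open(path) as fh:
            for line in fh:
                tweet_id = line.strip()
                if not tweet_id:
                    continue
                seen += 1
                if len(sample) < k:
                    # Fill the reservoir with the first k ids.
                    sample.append(tweet_id)
                else:
                    # Replace an existing entry with probability k / seen.
                    j = rng.randrange(seen)
                    if j < k:
                        sample[j] = tweet_id
    return sample


def write_id_csv(tweet_ids, out_path):
    """Write the sampled ids to a one-column CSV (hypothetical "id" header)."""
    with open(out_path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["id"])
        writer.writerows([tid] for tid in tweet_ids)


# Small demo on temporary files with made-up ids.
with tempfile.TemporaryDirectory() as tmp:
    files = []
    for n, ids in enumerate([["11", "12", "13"], ["21", "22"]]):
        path = Path(tmp) / f"ids_{n}.txt"
        path.write_text("\n".join(ids) + "\n")
        files.append(path)
    sampled = sample_tweet_ids(files, k=3)
    out_csv = Path(tmp) / "sampled_ids.csv"
    write_id_csv(sampled, out_csv)
    csv_lines = out_csv.read_text().splitlines()

print(csv_lines)
```

Fixing the seed keeps the sample reproducible, which matters when the hydrated results are later compared across runs.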

ClaimBuster

In this section you can find code to classify claims and non-claims.

Bi-Directional LSTM implementation of ClaimBuster
link to original ClaimBuster repository

Tweet Legitimacy Classifier

Check the README.md inside of the tweet-legitimacy-classifier directory.

Virality Analysis

The pre-processing can be run here: https://colab.research.google.com/drive/1AcasEIEHxz07N9FJ5EUmLitTqrVOlKhk?usp=sharing

When running any of the Drive or file path commands, make sure that your local directories match the given paths, or edit the paths to match the structure of your local system.
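One way to keep path edits in a single place is to build every file path from one base directory. This is a minimal sketch and not part of the notebooks: the environment variable `PROJECTX_DATA_DIR`, the helper `data_path`, and the file name `tweets.csv` are all hypothetical.

```python
import os
from pathlib import Path

# Hypothetical setting: point PROJECTX_DATA_DIR at your copy of the data.
# If it is unset, fall back to a local "data" directory.
DATA_DIR = Path(os.environ.get("PROJECTX_DATA_DIR", "data"))


def data_path(*parts):
    """Build a path under DATA_DIR so only one setting changes per machine."""
    return DATA_DIR.joinpath(*parts)


# "tweets.csv" is a placeholder file name used only for illustration.
print(data_path("tweets.csv"))
```

With this layout, switching between Colab and a local machine means changing one value rather than every cell that reads or writes a file.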

Some notes on running the pre-processing notebook:

The first few code blocks of the pre-processing notebook in Colab scrape data through the Twitter API, which takes several hours to complete.
Running the processed text through BERT to obtain the word vector embeddings also takes several hours. Feeding the data through the BERT model also causes the kernel to crash fairly frequently. There is a tedious workaround: manually and repeatedly run the code cell that initializes and creates the datasets, followed by the loop cell that feeds the data through the BERT forward pass (our attempts at automating this process were not successful, as the kernel would crash). Follow the comments in those cells to run that process properly.
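A general pattern that can reduce the cost of such crashes is to save each chunk of embeddings to disk and skip finished chunks on the next run. The sketch below is not the notebook's code and has not been tested against it: `embed_fn` stands in for the BERT forward pass, and the function names, chunk file names, and JSON storage format are assumptions.

```python
import json
import tempfile
from pathlib import Path


def embed_in_chunks(texts, embed_fn, out_dir, chunk_size=256):
    """Embed texts chunk by chunk, skipping chunks already saved in out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for start in range(0, len(texts), chunk_size):
        chunk_file = out_dir / f"chunk_{start:08d}.json"
        if chunk_file.exists():
            continue  # finished in an earlier run, so do not recompute
        vectors = embed_fn(texts[start:start + chunk_size])
        # Write to a temporary file first, then rename, so that a crash
        # mid-write never leaves a partial chunk that looks complete.
        tmp_file = chunk_file.with_suffix(".tmp")
        tmp_file.write_text(json.dumps(vectors))
        tmp_file.replace(chunk_file)


def load_embeddings(out_dir):
    """Concatenate the saved chunks in their original order."""
    vectors = []
    for chunk_file in sorted(Path(out_dir).glob("chunk_*.json")):
        vectors.extend(json.loads(chunk_file.read_text()))
    return vectors


def fake_embed(batch):
    """Stand-in for the real model: one number per text (its length)."""
    return [[float(len(text))] for text in batch]


demo_texts = ["a", "bb", "ccc", "dddd", "eeeee"]
demo_dir = tempfile.mkdtemp()
embed_in_chunks(demo_texts, fake_embed, demo_dir, chunk_size=2)
restored = load_embeddings(demo_dir)
print(restored)
```

After a kernel crash, rerunning the same call resumes at the first chunk that was not saved, so only the interrupted chunk is recomputed.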

The classification and regression models can be found and run with ease here: https://colab.research.google.com/drive/1nArfr4hv7V-is2LYgz4PLyuqHcDkKOIc?usp=sharing

Full Pipeline Analysis

The curation of data and analysis of results can be found in this notebook: https://colab.research.google.com/drive/1uF7UoZY55Ybmh0TlvNPAbW4V_jJK0uFa?usp=sharing

Note that this notebook does not include code that runs the data through the entire pipeline. It only includes the creation of the dataset as well as any analysis or derived insights.
