
RedditSent Models

Paper

0. Setup

Note: This code has been tested only on Linux, but it shouldn't be difficult to set everything up on macOS or Windows.

Linux:

Install Python 3.6:

sudo add-apt-repository ppa:deadsnakes/ppa

sudo apt-get update

sudo apt-get install build-essential python3.6 python3.6-dev python3.6-venv git

Clone the Repo:

git clone git@github.com:raunak-agarwal/RedditSent-Models.git

Setup the virtual environment:

python3.6 -m venv venv

source venv/bin/activate

pip install -r requirements.txt

python -m spacy download en_core_web_lg

1. Benchmarking against the SARC dataset

The Self-Annotated Reddit Corpus (SARC) is the largest publicly available annotated corpus of Reddit comments. We utilise comments from the balanced section of the corpus to benchmark the models described below.

1.1 SARC Baseline

The SARC paper provides a simple baseline: an average of GloVe embeddings fed into a logistic-regression (logit) classifier. We describe several architectures that perform better.
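A minimal sketch of that baseline, reusing the en_core_web_lg vectors installed in Setup and scikit-learn's logistic regression. The `texts`/`labels` placeholders are illustrative; in practice they come from the SARC balanced training files:

```python
import numpy as np
import spacy
from sklearn.linear_model import LogisticRegression

nlp = spacy.load("en_core_web_lg")  # GloVe-style vectors installed in Setup

# Placeholder comments and sarcasm flags; in practice these are loaded
# from the SARC balanced training files.
texts = ["oh great, another monday", "this explanation really helped",
         "wow, what a shocker", "thanks, that fixed my issue"]
labels = [1, 0, 1, 0]

# spaCy's doc.vector is the average of the document's token vectors,
# which is exactly the representation the SARC baseline uses.
X = np.array([nlp(t).vector for t in texts])

clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict(X))
```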

1.2 Byte-Pair Encoding

Byte-Pair Encoding (BPE) provides tokenisation with subword segmentation. We feed the resulting subword tokens into a TF-IDF + logistic-regression pipeline.

(See)
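One way to reproduce this pipeline, sketched here with sentencepiece for the BPE step. The file name `comments.txt`, the vocabulary size, and the tiny `texts`/`labels` placeholders are assumptions, not the repo's exact configuration:

```python
import sentencepiece as spm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Train a BPE model on the raw comments, one comment per line.
# "comments.txt" is a placeholder for the SARC training text.
spm.SentencePieceTrainer.train(
    input="comments.txt", model_prefix="bpe",
    vocab_size=16000, model_type="bpe")
sp = spm.SentencePieceProcessor(model_file="bpe.model")

# Placeholder labelled data; in practice, the SARC balanced split.
texts = ["oh great, another monday", "this explanation really helped"]
labels = [1, 0]

pipeline = make_pipeline(
    # Tokenise into BPE subwords instead of whitespace words.
    TfidfVectorizer(tokenizer=lambda t: sp.encode(t, out_type=str),
                    lowercase=False),
    LogisticRegression(max_iter=1000),
)
pipeline.fit(texts, labels)
```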

1.3 Fasttext + Pretrained Vectors

Using the corpus described in Part 2, we create dense unsupervised representations of the Reddit vocabulary. These "pretrained" vectors are then fine-tuned on the SARC training files with a softmax loss.

(See)
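A minimal fine-tuning sketch with the fastText Python bindings; the file names and hyperparameters below are assumptions:

```python
import fasttext

# Fine-tune the unsupervised Reddit vectors from Part 2 on the SARC
# training data. The training file must be in fastText's supervised
# format, e.g. "__label__1 <comment text>" per line.
model = fasttext.train_supervised(
    input="sarc_train.txt",          # placeholder training file
    pretrainedVectors="reddit.vec",  # unsupervised vectors from Part 2
    dim=300,                         # must match the pretrained vector size
    loss="softmax",
    epoch=10,
)
print(model.test("sarc_test.txt"))   # (n, precision@1, recall@1)
```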

1.4 BERT + BiDirectional LSTM

(Code)

(Hyperparam Search Results)
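For orientation, a rough sketch of the architecture this section names: BERT token embeddings feeding a bidirectional LSTM with a linear classification head. The model name, layer sizes, and the frozen-BERT choice are assumptions, not the repo's exact configuration (see the hyperparameter search results above):

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class BertBiLSTM(nn.Module):
    def __init__(self, hidden=256, n_classes=2):
        super().__init__()
        self.bert = AutoModel.from_pretrained("bert-base-uncased")
        self.lstm = nn.LSTM(self.bert.config.hidden_size, hidden,
                            batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, n_classes)

    def forward(self, input_ids, attention_mask):
        with torch.no_grad():  # freeze BERT; train only the LSTM head
            states = self.bert(input_ids,
                               attention_mask=attention_mask).last_hidden_state
        out, _ = self.lstm(states)
        return self.fc(out[:, -1])  # last BiLSTM timestep -> class logits

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
batch = tokenizer(["that went well", "oh great, another outage"],
                  padding=True, truncation=True, return_tensors="pt")
logits = BertBiLSTM()(batch["input_ids"], batch["attention_mask"])
```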

Results

(See)

2. Building Topical Corpora via Pushshift

Pushshift is a free service that ingests real-time comments from Reddit. We query its API to build a corpus of comments from five of the largest English-language political subreddits: r/politics, r/news, r/worldnews, r/unitedkingdom, and r/europe. The corpus contains around 7.5M comments and 150M word tokens. Download the preprocessed corpus here. The pipeline has four stages (a minimal scraping sketch follows the list):

  1. Lexicons
  2. Data Scrape
  3. Filtering and Preprocessing
  4. Word Vectors
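A minimal sketch of the data-scrape stage against the public Pushshift comment endpoint. The paging-by-created_utc scheme and request sizes are assumptions, and Pushshift's availability and rate limits have changed over time:

```python
import time
import requests

URL = "https://api.pushshift.io/reddit/search/comment/"
SUBREDDITS = ["politics", "news", "worldnews", "unitedkingdom", "europe"]

def fetch_comments(subreddit, before=None, size=100):
    # Pull one page of comments for a subreddit, newest first.
    params = {"subreddit": subreddit, "size": size, "sort": "desc"}
    if before:
        params["before"] = before
    return requests.get(URL, params=params, timeout=30).json()["data"]

for sub in SUBREDDITS:
    before = None
    for _ in range(3):  # a few pages per subreddit, as a demo
        batch = fetch_comments(sub, before)
        if not batch:
            break
        before = batch[-1]["created_utc"]  # page backwards in time
        print(sub, len(batch), "comments up to", before)
        time.sleep(1)  # be polite to the API
```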

3. Data Annotation using Prodigy

Using the corpus created above, we annotate a subset of comments from r/politics. Annotation is performed with Prodigy and a custom recipe. The annotations are available here.

Prodigy

Note: Prodigy is not free software.

Try Prodigy with our r/politics corpus.
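For orientation, a sketch of what a custom recipe looks like, following Prodigy's documented recipe pattern; the recipe name, loader, and labels here are illustrative, not the repo's actual recipe:

```python
import prodigy
from prodigy.components.loaders import JSONL

@prodigy.recipe(
    "reddit-sentiment",
    dataset=("Dataset to save annotations to", "positional", None, str),
    source=("Path to a JSONL file of comments", "positional", None, str),
)
def reddit_sentiment(dataset, source):
    stream = JSONL(source)  # each line: {"text": "..."}
    return {
        "dataset": dataset,
        "stream": stream,
        "view_id": "classification",  # accept/reject a single label
    }
```

Such a recipe would be started with something like `prodigy reddit-sentiment my_dataset comments.jsonl -F recipe.py`.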

Contributors

Raunak Agarwal

Luka Borec

TODO

  1. Finish Documentation
  2. Dockerize
  3. Extend Annotations
  4. Extend Graphs
  5. Move files to S3
  6. Add Sentence Vectors
  7. Citations

LICENSE

GPL-3.0

TL;DR: You may copy, distribute, and modify the software as long as you track changes/dates in source files. Any modifications to, or software including (via compiler), GPL-licensed code must also be made available under the GPL, along with build & install instructions.
