
RedditSent Models

Paper

0. Setup

Note: This code has been tested only on Linux, but it shouldn't be difficult to set everything up on macOS or Windows.

Linux:

Install Python 3.6:

sudo add-apt-repository ppa:deadsnakes/ppa

sudo apt-get update

sudo apt-get install build-essential python3.6 python3.6-dev python3.6-venv git

Clone the Repo:

git clone git@github.com:raunak-agarwal/RedditSent-Models.git

Setup the virtual environment:

python3.6 -m venv venv

source venv/bin/activate

pip install -r requirements.txt

python -m spacy download en_core_web_lg

1. Benchmarking against the SARC dataset

The Self-Annotated Reddit Corpus (SARC) is the largest publicly available annotated corpus of Reddit comments. We utilise comments from the balanced section of the corpus to benchmark the models described below.

1.1 SARC Baseline

The SARC paper provides a simple baseline: an average of GloVe embeddings fed into a logistic-regression (logit) classifier. We describe several architectures that perform better.
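A minimal sketch of that baseline, reusing the en_core_web_lg vectors installed in Setup and scikit-learn's logistic regression. The `texts`/`labels` placeholders are illustrative; in practice they come from the SARC balanced training files:

```python
import numpy as np
import spacy
from sklearn.linear_model import LogisticRegression

nlp = spacy.load("en_core_web_lg")  # GloVe-style vectors installed in Setup

# Placeholder comments and sarcasm flags; in practice these are loaded
# from the SARC balanced training files.
texts = ["oh great, another monday", "this explanation really helped",
         "wow, what a shocker", "thanks, that fixed my issue"]
labels = [1, 0, 1, 0]

# spaCy's doc.vector is the average of the document's token vectors,
# which is exactly the representation the SARC baseline uses.
X = np.array([nlp(t).vector for t in texts])

clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict(X))
```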

1.2 Byte-Pair Encoding

Byte-Pair Encoding (BPE) provides tokenisation with subword segmentation. We feed the resulting subword tokens into a TF-IDF + logistic-regression pipeline.

(See)
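One way to reproduce this pipeline, sketched here with sentencepiece for the BPE step. The file name `comments.txt`, the vocabulary size, and the tiny `texts`/`labels` placeholders are assumptions, not the repo's exact configuration:

```python
import sentencepiece as spm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Train a BPE model on the raw comments, one comment per line.
# "comments.txt" is a placeholder for the SARC training text.
spm.SentencePieceTrainer.train(
    input="comments.txt", model_prefix="bpe",
    vocab_size=16000, model_type="bpe")
sp = spm.SentencePieceProcessor(model_file="bpe.model")

# Placeholder labelled data; in practice, the SARC balanced split.
texts = ["oh great, another monday", "this explanation really helped"]
labels = [1, 0]

pipeline = make_pipeline(
    # Tokenise into BPE subwords instead of whitespace words.
    TfidfVectorizer(tokenizer=lambda t: sp.encode(t, out_type=str),
                    lowercase=False),
    LogisticRegression(max_iter=1000),
)
pipeline.fit(texts, labels)
```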

1.3 Fasttext + Pretrained Vectors

Using the corpus described in Part 2, we create dense unsupervised representations of the Reddit vocabulary. These "pretrained" vectors are then fine-tuned on the SARC training files with a softmax loss.

(See)
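A minimal fine-tuning sketch with the fastText Python bindings; the file names and hyperparameters below are assumptions:

```python
import fasttext

# Fine-tune the unsupervised Reddit vectors from Part 2 on the SARC
# training data. The training file must be in fastText's supervised
# format, e.g. "__label__1 <comment text>" per line.
model = fasttext.train_supervised(
    input="sarc_train.txt",          # placeholder training file
    pretrainedVectors="reddit.vec",  # unsupervised vectors from Part 2
    dim=300,                         # must match the pretrained vector size
    loss="softmax",
    epoch=10,
)
print(model.test("sarc_test.txt"))   # (n, precision@1, recall@1)
```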

1.4 BERT + BiDirectional LSTM

(Code)

(Hyperparam Search Results)
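For orientation, a rough sketch of the architecture this section names: BERT token embeddings feeding a bidirectional LSTM with a linear classification head. The model name, layer sizes, and the frozen-BERT choice are assumptions, not the repo's exact configuration (see the hyperparameter search results above):

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class BertBiLSTM(nn.Module):
    def __init__(self, hidden=256, n_classes=2):
        super().__init__()
        self.bert = AutoModel.from_pretrained("bert-base-uncased")
        self.lstm = nn.LSTM(self.bert.config.hidden_size, hidden,
                            batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, n_classes)

    def forward(self, input_ids, attention_mask):
        with torch.no_grad():  # freeze BERT; train only the LSTM head
            states = self.bert(input_ids,
                               attention_mask=attention_mask).last_hidden_state
        out, _ = self.lstm(states)
        return self.fc(out[:, -1])  # last BiLSTM timestep -> class logits

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
batch = tokenizer(["that went well", "oh great, another outage"],
                  padding=True, truncation=True, return_tensors="pt")
logits = BertBiLSTM()(batch["input_ids"], batch["attention_mask"])
```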

Results

(See)

2. Building Topical Corpora via Pushshift

Pushshift is a free service that ingests real-time comments from Reddit. We query its API to build a corpus of comments from five of the largest English-language political subreddits: r/politics, r/news, r/worldnews, r/unitedkingdom, and r/europe. The corpus contains around 7.5M comments and 150M word tokens. Download the preprocessed corpus here. The pipeline has four stages (a minimal scraping sketch follows the list):

  1. Lexicons
  2. Data Scrape
  3. Filtering and Preprocessing
  4. Word Vectors
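A minimal sketch of the data-scrape stage against the public Pushshift comment endpoint. The paging-by-created_utc scheme and request sizes are assumptions, and Pushshift's availability and rate limits have changed over time:

```python
import time
import requests

URL = "https://api.pushshift.io/reddit/search/comment/"
SUBREDDITS = ["politics", "news", "worldnews", "unitedkingdom", "europe"]

def fetch_comments(subreddit, before=None, size=100):
    # Pull one page of comments for a subreddit, newest first.
    params = {"subreddit": subreddit, "size": size, "sort": "desc"}
    if before:
        params["before"] = before
    return requests.get(URL, params=params, timeout=30).json()["data"]

for sub in SUBREDDITS:
    before = None
    for _ in range(3):  # a few pages per subreddit, as a demo
        batch = fetch_comments(sub, before)
        if not batch:
            break
        before = batch[-1]["created_utc"]  # page backwards in time
        print(sub, len(batch), "comments up to", before)
        time.sleep(1)  # be polite to the API
```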

3. Data Annotation using Prodigy

Using the corpus created above, we annotate a subset of comments from r/politics. Annotation is performed with Prodigy and a custom recipe. The annotations are available here.

Prodigy

Note: Prodigy is not free software.

Try Prodigy with our r/politics corpus.
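For orientation, a sketch of what a custom recipe looks like, following Prodigy's documented recipe pattern; the recipe name, loader, and labels here are illustrative, not the repo's actual recipe:

```python
import prodigy
from prodigy.components.loaders import JSONL

@prodigy.recipe(
    "reddit-sentiment",
    dataset=("Dataset to save annotations to", "positional", None, str),
    source=("Path to a JSONL file of comments", "positional", None, str),
)
def reddit_sentiment(dataset, source):
    stream = JSONL(source)  # each line: {"text": "..."}
    return {
        "dataset": dataset,
        "stream": stream,
        "view_id": "classification",  # accept/reject a single label
    }
```

Such a recipe would be started with something like `prodigy reddit-sentiment my_dataset comments.jsonl -F recipe.py`.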

Contributors

Raunak Agarwal

Luka Borec

TODO

  1. Finish Documentation
  2. Dockerize
  3. Extend Annotations
  4. Extend Graphs
  5. Move files to S3
  6. Add Sentence Vectors
  7. Citations

LICENSE

GPL-3.0

TL;DR: You may copy, distribute, and modify the software as long as you track changes/dates in source files. Any modifications to, or software including (via compiler), GPL-licensed code must also be made available under the GPL, along with build & install instructions.
