Note: This code has been tested only on Linux, but it should not be difficult to rebuild everything on macOS or Windows.
Install Python 3.6:
sudo add-apt-repository ppa:deadsnakes/ppa
sudo apt-get update
sudo apt-get install build-essential python3.6 python3.6-dev python3-venv git
Clone the Repo:
git clone git@github.com:raunak-agarwal/RedditSent-Models.git
Setup the virtual environment:
virtualenv --python=python3.6 venv
source venv/bin/activate
pip install -r requirements.txt
python -m spacy download en_core_web_lg
The Self-Annotated Reddit Corpus (SARC) is the largest publicly available annotated corpus of Reddit comments. We use comments from the balanced section of the corpus to benchmark our models.
The SARC corpus provides a simple baseline: the average of GloVe embeddings fed into a logistic regression classifier. We describe several architectures that perform better.
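To make the baseline concrete, here is a minimal pure-Python sketch of averaging word vectors and fitting a logistic regression on top. It is not the SARC authors' code: the three-dimensional `TOY_VECTORS`, the example comments, the labels, and all function names are made up for illustration, and real GloVe vectors would be loaded from disk instead.

```python
import math

# Toy 3-dimensional stand-ins for GloVe vectors (hypothetical values).
TOY_VECTORS = {
    "great": [0.9, 0.1, 0.0],
    "love": [0.8, 0.2, 0.1],
    "sure": [0.1, 0.9, 0.2],
    "yeah": [0.0, 0.8, 0.3],
    "right": [0.2, 0.7, 0.1],
}


def average_embedding(tokens, vectors, dim=3):
    """Return the mean vector of the known tokens (zeros if none are known)."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(col) / len(known) for col in zip(*known)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(features, labels, epochs=500, lr=0.5):
    """Fit weights and a bias with batch gradient descent on the log loss."""
    dim = len(features[0])
    n = len(features)
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            grad_w = [g + error * xi for g, xi in zip(grad_w, x)]
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict(tokens, weights, bias, vectors):
    """Return 1 if the predicted probability is at least 0.5, else 0."""
    x = average_embedding(tokens, vectors)
    return int(sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) >= 0.5)


# Made-up training data: label 1 marks a comment treated as sarcastic.
comments = [["great", "love"], ["love"], ["yeah", "right"], ["sure", "yeah"]]
labels = [0, 0, 1, 1]

features = [average_embedding(c, TOY_VECTORS) for c in comments]
weights, bias = train_logistic_regression(features, labels)
predictions = [predict(c, weights, bias, TOY_VECTORS) for c in comments]
```

In practice a library implementation of logistic regression would replace the hand-written training loop; the loop is spelled out here only to keep the sketch dependency-free.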
Byte-pair encoding (BPE) tokenises text into subword units. We feed these subword tokens into a TF-IDF + logistic regression pipeline.
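For readers unfamiliar with BPE, the sketch below shows how merge rules can be learned from word frequencies and then applied to segment a new word. It follows the commonly described greedy merge procedure and is not necessarily the tokeniser used in this repo; the word counts and the function names (`learn_bpe`, `segment`) are illustrative assumptions.

```python
import collections
import re


def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Replace every occurrence of the symbol pair with one merged symbol."""
    merged = "".join(pair)
    # Match the pair only on whole-symbol boundaries (symbols are space-separated).
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub(lambda m: merged, word): freq for word, freq in vocab.items()}


def learn_bpe(word_freqs, num_merges):
    """Greedily learn up to num_merges merge rules from word frequencies."""
    # Each word starts as single characters plus an end-of-word marker.
    vocab = {" ".join(word) + " </w>": freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair; ties go to the first seen
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges


def segment(word, merges):
    """Apply the learned merges, in order, to split a word into subwords."""
    symbols = list(word) + ["</w>"]
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]
            else:
                i += 1
    return symbols


# Hypothetical word counts, used only for illustration.
word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges = learn_bpe(word_freqs, num_merges=3)
subwords = segment("lowest", merges)
```

The resulting subword tokens can then be passed to a TF-IDF vectoriser in place of whitespace-separated words.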
Using the corpus described in Part 2, we learn dense, unsupervised vector representations of the Reddit vocabulary. These "pretrained" vectors are then fine-tuned on the SARC training files with a softmax loss.
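As a rough illustration of what fine-tuning pretrained vectors on a softmax loss involves, the sketch below performs plain SGD steps that update both a linear classifier and the word vectors themselves. It is a simplified assumption rather than the repo's training code: the mean pooling of token vectors, the two-dimensional toy vectors, the learning rate, and the name `finetune_step` are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def finetune_step(vectors, class_weights, tokens, label, lr=0.5):
    """Run one SGD step on the cross-entropy loss and return the loss before the update.

    Assumes at least one token is present in `vectors`.
    """
    known = [t for t in tokens if t in vectors]
    # Assumption: a comment is represented by the mean of its token vectors.
    hidden = [sum(col) / len(known) for col in zip(*(vectors[t] for t in known))]
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in class_weights]
    probs = softmax(logits)
    loss = -math.log(probs[label])

    # Gradient of the loss with respect to the logits: probs minus one-hot label.
    delta = [p - (1.0 if k == label else 0.0) for k, p in enumerate(probs)]
    # Gradient with respect to the pooled vector, computed before weights change.
    grad_hidden = [
        sum(delta[k] * class_weights[k][j] for k in range(len(delta)))
        for j in range(len(hidden))
    ]
    # Update the classifier weights.
    for k, row in enumerate(class_weights):
        for j in range(len(row)):
            row[j] -= lr * delta[k] * hidden[j]
    # Update the "pretrained" vectors themselves; this is the fine-tuning part.
    for t in known:
        for j in range(len(hidden)):
            vectors[t][j] -= lr * grad_hidden[j] / len(known)
    return loss


# Toy pretrained vectors and a two-class classifier (hypothetical values).
vectors = {"yeah": [0.1, -0.2], "right": [0.3, 0.1]}
class_weights = [[0.0, 0.0], [0.0, 0.0]]
losses = [finetune_step(vectors, class_weights, ["yeah", "right"], 1) for _ in range(20)]
```

On this single toy example the loss starts at log 2 (a uniform prediction over two classes) and falls as the weights and vectors adapt.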
Pushshift is a free service that ingests comments from Reddit in real time. We query its API to create a corpus of comments from five of the biggest English-language political subreddits: r/politics, r/news, r/worldnews, r/unitedkingdom and r/europe. The corpus has around 7.5 million comments and 150 million word tokens. Download the preprocessed corpus here.
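The sketch below shows how per-subreddit comment queries might be constructed. The endpoint and the parameter names (`subreddit`, `after`, `before`, `size`) are assumptions based on Pushshift's historical public API and may no longer be available or accurate; the time window is arbitrary, and `build_comment_query` is a hypothetical helper, not a function from this repo. The code only builds URLs and makes no network requests.

```python
from urllib.parse import urlencode

# Assumed endpoint, based on Pushshift's historical public API.
PUSHSHIFT_COMMENT_URL = "https://api.pushshift.io/reddit/search/comment/"

# The five subreddits named in the corpus description.
SUBREDDITS = ["politics", "news", "worldnews", "unitedkingdom", "europe"]


def build_comment_query(subreddit, after, before, size=100):
    """Build a query URL for comments posted in one subreddit within a time window."""
    params = {
        "subreddit": subreddit,
        "after": after,    # Unix timestamp, start of the window
        "before": before,  # Unix timestamp, end of the window
        "size": size,      # number of comments requested per call
    }
    return PUSHSHIFT_COMMENT_URL + "?" + urlencode(params)


# Arbitrary one-day window: 2019-01-01 to 2019-01-02 (UTC).
urls = [
    build_comment_query(name, after=1546300800, before=1546387200)
    for name in SUBREDDITS
]
```

Collecting a full corpus would mean paging through successive time windows for each subreddit and storing the returned comments.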
Using the corpus created above, we annotate a subset of comments from r/politics. To perform data annotation, we use Prodigy with a custom recipe. Annotations are available here.
Note: Prodigy is commercial software and is not free.
Try Prodigy with our r/politics corpus.
- Finish Documentation
- Dockerize
- Extend Annotations
- Extend Graphs
- Move files to S3
- Add Sentence Vectors
- Citations
TL;DR: You may copy, distribute and modify the software as long as you track changes/dates in source files. Any modifications to, or software including (via compiler), GPL-licensed code must also be made available under the GPL along with build and install instructions.