howardyclo / Digestant

Modules for effectively digesting data from Twitter and Reddit using ML, NLP and statistics.

Digestant

See more in the introduction slides, project survey, and demo.

Dev Environment

  • Python 3.x

Setup

  • It is recommended to create a new virtual environment to manage this Python project.
  • Install the Python packages listed in requirements.txt: $ pip install -r requirements.txt.
  • Download the NLTK data: $ python -m nltk.downloader all.
  • Download the SpaCy en_core_web_md model: $ python -m spacy download en_core_web_md.
  • Download the Stanford NER model (stanford-ner-xxxx-xx-xx zip file):
    1. Download it from the official website.
    2. Unzip it and place the stanford-ner-xxxx-xx-xx folder in the project root, renaming the folder to stanford-ner/. (A sanity-check sketch follows this list.)
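
Once everything above is installed, a quick end-to-end check can catch setup problems early. The sketch below is not part of the repo; it assumes the Stanford NER folder was renamed to stanford-ner/ as described above, and the classifier filename is the one shipped in the official distribution (it may differ between releases).

```python
# Sanity check: NLTK data, the SpaCy model, and Stanford NER.
# Stanford NER additionally requires Java on your PATH.
import spacy
from nltk.corpus import stopwords
from nltk.tag import StanfordNERTagger

# NLTK data (installed via `python -m nltk.downloader all`).
assert 'the' in stopwords.words('english')

# SpaCy model (installed via `python -m spacy download en_core_web_md`).
nlp = spacy.load('en_core_web_md')
doc = nlp('Apple is looking at buying a U.K. startup.')
print([(ent.text, ent.label_) for ent in doc.ents])

# Stanford NER, loaded through NLTK's wrapper; the classifier path
# is the one bundled with the official distribution.
tagger = StanfordNERTagger(
    'stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz',
    'stanford-ner/stanford-ner.jar')
print(tagger.tag('Barack Obama visited Stanford University .'.split()))
```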

Usage

  1. Create Twitter and Reddit accounts and follow the accounts you are interested in.
  2. Copy config-sample.json, rename the copy to config.json in the same directory, and fill in the API keys. (Go to your Twitter/Reddit developer console, create an application, and obtain the keys; see the client sketch after this list.)
  3. Crawl Twitter data by running crawlers/twitter_crawler.py. It automatically crawls data and saves it to dataset/twitter/ by default (a conceptual sketch of this step also follows below).
  4. You can customize data entities by modifying domains.json and types.json. (See demo.)
  5. Currently, you can execute demo/demo_howard.ipynb or other notebooks to see the daily digest.
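
For reference, here is a minimal sketch of how the keys in config.json might be loaded and turned into API clients. It assumes the commonly used tweepy and praw libraries; the key names below are illustrative, and the authoritative ones are whatever config-sample.json defines.

```python
import json

import praw    # Reddit API wrapper
import tweepy  # Twitter API wrapper

with open('config.json') as f:
    config = json.load(f)

# Twitter client (key names are illustrative).
tw = config['twitter']
auth = tweepy.OAuthHandler(tw['consumer_key'], tw['consumer_secret'])
auth.set_access_token(tw['access_token'], tw['access_token_secret'])
twitter = tweepy.API(auth)
print(twitter.verify_credentials().screen_name)

# Read-only Reddit client (key names are illustrative).
rd = config['reddit']
reddit = praw.Reddit(client_id=rd['client_id'],
                     client_secret=rd['client_secret'],
                     user_agent='digestant')
print(reddit.read_only)  # True for app-only credentials
```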

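Conceptually, a crawler run boils down to fetching your home timeline and dumping the raw tweets under dataset/twitter/. The sketch below illustrates that idea and is not the actual crawlers/twitter_crawler.py; the config key names and output filename are assumptions.

```python
import datetime
import json
import pathlib

import tweepy

with open('config.json') as f:
    tw = json.load(f)['twitter']  # illustrative key names

auth = tweepy.OAuthHandler(tw['consumer_key'], tw['consumer_secret'])
auth.set_access_token(tw['access_token'], tw['access_token_secret'])
api = tweepy.API(auth, wait_on_rate_limit=True)

# Fetch the most recent tweets from the home timeline.
tweets = api.home_timeline(count=200, tweet_mode='extended')

# Save the raw tweet JSON under dataset/twitter/, one file per day.
out_dir = pathlib.Path('dataset/twitter')
out_dir.mkdir(parents=True, exist_ok=True)
out_file = out_dir / f'{datetime.date.today().isoformat()}.json'
with out_file.open('w') as f:
    json.dump([t._json for t in tweets], f)
print(f'Saved {len(tweets)} tweets to {out_file}')
```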

Languages

Jupyter Notebook 96.8%, Python 3.2%