basaldella / reddit-scraper

A Reddit scraper based on Pushshift and Praw.

Reddit Scraper

A Python Reddit scraper based on Praw and Pushshift.

This script allows you to:

  • Download a list of posts;
  • Download a list of subreddits;
  • Make arbitrary API calls to Pushshift to build more refined datasets.

Usage should be fairly self-explanatory. The only thing you need to know is that you must get an API key from Reddit, copy it into reddit_config.sample.py, and rename the file to reddit_config.py.
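As a rough illustration of what the renamed reddit_config.py might hold, here is a minimal sketch. The variable names (CLIENT_ID, CLIENT_SECRET, USER_AGENT) and the praw_kwargs helper are hypothetical and are not taken from the repository; check reddit_config.sample.py for the names the script actually expects. The three keys returned by the helper match keyword arguments that praw.Reddit accepts.

```python
# reddit_config.py (hypothetical layout; the real sample file may use other names)

# Credentials issued by Reddit when you register an application.
CLIENT_ID = "your-client-id"
CLIENT_SECRET = "your-client-secret"

# Reddit asks API clients to send a descriptive user agent string.
USER_AGENT = "reddit-scraper by u/your-username"


def praw_kwargs():
    """Return the credentials as keyword arguments for praw.Reddit(**kwargs).

    This helper is an assumption made for illustration only.
    """
    return {
        "client_id": CLIENT_ID,
        "client_secret": CLIENT_SECRET,
        "user_agent": USER_AGENT,
    }
```

Keeping the credentials in a separate module like this lets the sample file stay in version control while the renamed file with real keys stays out of it.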

usage: reddit-scraper.py [-h]
                         (--posts POSTS_FILE | --subs SUBS_FILE | --config CONFIG_FILE)
                         [--start START_DATE] [--end END_DATE] --output
                         OUTPUT_FOLDER [--blacklist BLACKLIST_FILE]
                         [--workers NUM_WORKERS]

Scrapes subreddits and puts their content in a plain text file.
Use with --posts to download posts, --subs to download
subreddits, and --config to make custom Pushshift API calls. 

optional arguments:
  -h, --help            show this help message and exit
  --posts POSTS_FILE    A file containing the list of posts to scrape, one per
                        line.
  --subs SUBS_FILE      A file containing the list of subreddits to scrape,
                        one per line.
  --config CONFIG_FILE  A file containing the arguments for the Pushshift
                        APIs.
  --start START_DATE    The date to start parsing from, in YYYY-MM-DD format
  --end END_DATE        The final date of the parsing, in YYYY-MM-DD format
  --output OUTPUT_FOLDER
                        The output folder
  --blacklist BLACKLIST_FILE
                        A file containing the lines to skip.
  --workers NUM_WORKERS
                        Number of parallel scraper workers
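
The help text above describes one required, mutually exclusive input mode (--posts, --subs, or --config) plus a required --output. A minimal argparse sketch that reproduces that interface could look like the following. It is reconstructed from the help output only, not copied from reddit-scraper.py; the build_parser name, the integer type of --workers, and its default of 1 are assumptions.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented command-line interface."""
    parser = argparse.ArgumentParser(
        prog="reddit-scraper.py",
        description="Scrapes subreddits and puts their content in a plain text file.",
    )

    # Exactly one of the three input modes must be given.
    mode = parser.add_mutually_exclusive_group(required=True)
    mode.add_argument("--posts", metavar="POSTS_FILE",
                      help="A file containing the list of posts to scrape, one per line.")
    mode.add_argument("--subs", metavar="SUBS_FILE",
                      help="A file containing the list of subreddits to scrape, one per line.")
    mode.add_argument("--config", metavar="CONFIG_FILE",
                      help="A file containing the arguments for the Pushshift APIs.")

    parser.add_argument("--start", metavar="START_DATE",
                        help="The date to start parsing from, in YYYY-MM-DD format")
    parser.add_argument("--end", metavar="END_DATE",
                        help="The final date of the parsing, in YYYY-MM-DD format")
    parser.add_argument("--output", metavar="OUTPUT_FOLDER", required=True,
                        help="The output folder")
    parser.add_argument("--blacklist", metavar="BLACKLIST_FILE",
                        help="A file containing the lines to skip.")
    # Assumption: workers is an integer and defaults to a single process.
    parser.add_argument("--workers", metavar="NUM_WORKERS", type=int, default=1,
                        help="Number of parallel scraper workers")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["--subs", "subreddits.txt", "--output", "scraped", "--workers", "8"]
    )
    print(args.subs, args.output, args.workers)
```

With this layout, passing two input modes at once (for example --posts together with --subs) makes argparse exit with a usage error, which matches the parenthesized alternatives in the usage line.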

Examples:

  1. Download all posts in the subreddits listed in subreddits.txt, from January 1, 2015 to December 31, 2016, using 8 parallel processes, ignoring the lines defined in blacklist.txt, and save them in scraped/:

    python reddit-scraper.py --subs subreddits.txt --output scraped --start 2015-01-01 --end 2016-12-31 --workers 8 --blacklist blacklist.txt

  2. Download the posts listed in posts.txt and save them in scraped/:

    python reddit-scraper.py --posts posts.txt --output scraped

  3. Use the Pushshift API to search for posts on Reddit, using the parameters provided in config.default.txt, from January 1, 2019 to January 2, 2019, using 8 parallel processes, and save them in scraped/:

    python reddit-scraper.py --config config.default.txt --output scraped --start 2019-01-01 --end 2019-01-02 --workers 8
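
The --start and --end dates in the examples above are given in YYYY-MM-DD format, while Pushshift's search endpoints have accepted time bounds as epoch timestamps through the after and before parameters. The sketch below shows one way such a date range could be converted and split into day-long windows that parallel workers might process independently. The function names, the use of UTC, the one-day window size, and treating the end date as exclusive are all assumptions for illustration; the script itself may handle dates differently.

```python
from datetime import datetime, timedelta, timezone


def parse_date(date_str):
    """Parse a YYYY-MM-DD string as midnight UTC (UTC is an assumption)."""
    return datetime.strptime(date_str, "%Y-%m-%d").replace(tzinfo=timezone.utc)


def to_epoch(date_str):
    """Convert a YYYY-MM-DD string to an integer epoch timestamp in seconds."""
    return int(parse_date(date_str).timestamp())


def daily_windows(start, end):
    """Yield (after, before) epoch pairs, one per day, covering [start, end)."""
    current = parse_date(start)
    stop = parse_date(end)
    while current < stop:
        nxt = min(current + timedelta(days=1), stop)
        yield int(current.timestamp()), int(nxt.timestamp())
        current = nxt


if __name__ == "__main__":
    for after, before in daily_windows("2019-01-01", "2019-01-03"):
        print(after, before)
```

Each pair could then be passed as the after and before values of a separate request, so that a pool of workers covers the whole range without overlapping.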


License: Apache License 2.0

