csuttles / pyscraper

Thread safe fun with requests

What is this?

This is just another URL scraper, written in Python.

Why?

This is a simple example of thread-safe use of requests, with retry, backoff, and support for user-supplied headers. It is small enough to be easy to comprehend, but it also exposes tunables and options so you can experiment and find the point of diminishing returns for parallelized requests.
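
The core pattern is a pool of worker threads where each thread keeps its own requests.Session, with retry and backoff configured on the session's transport adapter. Below is a minimal sketch of that pattern, assuming requests and urllib3; it is a simplified stand-in for, not a copy of, the actual scrape.py code, and names like fetch() and the header value are illustrative:

# Minimal sketch: one requests.Session per thread (Session objects are
# not guaranteed thread-safe), with retry/backoff handled by urllib3's
# Retry on the transport adapter. Illustrative only, not scrape.py.
import threading
from concurrent.futures import ThreadPoolExecutor

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

HEADERS = {"User-Agent": "pyscraper-example"}  # stand-in for user-supplied -H headers
thread_local = threading.local()

def get_session():
    # Lazily build one Session per thread, mounting an adapter that
    # retries transient failures with exponential backoff.
    if not hasattr(thread_local, "session"):
        session = requests.Session()
        retry = Retry(total=3, backoff_factor=0.5,
                      status_forcelist=(429, 500, 502, 503, 504))
        adapter = HTTPAdapter(max_retries=retry)
        session.mount("http://", adapter)
        session.mount("https://", adapter)
        session.headers.update(HEADERS)
        thread_local.session = session
    return thread_local.session

def fetch(url, timeout=10):
    # Each worker reuses its own thread-local session.
    try:
        response = get_session().get(url, timeout=timeout)
        return url, response.status_code
    except requests.RequestException as exc:
        return url, exc

if __name__ == "__main__":
    urls = ["https://example.com", "https://example.org"]
    with ThreadPoolExecutor(max_workers=4) as pool:
        for url, result in pool.map(fetch, urls):
            print(url, result)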

Install

  • git clone this repo.
  • create a virtualenv: python3 -m venv venv
  • activate venv: . ./venv/bin/activate
  • install requirements: pip install -r requirements.txt
  • run the program, as described in "Usage" below
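
Put together, assuming a POSIX shell (the clone URL is inferred from the repo name):

git clone https://github.com/csuttles/pyscraper.git
cd pyscraper
python3 -m venv venv
. ./venv/bin/activate
pip install -r requirements.txt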

Usage

./scrape.py --help
usage: scrape.py [-h] [-f FILE] [-l LOGFILE] [-t THREADS] [-H HEADERS]
                 [--timeout TIMEOUT] [-d]

optional arguments:
  -h, --help            show this help message and exit
  -f FILE, --file FILE  file to read (list of URLs to scrape, default =>
                        list.txt)
  -l LOGFILE, --logfile LOGFILE
                        logfile to write (default => log.txt)
  -t THREADS, --threads THREADS
                        number of threads to run
  -H HEADERS, --headers HEADERS
                        headers to pass to the worker, specified in json
                        format
  --timeout TIMEOUT     timeout for each request
  -d, --debug           enable debugging
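
For example, to scrape the default list with eight threads, a five-second timeout, and an extra header (the thread count, timeout, and header value here are illustrative):

./scrape.py -f list.txt -t 8 --timeout 5 -H '{"User-Agent": "pyscraper"}'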

The included list.txt comes from a handy gist of URLs. It can also be fun to point the scraper at your personal browser history or bookmarks.
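
If you supply your own file, the expected format appears to be one URL per line, e.g.:

https://example.com
https://example.org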

License

MIT License

