jungrishi / forkkit

Web crawler to mine album review scores and metadata from pitchfork.com

Forkkit - Pitchfork's album review scraper

A Python scraper that will be integrated into my final project, in a repository yet to come

Database as of 25th May 2020 => 20,077 albums reviewed


Installing the scraper

  1. Clone the repository
  2. In the terminal, create a virtual environment by typing
    $ virtualenv -p python3 .
    This project was conceived using Python 3.7
  3. To install the requirements, type in the terminal
    $ . bin/activate
    $ pip install -r requirements.txt

Installing the required libraries

  1. The script uses the excellent ORM peewee, which you probably don't have installed. To get it, type
    $ pip install peewee
  2. It also uses the requests-html library for the heavy lifting (fetching and parsing the HTML pages). To install, hit
    $ pip install requests-html
  3. To fetch the artwork URLs, I had to use BeautifulSoup, because each URL lives in the src attribute of an img tag nested inside a div. Since src is an attribute and not a proper HTML tag, the requests-html method does not really work for grabbing it (see the sketch after this list).
    $ pip install beautifulsoup4
  4. To parse and format the publication date as YYYY-MM-DD instead of 'January 1 2020', so the data is better handled by the SQL database, the htmldate library is used. It can be installed with
    $ pip install htmldate
    $ pip install --upgrade htmldate
    $ pip install git+https://github.com/adbar/htmldate.git
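
As a rough illustration of how these two libraries work together, here is a minimal sketch. The HTML fragment, class name, and variable names below are invented for the example; the real scraper may do this differently.

    # Illustrative only: pull an artwork URL with beautifulsoup4 and a
    # normalized YYYY-MM-DD date with htmldate. Markup and names are made up.
    from bs4 import BeautifulSoup
    from htmldate import find_date

    html = """
    <html><head>
      <meta property="article:published_time" content="2020-05-25T09:00:00">
    </head><body>
      <div class="album-art"><img src="https://media.pitchfork.com/photos/example/artwork.jpg"></div>
    </body></html>
    """  # stand-in for a downloaded review page

    soup = BeautifulSoup(html, "html.parser")
    img = soup.find("img")                     # the URL sits in the img tag's src attribute
    artwork_url = img["src"] if img else None

    pub_date = find_date(html)                 # '2020-05-25', or None if no date markers are found

    print(artwork_url, pub_date)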

Creating the database

  1. To create the database file with the preset tables, type
    $ python3 models.py
  2. Voilà! You should now have an albums.db file in your folder
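
If you are curious what a peewee model file for this data could look like, here is a hypothetical sketch. The field names are guesses based on the columns listed in the next section; the repository's actual models.py may differ.

    # Hypothetical models.py sketch using peewee; not the repository's actual code.
    from peewee import (SqliteDatabase, Model, AutoField, CharField,
                        FloatField, IntegerField, DateField)

    db = SqliteDatabase("albums.db")

    class Album(Model):
        id = AutoField()                  # database id
        url = CharField(unique=True)      # pitchfork's album review url
        pub_date = DateField(null=True)   # publication date (YYYY-MM-DD)
        score = FloatField(null=True)     # album score
        year = IntegerField(null=True)    # album year
        label = CharField(null=True)      # record label
        genre = CharField(null=True)      # genre
        artwork = CharField(null=True)    # artwork URL
        title = CharField(null=True)      # review title
        artist = CharField(null=True)     # artist
        album = CharField(null=True)      # album

        class Meta:
            database = db

    if __name__ == "__main__":
        db.connect()
        db.create_tables([Album])         # running the file creates albums.db
        db.close()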

What exactly does the scraper do?

  • The script parses all of Pitchfork's album reviews. Yes, that's right. There are album reviews dating back to 1999... And they will be parsed too. As of today (May 2020) there are 1,876 published review pages, amounting to 20,141 unique album reviews.
  • As you can probably guess, I ain't got no time to browse each one of them manually.
  • The scraper therefore parses every single album review published on Pitchfork, collects the following data and inserts it into the database (an illustrative insert appears after this list):
    database id
    pitchfork's album review url
    publication date
    album score
    album year
    record label
    genre
    artwork URL
    review title
    artist
    album
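
Assuming a model along the lines of the sketch above, storing one scraped review could look like the snippet below. Every value is invented, and the real forkkit.py may structure this differently.

    # Illustrative insert of one scraped review; all values are made up.
    from models import Album, db   # assumes the hypothetical models.py sketched above

    db.connect(reuse_if_open=True)
    Album.get_or_create(
        url="https://pitchfork.com/reviews/albums/some-review/",
        defaults={
            "pub_date": "2020-05-25",
            "score": 8.2,
            "year": 2020,
            "label": "Some Label",
            "genre": "Rock",
            "artwork": "https://media.pitchfork.com/photos/example/artwork.jpg",
            "title": "Artist: Album Album Review",
            "artist": "Artist",
            "album": "Album",
        },
    )                               # get_or_create avoids inserting the same review twice
    db.close()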

Running the scraper

  1. To run the scraper, type in the terminal
    $ python3 forkkit.py
    and wait - gathering all this data may take a while!

Changing the variables

  1. In the forkkit.py file, you can change a couple of variables:
  • MAX_WORKERS = can be increased to speed up the script.
  • RANGE = the ranges that worked best for me were 1-501, 501-1001, 1001-1501, etc. Batches of 500 pages per run made for a smooth run and easy data checks on my computer.
  • RECURSION_DEPTH = should be kept at 1 to avoid duplicates.
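
How forkkit.py actually wires these variables together is not shown here; purely as an illustration, one plausible pattern is a thread pool sized by MAX_WORKERS walking over the listing pages in RANGE.

    # Hypothetical sketch of how MAX_WORKERS and RANGE could drive the crawl;
    # the real forkkit.py may be organized differently.
    from concurrent.futures import ThreadPoolExecutor

    MAX_WORKERS = 8            # more workers = faster runs, but heavier load on the site
    RANGE = range(1, 501)      # batch of review-listing pages to crawl in this run

    def scrape_listing_page(page_number):
        """Placeholder for the code that parses one page of review links."""
        print(f"scraping listing page {page_number}")

    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        pool.map(scrape_listing_page, RANGE)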

Notes

Special thanks to @nabaskes


License: MIT

