jungrishi / forkkit

Web crawler to mine album review scores and metadata from pitchfork.com

Forkkit - Pitchfork's album review scraper

A Python scraper that will be integrated into my final project, in a repository yet to come

Database as of 25th May 2020 => 20,077 albums reviewed


Installing the scraper

  1. Clone the repository
  2. In the terminal, create a virtual environment by typing
    $ virtualenv -p python3 .
    This project was conceived using Python 3.7
  3. To install the requirements, type in the terminal
    $ . bin/activate
    $ pip install -r requirements.txt

Installing the required libraries

  1. The script uses the excellent ORM peewee, which you probably don't have installed. To get it, type
    $ pip install peewee
  2. It also uses the requests-html library for the heavy lifting (fetching and parsing the HTML pages). To install, hit
    $ pip install requests-html
  3. To fetch the artwork URLs, I had to use BeautifulSoup, because each URL lives in the src attribute of an img tag nested inside a div. Since src is an attribute and not a proper HTML tag, the requests-html method does not really work for grabbing it (see the sketch after this list).
    $ pip install beautifulsoup4
  4. To parse and format the publication date as YYYY-MM-DD instead of 'January 1 2020', so the data is better handled by the SQL database, the htmldate library is used. It can be installed with
    $ pip install htmldate
    $ pip install --upgrade htmldate
    $ pip install git+https://github.com/adbar/htmldate.git
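
As a rough illustration of how these two libraries work together, here is a minimal sketch. The HTML fragment, class name, and variable names below are invented for the example; the real scraper may do this differently.

    # Illustrative only: pull an artwork URL with beautifulsoup4 and a
    # normalized YYYY-MM-DD date with htmldate. Markup and names are made up.
    from bs4 import BeautifulSoup
    from htmldate import find_date

    html = """
    <html><head>
      <meta property="article:published_time" content="2020-05-25T09:00:00">
    </head><body>
      <div class="album-art"><img src="https://media.pitchfork.com/photos/example/artwork.jpg"></div>
    </body></html>
    """  # stand-in for a downloaded review page

    soup = BeautifulSoup(html, "html.parser")
    img = soup.find("img")                     # the URL sits in the img tag's src attribute
    artwork_url = img["src"] if img else None

    pub_date = find_date(html)                 # '2020-05-25', or None if no date markers are found

    print(artwork_url, pub_date)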

Creating the database

  1. To create the database file with the preset tables, type
    $ python3 models.py
  2. Voilà! You should now have an albums.db file in your folder
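
If you are curious what a peewee model file for this data could look like, here is a hypothetical sketch. The field names are guesses based on the columns listed in the next section; the repository's actual models.py may differ.

    # Hypothetical models.py sketch using peewee; not the repository's actual code.
    from peewee import (SqliteDatabase, Model, AutoField, CharField,
                        FloatField, IntegerField, DateField)

    db = SqliteDatabase("albums.db")

    class Album(Model):
        id = AutoField()                  # database id
        url = CharField(unique=True)      # pitchfork's album review url
        pub_date = DateField(null=True)   # publication date (YYYY-MM-DD)
        score = FloatField(null=True)     # album score
        year = IntegerField(null=True)    # album year
        label = CharField(null=True)      # record label
        genre = CharField(null=True)      # genre
        artwork = CharField(null=True)    # artwork URL
        title = CharField(null=True)      # review title
        artist = CharField(null=True)     # artist
        album = CharField(null=True)      # album

        class Meta:
            database = db

    if __name__ == "__main__":
        db.connect()
        db.create_tables([Album])         # running the file creates albums.db
        db.close()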

What exactly does the scraper do?

  • The script parses all of Pitchfork's album reviews. Yes, that's right. There are album reviews dating back to 1999... And they will be parsed too. As of today (May 2020) there are 1,876 published review pages, amounting to 20,141 unique album reviews.
  • As you can probably guess, I ain't got no time to browse each one of them manually.
  • The scraper therefore parses every single album review published on Pitchfork, collects the following data and inserts it into the database (an illustrative insert appears after this list):
    database id
    pitchfork's album review url
    publication date
    album score
    album year
    record label
    genre
    artwork URL
    review title
    artist
    album
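
Assuming a model along the lines of the sketch above, storing one scraped review could look like the snippet below. Every value is invented, and the real forkkit.py may structure this differently.

    # Illustrative insert of one scraped review; all values are made up.
    from models import Album, db   # assumes the hypothetical models.py sketched above

    db.connect(reuse_if_open=True)
    Album.get_or_create(
        url="https://pitchfork.com/reviews/albums/some-review/",
        defaults={
            "pub_date": "2020-05-25",
            "score": 8.2,
            "year": 2020,
            "label": "Some Label",
            "genre": "Rock",
            "artwork": "https://media.pitchfork.com/photos/example/artwork.jpg",
            "title": "Artist: Album Album Review",
            "artist": "Artist",
            "album": "Album",
        },
    )                               # get_or_create avoids inserting the same review twice
    db.close()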

Running the scraper

  1. To run the scraper, type in the terminal
    $ python3 forkkit.py
    and wait - gathering all this data may take a while!

Changing the variables

  1. In the forkkit.py file, you can change a couple of variables:
  • MAX_WORKERS = can be increased to speed up the script.
  • RANGE = the ranges that worked best for me were 1-501, 501-1001, 1001-1501, etc. Batches of 500 pages per run made for a smooth run and easy data checks on my computer.
  • RECURSION_DEPTH = should be kept at 1 to avoid duplicates.
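
How forkkit.py actually wires these variables together is not shown here; purely as an illustration, one plausible pattern is a thread pool sized by MAX_WORKERS walking over the listing pages in RANGE.

    # Hypothetical sketch of how MAX_WORKERS and RANGE could drive the crawl;
    # the real forkkit.py may be organized differently.
    from concurrent.futures import ThreadPoolExecutor

    MAX_WORKERS = 8            # more workers = faster runs, but heavier load on the site
    RANGE = range(1, 501)      # batch of review-listing pages to crawl in this run

    def scrape_listing_page(page_number):
        """Placeholder for the code that parses one page of review links."""
        print(f"scraping listing page {page_number}")

    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        pool.map(scrape_listing_page, RANGE)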

Notes

Special thanks to @nabaskes


License: MIT

