digitalmethodsinitiative / dmi-amazon-recscraper

Amazon Recommendation Scraper

This is a command-line scraper for Amazon's product recommendations. It generates network files that can be used to find patterns, 'rabbit holes' and so on in Amazon's recommendations.

This is far from the first code written to do this, but Amazon makes scraping quite difficult, no longer offering APIs and not making naive request-based scraping easy either.

This script uses Selenium via Python to use an actual browser to go to the Amazon page, and then extracts the recommendations found on the page. This approach has drawbacks: it's slow and resource-intensive. Scraping takes 10-30 seconds per item with this approach. But it works, for now.

Install

To use the scripts you need to install Firefox and a compatible geckodriver. There are many guides out there on how to do the latter, for macOS as well as for Windows.

After you've done that, download the scraper scripts (e.g. by cloning this git repository locally) and run pip install -r requirements.txt to install the required libraries (if pip does not work, try pip3).

Usage

After installing the dependencies, simply invoke the script from the command line:

python scrape.py -i seeds.txt

This assumes you have a file seeds.txt that contains the Amazon product page URLs, one per line, for which the recommendations will be scraped. The results will be saved in GDF files you can open with a network analysis application such as Gephi.
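Because each item takes a while to scrape, it can be worth checking the seeds file before starting. The following is a minimal sketch of such a check, not part of scrape.py: the function name `check_seeds` and the "host contains `amazon.`" rule are assumptions made for illustration.

```python
from pathlib import Path
from urllib.parse import urlparse


def check_seeds(path):
    """Split the lines of a seeds file into plausible Amazon URLs and rejects."""
    valid, rejected = [], []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines are ignored
        parsed = urlparse(line)
        host = parsed.netloc.lower()
        # Assumption: a usable seed is an http(s) URL on an Amazon domain.
        if parsed.scheme in ("http", "https") and "amazon." in host:
            valid.append(line)
        else:
            rejected.append(line)
    return valid, rejected
```

Calling `check_seeds("seeds.txt")` returns two lists; anything in the second list is probably a typo or a stray line that should be fixed before scraping.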

There are other command-line parameters too:

python scrape.py --input seeds.txt --depth 1 --prefix seeds-txt-depth-1

Try python scrape.py --help for a full list. Be very careful with the --depth parameter: the number of items that will be scraped increases extremely quickly if you set this to something greater than 0.
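To get a feeling for how quickly a crawl grows, here is a rough back-of-the-envelope estimate. It is a sketch under assumptions that are not taken from scrape.py: depth 0 scrapes only the seeds, each extra level follows every recommendation found on the previous level, every page yields the same number of recommendations (`per_page`, a made-up figure), and no item is ever seen twice. The result is therefore an upper bound. The 10-30 seconds per item come from the timing mentioned earlier in this README.

```python
def estimate_crawl(seeds, per_page, depth, seconds_per_item=(10, 30)):
    """Return (pages, min_hours, max_hours) as an upper bound for a crawl."""
    # Level 0 is the seeds themselves; each further level multiplies by per_page.
    pages = sum(seeds * per_page ** level for level in range(depth + 1))
    low, high = seconds_per_item
    return pages, pages * low / 3600, pages * high / 3600


if __name__ == "__main__":
    for depth in range(3):
        pages, min_hours, max_hours = estimate_crawl(10, 20, depth)
        print(f"depth {depth}: up to {pages} pages, "
              f"{min_hours:.1f}-{max_hours:.1f} hours")
```

With 10 seeds and a hypothetical 20 recommendations per page, depth 2 already gives up to 4,210 pages, or roughly 12 to 35 hours of scraping. In practice many recommendations overlap, so the real number will be lower.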

A few notes on the data

The GDF files are generated per recommendation type. If you have used Amazon, you may have noticed that it provides many recommendations: 'customers also bought', 'customers also viewed', 'other 4 star items', and so on. These are saved as separate networks, so you will get one network generated with the 'customers also bought' recommendations found on each scraped page, and so on.
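For readers unfamiliar with GDF: it is a plain-text format with a node section and an edge section, each introduced by a header line. The sketch below shows how one network per recommendation type could be written. It is not the code used by scrape.py; the function names, the file naming scheme and the choice of columns are assumptions, and it assumes item identifiers contain no commas.

```python
from collections import defaultdict


def write_gdf(path, edges):
    """Write (source, target) pairs as a directed network in GDF format."""
    nodes = sorted({node for edge in edges for node in edge})
    with open(path, "w", encoding="utf-8") as handle:
        # Node section: one identifier per line.
        handle.write("nodedef>name VARCHAR\n")
        for node in nodes:
            handle.write(f"{node}\n")
        # Edge section: page -> recommended item.
        handle.write("edgedef>node1 VARCHAR,node2 VARCHAR,directed BOOLEAN\n")
        for source, target in edges:
            handle.write(f"{source},{target},true\n")


def write_networks(prefix, recommendations):
    """Group (type, source, target) triples and write one GDF file per type."""
    by_type = defaultdict(list)
    for rec_type, source, target in recommendations:
        by_type[rec_type].append((source, target))
    paths = []
    for rec_type, edges in by_type.items():
        # Hypothetical naming scheme: <prefix>-<recommendation-type>.gdf
        path = f"{prefix}-{rec_type.lower().replace(' ', '-')}.gdf"
        write_gdf(path, edges)
        paths.append(path)
    return paths
```

Given triples such as `("customers also bought", "A", "B")` and `("customers also viewed", "A", "C")`, this writes two files, each of which can be opened in Gephi as a separate network.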

Note that these recommendations are generated by going to Amazon with a browser. When you start scraping, the browser always starts with a blank slate. Nevertheless, Amazon may notice requests from e.g. the same IP address and use the pages that you scrape as input for their recommendations. Some of the recommendations ('recommended for you') explicitly do so. Remember this when analysing the data!

Credits & license

This software was developed by Stijn Peeters at the Digital Methods Initiative, and is distributed under the Mozilla Public License 2.0. See LICENSE for details.
