This is the code related to the article.
Note that this is a proof-of-concept, not a production-ready pipeline: it is meant to demonstrate the potential of the approach, not to be deployed as-is. To set up the environment:
$ python3 -m venv .venv
$ source .venv/bin/activate
$ pip install -r requirements.txt
To run the Scrapy spider on a specific Amazon URL:
$ scrapy runspider \
absa/scraping/amazon.py \
-O reviews.csv \
-a start_url='https://www.amazon.com/FIODIO-Comfortable-Anti-Ghosting-Resistant-Multimedia/product-reviews/B086168Y25/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews'
The spider automatically follows the "next page" links until the required number of items has been scraped.
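The pagination-following idea can be sketched with the standard library alone (the real spider uses Scrapy's selectors and `response.follow`; the `a-last` class used to locate Amazon's "Next page" link here is an assumption for illustration):

```python
from html.parser import HTMLParser


class NextLinkParser(HTMLParser):
    """Extract the href of the 'next page' pagination link.

    Assumes the link sits inside an <li class="a-last"> element,
    as on Amazon review pages at the time of writing.
    """

    def __init__(self):
        super().__init__()
        self.in_next_li = False
        self.next_url = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and "a-last" in attrs.get("class", ""):
            self.in_next_li = True
        elif tag == "a" and self.in_next_li and self.next_url is None:
            self.next_url = attrs.get("href")

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_next_li = False


def find_next_page(html):
    """Return the relative URL of the next review page, or None."""
    parser = NextLinkParser()
    parser.feed(html)
    return parser.next_url
```

In the spider, the equivalent of `find_next_page` feeds each discovered URL back into the crawl until the item limit is reached; when no next link is found, the crawl stops on its own.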
You can also play with the notebook notebooks/scraping.ipynb to see how the scraping works.
The documented analysis pipeline is in notebooks/analysis.ipynb.