ankushduacodes / Multithreaded-amazon-scraper

Scrapes Amazon search results and generates a JSON file to store information (such as title, price, rating stars, etc.) about those results.

Multithreaded-amazon-scraper

Description

This package allows you to search for products on Amazon and scrape useful information about them (price, rating, number of reviews, and more).

Requirements

  • Python 3
  • pip3

Dependencies

pip3 install -r requirements.txt

Usage

  1. Clone this repo or download it as a zip.
  2. Open a terminal or cmd in the download folder.
  3. Run the following command (a concrete example appears after this list):
python3 example.py -w <word you want to search>
  4. The above step will create a .json file (in the same directory as example.py) with the products that were found.
  5. For more help, just run:
python3 example.py --help
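For example, the bundled products.json (see Output Format below) was produced with the search word 'toaster', so an equivalent run would be:

python3 example.py -w toaster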

Information fetched

Attribute name    Description
url               Product URL
title             Product title
price             Product price
rating            Rating of the product
review_count      Number of customer reviews
img_url           Image URL
bestseller        Whether the product is a best seller
prime             Whether the product is supported by Amazon Prime
asin              Product ASIN (Amazon Standard Identification Number)
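
To make these attributes concrete, a single scraped product could look roughly like the dictionary below. The keys follow the table above; the values are invented for illustration, and the real value types and formats may differ.

```python
# Purely illustrative example of one scraped product entry.
# Keys follow the attribute table above; values are made up, and the
# actual types/formats produced by the scraper may differ.
product = {
    "url": "https://www.amazon.com/dp/B00EXAMPLE",
    "title": "Example 2-Slice Toaster",
    "price": "$29.99",
    "rating": "4.5 out of 5 stars",
    "review_count": 1234,
    "img_url": "https://m.media-amazon.com/images/I/example.jpg",
    "bestseller": False,
    "prime": True,
    "asin": "B00EXAMPLE",
}
```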

Output Format

Output is provided in the form of a JSON file. Please refer to products.json as an example file, which was produced with the search word 'toaster'.
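
As a quick way to inspect the output, the file can be loaded with Python's standard json module. This is only a minimal sketch: it assumes the output file is named products.json and that the top level is a list of product objects with the field names from the table above.

```python
# Minimal sketch for inspecting the generated output. Assumes the file is
# named products.json and contains a list of product objects whose keys
# match the attribute table above; adjust if the actual structure differs.
import json

with open("products.json", encoding="utf-8") as f:
    products = json.load(f)

for product in products:
    print(product["title"], product["price"], product["rating"])
```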

Design Decisions

  1. In scraper.py, the get_page_content method retries the request so that a valid connection with Amazon's servers can still be established even if a connection request is denied. (A minimal sketch of this pattern appears after this list.)

  2. The get_request function returns None when a requests.exceptions.ConnectionError occurs. That None value ripples up to the calling functions so the thread can terminate normally, instead of abruptly calling sys.exit(), which would certainly kill the thread but can lead to a deadlock if the thread being killed is holding the GIL.

  3. The get_page_content function returns None if no valid page is found even after retries, in addition to returning None when it receives a None response from get_request.

  4. Decisions 2 and 3 were made keeping in mind that in a multithreaded program many threads work simultaneously, and there may be a case where 1 or 2 out of 10 or 20 threads do not get a valid response (please see check_page_validity and get_request for documentation and more). In that case only those threads are terminated safely, while the others keep working to produce the valid output.
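
The sketch below illustrates the retry-and-return-None pattern described in decisions 1-3. It only mirrors the idea: the real get_request, get_page_content, and check_page_validity live in scraper.py, and their exact signatures, retry count, and validity check are assumptions here.

```python
# Minimal sketch of the retry / return-None pattern described above.
# Function names mirror those mentioned in scraper.py; signatures, the
# retry count, and the validity check are assumptions for illustration.
import requests

MAX_RETRIES = 3  # assumed number of retries


def get_request(url, headers=None):
    """Return a response, or None on a connection error.

    Returning None (instead of calling sys.exit()) lets the calling
    thread unwind its call stack and finish normally.
    """
    try:
        return requests.get(url, headers=headers, timeout=10)
    except requests.exceptions.ConnectionError:
        return None


def check_page_validity(response):
    # Placeholder check; the real project inspects the page content
    # (e.g. for robot/captcha pages) before accepting it.
    return response.status_code == 200


def get_page_content(url, headers=None):
    """Retry a few times; return the page content, or None on failure."""
    for _ in range(MAX_RETRIES):
        response = get_request(url, headers=headers)
        if response is None:        # connection error -> terminate quietly
            return None
        if check_page_validity(response):
            return response.content
    return None                     # no valid page even after retries
```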

Performance Benchmark

On my network connection (results may vary depending on your connection speed)

Number of pages   Number of products   Time
1                 22                   2.657 sec
3                 126                  4.007 sec
7                 390                  8.094 sec
20                426                  12.534 sec

Future Improvements

  • Write unit tests
  • Implement functionality for sending requests from various different proxies
  • Items like books and DVDs may have multiple prices; extract all the prices and categorize them into a price dictionary
  • Add a better way to convert a list of objects into JSON
  • Handle special characters in the content scraped from Amazon

About

Scrapes Amazon search results and generates a JSON file to store information (such as title, price, rating stars, etc.) about those results.

License:GNU General Public License v3.0


Languages

Language:Python 100.0%