muratcansarkalkan / ProductScrape

A script that scrapes Amazon TR and Trendyol products.

Product Scraper

Brief Description: A script that scrapes Amazon TR and Trendyol products.

Introduction:

  • Amazon, a multinational technology company and a leader in international e-commerce, and Trendyol, one of the local e-commerce leaders in Turkey, valued at around $16 billion, are among the most popular websites for online shopping.
  • This program scrapes products from each website for a given search query and a requested number of results. No other parameters are needed. The program consists of 4 scripts.
  • Thanks to multithreading, there is no need to wait for pages to be parsed link by link. As soon as a product is parsed, it is written to the JSON file, so a JSON file with the results collected so far exists even if the program crashes partway through.
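The combination of threaded parsing and per-product writes described above could look roughly like the following. This is a minimal sketch, not the repository's actual code: `append_result`, `scrape_all`, the `parse_product` callback and the worker count are all hypothetical names and values chosen for illustration.

```python
import json
import threading
from concurrent.futures import ThreadPoolExecutor

# One lock shared by all worker threads so that file writes never interleave.
write_lock = threading.Lock()


def append_result(path, record):
    """Add one record to the JSON array stored at `path`, rewriting the file."""
    with write_lock:
        try:
            with open(path, encoding="utf-8") as f:
                results = json.load(f)
        except (FileNotFoundError, json.JSONDecodeError):
            results = []  # first write, or an unreadable file: start fresh
        results.append(record)
        with open(path, "w", encoding="utf-8") as f:
            json.dump(results, f, ensure_ascii=False, indent=2)


def scrape_all(links, parse_product, path="results.json", workers=8):
    """Parse every link in a thread pool, saving each result as soon as it is ready."""

    def job(link):
        try:
            record = parse_product(link)  # hypothetical parser supplied by the caller
        except Exception:
            print("Sorry, the product could not be parsed.")
            return
        append_result(path, record)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(job, links))  # consume the iterator so all jobs finish
```

Because each record is flushed to disk inside `append_result`, a failure on one link only loses that link; everything parsed before it is already in the file.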

How does it work?

  1. Open Windows PowerShell and run python main.py.
  2. You are first asked for a search query; type one.
  3. You are then asked how many product results you want from each website. If you do not enter a number, the program quits.
  4. The search then starts. After the search, the product links are collected into a list for scraping. For Amazon, products unrelated to the search query are discarded. Products whose link is already in the list are also skipped.
  5. Each link in the list is then visited, and the product's title, price and ID are extracted and stored together with the link. For Amazon the ID is the ASIN; for Trendyol it is found at the end of the TITLE.
  6. Each result is written to a JSON file ("results.json") as soon as it is parsed, so when the whole process is complete, no successfully parsed result is left out.
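The input check in step 3 and the duplicate-link filtering in step 4 could be sketched as follows. The helper names `parse_count` and `unique_links` are hypothetical and are not taken from the repository; the sketch only illustrates the behaviour the steps describe.

```python
def parse_count(raw):
    """Return the requested result count, or None if `raw` is not a positive integer."""
    try:
        count = int(raw.strip())
    except ValueError:
        return None  # empty or non-numeric input: the caller should quit
    return count if count > 0 else None


def unique_links(links, limit):
    """Drop repeated links, keep the original order, and cap the list at `limit`."""
    seen = set()
    unique = []
    for link in links:
        if link not in seen:
            seen.add(link)
            unique.append(link)
    return unique[:limit]
```

A caller would read the answer with `input()`, pass it to `parse_count`, and exit when the result is `None`.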
  • A timeout is applied when parsing both the search result pages and the product pages.
  • The program can scrape both Amazon and Trendyol.
  • Headers for URL requests are chosen at random to make scraping more effective, as the websites do not allow visits without a User-Agent.
  • Here is an example of the program running.
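The random header selection and the request timeout mentioned in this list could be sketched as below. This is an illustration under assumptions: it uses the standard library's `urllib` rather than whatever HTTP library the script actually uses, and the User-Agent strings, the `fetch` helper and the 10-second default are placeholders.

```python
import random
import urllib.request

# Example User-Agent strings; the script's real list is not shown in this README.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
]


def random_headers():
    """Pick a User-Agent at random for the next request."""
    return {"User-Agent": random.choice(USER_AGENTS)}


def fetch(url, timeout=10):
    """Return the page body as text, or None on an HTTP error, network failure or timeout."""
    request = urllib.request.Request(url, headers=random_headers())
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except OSError:
        # HTTPError, URLError and socket timeouts are all subclasses of OSError.
        return None
```

Returning `None` instead of raising lets the caller print its own error message and move on to the next link or website.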

Errors and Limitations:

  • For Amazon, the HTML parser yields 14 results per search page. For Trendyol, the number is 24.
  • Trendyol can load up to 208 pages, so the maximum number of results is 208 × 24 = 4992.
  • When the search link is unavailable, the script prints the error "Sorry, we could not find any products regarding your search from Amazon/Trendyol. (HTTP 503 Error)". If a website is unavailable, the script keeps running and moves on to the next website. Amazon tends to produce this error more often than Trendyol.
  • When a product link is unavailable, the script prints the error "Sorry, the product could not be parsed."
  • After parsing is complete, you can open results.json to see each product's ID, name, price and link.
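The page limits above imply a simple calculation for how many search pages must be fetched for a requested number of results. The sketch below uses the figures stated in this section (14 and 24 results per page, 208 pages for Trendyol); the names `RESULTS_PER_PAGE`, `TRENDYOL_MAX_PAGES` and `pages_needed` are hypothetical.

```python
import math

# Figures taken from the limitations listed above.
RESULTS_PER_PAGE = {"amazon": 14, "trendyol": 24}
TRENDYOL_MAX_PAGES = 208


def pages_needed(site, wanted):
    """Number of search pages to load in order to collect `wanted` results."""
    pages = math.ceil(wanted / RESULTS_PER_PAGE[site])
    if site == "trendyol":
        pages = min(pages, TRENDYOL_MAX_PAGES)  # Trendyol stops at 208 pages
    return pages


def max_trendyol_results():
    """Upper bound on Trendyol results: 208 pages of 24 products each."""
    return TRENDYOL_MAX_PAGES * RESULTS_PER_PAGE["trendyol"]
```

For example, asking for 30 Amazon results requires 3 pages, since 2 pages would only cover 28.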
