mihneadumi / Python-Web-Scraper

An adaptive Python Web Scraper App to catch the best deals by scraping and parsing data from select E-Commerce sites.

Python Adaptive Web Scraper

A Python-based web scraper whose main goal is to get all the products from a website, sort or filter them quickly, and export them to Excel or CSV for further data manipulation or statistics (or just for fun).

Currently Supported Websites

Used Technologies and principles

  • Requests and Selenium WebDriver for getting the website's HTML
  • BeautifulSoup for parsing the retrieved HTML
  • Pandas for quick data manipulation and export
  • I've employed a simplified layered architecture consisting of a UI layer and a Repository layer to reduce dependencies and to keep the project fairly modular, so that I can easily add or remove support for certain websites in the future
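
As a rough illustration of how Requests and BeautifulSoup fit together, here is a minimal sketch. It is not the project's actual code: the function names (`fetch_html`, `parse_products`), the CSS selectors and the sample markup are all assumptions made up for this example, and every real site needs its own selectors.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download a page with Requests and return its HTML text."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response.text


def parse_products(html: str) -> list[dict]:
    """Extract name and price from each product card.

    The selectors below are hypothetical; they match SAMPLE_HTML only.
    """
    soup = BeautifulSoup(html, "html.parser")
    products = []
    for card in soup.select("div.product"):
        name = card.select_one(".name")
        price = card.select_one(".price")
        if name is None or price is None:
            continue  # skip cards that are missing a field
        products.append(
            {
                "name": name.get_text(strip=True),
                "price": float(price.get_text(strip=True)),
            }
        )
    return products


# Made-up markup standing in for a downloaded product page.
SAMPLE_HTML = """
<div class="product"><span class="name">Laptop</span><span class="price">1299.99</span></div>
<div class="product"><span class="name">Mouse</span><span class="price">49.50</span></div>
<div class="product"><span class="name">No price listed</span></div>
"""

print(parse_products(SAMPLE_HTML))
```

In a layered design like the one described above, functions of this kind would plausibly live in the Repository layer, with one parser per supported website, while the UI layer only sees the resulting list of products.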

Opening the app and Installation

  • Python 3.x is required to run the app.
  • After opening the project in your preferred IDE (VSCode has some import problems, covered a little further down), you need to run the following pip installs in the terminal:
  pip install requests
  pip install selenium
  pip install pandas
  pip install bs4
  • Run the 'main.py' file and paste the URL of the supported website you want to scrape.
  • For VSCode users only: if Pylance is acting up, go into the IDE settings (Ctrl+,), search for "pylance path add" and, under "Python › Analysis: Include", click on "Add Item" and enter the path of the folder in which the project is located (ex: "A:\Projects\Python-Web-Scraper").

How to use

After you run the main.py file, you will be greeted by a preview of the list of products and a simple six-option menu, in which you select an option by entering the corresponding number.

Here you can sort or filter the product list before exporting it, or export it directly. The exported Excel/CSV file will be located in the project folder under the name you give it.
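
The sort, filter and export steps map fairly directly onto Pandas. The sketch below shows one way they could look. The helper names (`sort_by_price`, `filter_by_max_price`, `export`), the column names and the sample rows are assumptions for illustration, not the app's real menu code.

```python
import pandas as pd


def sort_by_price(products: pd.DataFrame, ascending: bool = True) -> pd.DataFrame:
    """Return a copy of the table ordered by the 'price' column."""
    return products.sort_values("price", ascending=ascending).reset_index(drop=True)


def filter_by_max_price(products: pd.DataFrame, max_price: float) -> pd.DataFrame:
    """Keep only the rows whose price is at most max_price."""
    return products[products["price"] <= max_price].reset_index(drop=True)


def export(products: pd.DataFrame, path: str) -> None:
    """Write the table to .xlsx or .csv, chosen by the file extension."""
    if path.lower().endswith(".xlsx"):
        # Writing .xlsx needs an Excel engine such as openpyxl to be installed.
        products.to_excel(path, index=False)
    else:
        products.to_csv(path, index=False)


# Made-up sample data standing in for scraped products.
products = pd.DataFrame(
    [
        {"name": "Laptop", "price": 1299.99},
        {"name": "Mouse", "price": 49.50},
        {"name": "Monitor", "price": 199.00},
    ]
)

print(sort_by_price(products))
print(filter_by_max_price(products, 200))
```

For example, `export(sort_by_price(products), "deals.csv")` would write the sorted table to a file named deals.csv in the current working directory.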

Challenges faced

  • I've encountered a weird interaction with JavaScript on some websites such as Altex: the JavaScript triggers about 0.1 seconds after the page loads and only then loads all the products, so the initial HTML that Requests pulls does not contain them.
  • Some websites (cough cough, Altex again) also came with some sort of poorly implemented bot protection that messed up the way Requests pulls the HTML, so I had to use Selenium to trick the website into thinking a human is accessing it. This makes the scraping a little slower, since this method opens an Edge browser and then waits for the JavaScript to load the entire page with products.


Languages

Language:Python 100.0%