gadatos / AmazonBasicsWebScraper

Puppeteer.js, Web Scraping - Script to scrape info from Amazon's web basics page

Amazon Basics Web Scraper

Description:

A script that scrapes product data from the AmazonBasics store page and collects information about each item. The project serves as an exercise in web scraping with Puppeteer.js.

>> skip down to demo and results


Preview

Screen.Recording.2023-04-11.at.2.55.57.PM.mp4

Table of Contents
  1. Getting Started
  2. Preventing Endless Execution with the Timeout Option
  3. Scrolling Behavior
  4. Demo
  5. Acknowledgments

Getting Started

Instructions for getting a copy of the project up and running on your local machine for development and testing.

Built With

  • Puppeteer.js

Prerequisites

The project requires Node.js and npm to be installed.


Installing and Usage

To install dependencies, run the following command:

 npm install 

To run the script, use the following command:

 npm start 

Configuration

The script was configured with the following options:

  • headless: false - runs the browser with its user interface visible. Set it to true to run the browser in headless mode.
  • userDataDir: './tmp' - a temporary directory created to store user data for the browser instance.

To modify these options, edit the puppeteer.launch() method in index.js.
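
As a minimal sketch, the two options listed above could be gathered in a small helper so they are easy to override. The helper name `buildLaunchOptions` is hypothetical and not part of the project; the commented-out `puppeteer.launch()` call assumes the puppeteer package is installed.

```javascript
// Hypothetical helper (not from the project): returns the options object
// described above, with optional overrides merged on top.
function buildLaunchOptions(overrides = {}) {
  return {
    headless: false,        // show the browser's user interface
    userDataDir: './tmp',   // directory for the browser's user data
    ...overrides,
  };
}

// Assumed usage, inside an async function, with puppeteer installed:
// const puppeteer = require('puppeteer');
// const browser = await puppeteer.launch(buildLaunchOptions({ headless: true }));
```

Keeping the defaults in one place means a headless run only needs a single override rather than an edit to every option.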


(back to top)

Preventing Endless Execution with the Timeout Option

Timeout

The script includes a timeout option that determines how long Puppeteer waits for the product items to load. If the scraper does not find 100 items within the specified time, it stops and outputs the number of items it found. By default, the timeout is set to 30 seconds.

To modify the timeout, edit the timeout variable in script2.js.

Note that increasing the timeout can make the script take longer to complete when items load slowly, while decreasing it raises the risk that the scraper stops before finding all 100 items. Set the timeout based on the performance of the website being scraped and the speed of your internet connection.
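
The timeout check described above can be sketched roughly as follows. This assumes the timeout is stored in milliseconds, since it is compared against a `Date.now()` difference; the helper names `hasTimedOut` and `summarize` are hypothetical and only illustrate the idea.

```javascript
// Assumption: 30 seconds expressed in milliseconds.
const timeout = 30000;

// Hypothetical helper: true once the elapsed time reaches the timeout.
function hasTimedOut(start, timeoutMs, now = Date.now()) {
  return now - start >= timeoutMs;
}

// Hypothetical helper: reports how many items were found.
function summarize(itemsLoaded, target = 100) {
  return itemsLoaded >= target
    ? `Found all ${target} items.`
    : `Timed out: found ${itemsLoaded} of ${target} items.`;
}

const start = Date.now();
console.log(hasTimedOut(start, timeout)); // false right after starting
console.log(summarize(42));
```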


(back to top)

Scrolling Behavior

Viewport

The Amazon Basics store page loads more items as you scroll down, rather than requiring a click to go to the next page. Because this layout may depend on the viewport size, the script sets the viewport to a consistent value using the following code:

 await page.setViewport({ width: 1280, height: 720 });

Setting the viewport to a fixed width and height keeps the page layout consistent across machines, so the same scroll-down scraping method works regardless of where the script runs.


While Loop

To ensure that the script finds all 100 product items on the Amazon Basics store page, we use the following while loop. The loop scrolls down the page until 100 items have been loaded, or until the specified timeout has been reached.

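 while (itemsLoaded < 100 && Date.now() - start < timeout) {
     // scroll down by one viewport height to trigger loading of more items
     await page.evaluate(() => {
         window.scrollBy(0, window.innerHeight);
     });
     // wait 1 second for new items to load (plain timer; page.waitForTimeout is deprecated and absent from newer Puppeteer releases)
     await new Promise((resolve) => setTimeout(resolve, 1000));

     // count the product items currently present on the page
     itemsLoaded = await page.$$eval(".ProductGridItem__image__ih70n", (items) => items.length);
 }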
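
For experimentation, the loop above could be wrapped in a reusable function. This is only a sketch: the name `scrollUntilLoaded` and its option names are hypothetical, and a plain timer stands in for the pause between scrolls. Because the function only relies on `page.evaluate` and `page.$$eval`, it can be tried against a stub object without launching a browser.

```javascript
// Hypothetical wrapper around the scroll loop; `page` is a Puppeteer Page
// (or any object exposing evaluate() and $$eval()).
async function scrollUntilLoaded(page, {
  target = 100,                                  // items to wait for
  timeout = 30000,                               // assumed to be milliseconds
  pause = 1000,                                  // delay between scrolls, ms
  selector = '.ProductGridItem__image__ih70n',   // selector used in the README
} = {}) {
  const start = Date.now();
  let itemsLoaded = 0;
  while (itemsLoaded < target && Date.now() - start < timeout) {
    // scroll down by one viewport height (runs in the browser context)
    await page.evaluate(() => {
      window.scrollBy(0, window.innerHeight);
    });
    // give newly requested items time to render
    await new Promise((resolve) => setTimeout(resolve, pause));
    // count the matching elements currently on the page
    itemsLoaded = await page.$$eval(selector, (items) => items.length);
  }
  return itemsLoaded; // may be below target if the timeout was reached
}
```

Returning the final count lets the caller decide how to report a partial result when the timeout is hit.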

(back to top)

Demo

Movie

Screen.Recording.2023-04-11.at.2.55.57.PM.mp4

Results

Sample:

Screen Shot 2023-04-11 at 4 07 50 PM

(back to top)

Acknowledgments

(back to top)
