grikomsn / amazon-cell-phones-reviews

Scrape (un)locked cell phone ratings and reviews on Amazon

Home Page: https://www.kaggle.com/grikomsn/amazon-cell-phones-reviews

Amazon Cell Phones Reviews

๐Ÿฑ Scrape (un)locked cell phone ratings and reviews on Amazon ๐Ÿ“ฑ

Features ✨

  • Scrape basic metadata with ratings and reviews
  • Scrape all or specific brands
  • Scrape unlocked, locked, or both cell phones
  • Use multiple Puppeteer pages as workers

See the Configuration section for how to customize these settings.

Download Data 📫

You can download pre-scraped datasets from Kaggle.

Manual Scrape 🔧

Requirements 📃

Packages Used 📦

Steps 👨‍🔬

Preparation

  • Install the dependencies by running npm install or yarn.
  • (Optional) Copy config.default.ts to config.ts (this file is ignored by git) and customize the config variables in config.ts.
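
As a rough illustration of the optional step above, a customized config.ts might look like the sketch below. The Config interface, the Category and BrandKeyword types, and the config variable name are assumptions inferred from the option names in the Configuration section; check config.default.ts for the actual shape and export style.

```typescript
// Hypothetical shape, inferred from the documented options; the real
// config.default.ts may declare or export these differently.
type Category = 'unlocked' | 'locked' | 'both';

interface BrandKeyword {
  brand: string;
  keywords: string[];
}

interface Config {
  brands: string[];
  brandKeywords: BrandKeyword[];
  categories: Category;
  numberOfWorkers: number;
}

// Example override: scrape only two brands, unlocked phones only,
// with fewer workers than the default of 8.
const config: Config = {
  brands: ['Apple', 'Samsung'],
  brandKeywords: [
    { brand: 'Apple', keywords: ['iPhone'] },
    { brand: 'Samsung', keywords: ['Haven'] },
  ],
  categories: 'unlocked',
  numberOfWorkers: 4,
};
```

Typing the object this way lets the compiler reject a misspelled option or an invalid categories value before any scraping starts.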

Using Visual Studio Code

  • Open the project directory in Visual Studio Code.
  • Select and execute Scrape Search Results in the launch options on the Debug tab (exported to ./data/yyyymmdd-results.csv).
  • Then select and execute Scrape Item Reviews (exported to ./data/yyyymmdd-reviews.csv).

Using Command Line

  • Run npm run scrape:items or yarn scrape:items first to scrape initial item results (exported to ./data/yyyymmdd-results.csv).
  • Then run npm run scrape:reviews or yarn scrape:reviews to scrape item reviews (exported to ./data/yyyymmdd-reviews.csv).
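
The yyyymmdd prefix in the export paths above is a date stamp. A minimal sketch of how such a path could be built is shown below; exportPath and pad2 are hypothetical names, and using the local date (rather than UTC) is an assumption, not something the project documents.

```typescript
// Left-pads a one-digit number with a zero, e.g. 5 -> "05".
function pad2(n: number): string {
  return n < 10 ? `0${n}` : String(n);
}

// Hypothetical helper: builds a date-stamped export path in the
// ./data/yyyymmdd-<kind>.csv pattern described above.
function exportPath(kind: 'results' | 'reviews', date: Date = new Date()): string {
  const yyyy = String(date.getFullYear());
  const mm = pad2(date.getMonth() + 1); // getMonth() is 0-based
  const dd = pad2(date.getDate());
  return `./data/${yyyy}${mm}${dd}-${kind}.csv`;
}

console.log(exportPath('results', new Date(2019, 0, 31)));
// prints ./data/20190131-results.csv
```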

Available Scripts 📝

  • scrape:items

    Scrapes and saves entry results for review scraping.

  • scrape:reviews

    Scrapes and saves entry reviews based on scrape:items data.

  • format

    Formats all .ts files.

  • format:data

    Formats .json files in /data.

Configuration 🛠

  • brands - string[]

    The brands to scrape.

    Defaults to ten major phone manufacturers. Set to [] (empty array) to disable brand filtering and select all available brands.

    Note that when all brands are selected, scraped items are not assigned a brand. This may be implemented in a future version.

  • brandKeywords - {brand: string, keywords: string[]}[]

    Alternative brand names or keywords used for brand assignment.

    Since the search page does not explicitly state each item's brand, the brand is determined after scraping by comparing the item's URL and title against the brands and brandKeywords values.

  • categories - 'unlocked' | 'locked' | 'both'

    Which categories to scrape: unlocked, locked, or both. If both, workers scrape unlocked results first, then locked results.

  • numberOfWorkers - number

    Number of active 'workers' or pages to use for scraping.

    Note that Amazon's servers may treat too many requests or workers as unusual traffic and return a captcha page instead of the intended result page.
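
To illustrate what numberOfWorkers controls, here is a minimal sketch of a fixed-size worker pool. It is not the project's implementation: runWithWorkers and its parameters are hypothetical names, and the task callback stands in for whatever a single Puppeteer page would do with one item.

```typescript
// Minimal sketch of a fixed-size worker pool. `runWithWorkers` and its
// parameters are hypothetical names, not taken from the project source.
async function runWithWorkers<T, R>(
  items: T[],
  numberOfWorkers: number,
  task: (item: T, workerId: number) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0; // index of the next unclaimed item, shared by all workers

  // Each worker repeatedly claims the next item until none are left.
  // In a scraper, one worker would correspond to one open page.
  async function worker(workerId: number): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await task(items[index], workerId);
    }
  }

  // Never start more workers than there are items, and at least one.
  const count = Math.max(1, Math.min(numberOfWorkers, items.length));
  const workers: Promise<void>[] = [];
  for (let id = 0; id < count; id++) {
    workers.push(worker(id));
  }
  await Promise.all(workers);
  return results; // same order as `items`
}
```

With this design at most numberOfWorkers tasks are in flight at once, so lowering the value directly lowers the request rate, which is the lever to pull when captcha pages start appearing.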

Default Values

{
  brands: [
    'ASUS',
    'Apple',
    'Google',
    'HUAWEI',
    'Motorola',
    'Nokia',
    'OnePlus',
    'Samsung',
    'Sony',
    'Xiaomi',
  ],
  brandKeywords: [
    { brand: 'Apple', keywords: ['iPhone'] },
    { brand: 'Google', keywords: ['Pixel'] },
    { brand: 'HUAWEI', keywords: ['Honor'] },
    { brand: 'Motorola', keywords: ['Moto'] },
    { brand: 'Samsung', keywords: ['Haven'] },
    { brand: 'Sony', keywords: ['Xperia'] },
  ],
  categories: 'both',
  numberOfWorkers: 8,
}
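
The brand assignment described under brandKeywords can be sketched roughly as follows. This is an assumption about the approach, not the project's code: assignBrand is a hypothetical helper that does a simple case-insensitive substring match, and the sample title and URL are made up.

```typescript
interface BrandKeyword {
  brand: string;
  keywords: string[];
}

// Hypothetical helper: returns the first brand whose name or keywords occur
// (case-insensitively) in the item's title or URL, or undefined if none match.
function assignBrand(
  item: { title: string; url: string },
  brands: string[],
  brandKeywords: BrandKeyword[],
): string | undefined {
  const haystack = `${item.title} ${item.url}`.toLowerCase();
  for (const brand of brands) {
    const entry = brandKeywords.find((k) => k.brand === brand);
    const needles = [brand, ...(entry ? entry.keywords : [])];
    if (needles.some((n) => haystack.includes(n.toLowerCase()))) {
      return brand;
    }
  }
  return undefined;
}

// Example with a subset of the default values and a made-up item.
const brands = ['Apple', 'Motorola', 'Samsung'];
const brandKeywords: BrandKeyword[] = [
  { brand: 'Apple', keywords: ['iPhone'] },
  { brand: 'Motorola', keywords: ['Moto'] },
];
const item = { title: 'Moto G7 Power 32GB', url: 'https://www.amazon.com/dp/EXAMPLE' };
console.log(assignBrand(item, brands, brandKeywords)); // prints Motorola
```

Plain substring matching is the simplest option but can produce false positives when a keyword appears inside an unrelated word, so a stricter word-boundary match may be preferable.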

License 👮‍♂️

CC0 1.0 Universal

