Features
- Scrapes basic metadata with ratings and reviews
- Scrape all or specific brands
- Scrape unlocked, locked, or both cell phones
- Use multiple Puppeteer pages as workers
For details on customizing the settings, see the Configuration section.
Download Data
You can download pre-scraped datasets from Kaggle.
Manual Scrape
Requirements
Packages Used
- `puppeteer` for browser-based scraping
- `prettier` for formatting source code
- `ts-node` for running TypeScript scripts
Steps
Preparation
- Make sure the dependencies are installed by running `npm install` or `yarn`.
- (Optional) Copy `config.default.ts` (this file is ignored by git) to `config.ts` and customize the config variables in `config.ts`.
Using Visual Studio Code
- Open the project directory in Visual Studio Code.
- Select and execute Scrape Search Results in the launch options on the Debug tab (exported to `./data/yyyymmdd-results.csv`).
- Then select and execute Scrape Item Reviews (exported to `./data/yyyymmdd-reviews.csv`).
Using Command Line
- Run `npm run scrape:items` or `yarn scrape:items` first to scrape the initial item results (exported to `./data/yyyymmdd-results.csv`).
- Then run `npm run scrape:reviews` or `yarn scrape:reviews` to scrape item reviews (exported to `./data/yyyymmdd-reviews.csv`).
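The `yyyymmdd` prefix in the export paths above is a date stamp. As a minimal sketch of how such a path could be built (the helper name `datedCsvPath` is hypothetical, and using the local date rather than UTC is an assumption, not something the project confirms):

```typescript
// Hypothetical helper, not taken from the project source:
// builds a path such as ./data/20200105-results.csv.
function datedCsvPath(kind: 'results' | 'reviews', date: Date = new Date()): string {
  // Left-pad single-digit months and days with a zero.
  const pad = (n: number): string => ('0' + n).slice(-2);
  const yyyy = String(date.getFullYear());
  const mm = pad(date.getMonth() + 1); // getMonth() is zero-based
  const dd = pad(date.getDate());
  return `./data/${yyyy}${mm}${dd}-${kind}.csv`;
}

console.log(datedCsvPath('results', new Date(2020, 0, 5))); // ./data/20200105-results.csv
```

Because the date is part of the file name, runs on different days would not overwrite each other under this scheme.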
Available Scripts
- `scrape:items`
  Scrapes and saves entry results for review scraping.
- `scrape:reviews`
  Scrapes and saves entry reviews based on `scrape:items` data.
- `format`
  Formats all `.ts` files.
- `format:data`
  Formats `.json` files in `/data`.
Configuration
- `brands` - `string[]`
  Self-explanatory.
  Defaults to ten major phone manufacturers. Set to `[]` (empty array) to disable brand filtering and select all available brands. Note that when all brands are selected, items are not assigned a brand; this will probably be implemented in a future version.
- `brandKeywords` - `{ brand: string, keywords: string[] }[]`
  Alternative brand names or keywords used for brand assignment.
  Since the search page does not explicitly state an item's brand, the brand is determined after scraping by comparing each item's URL and title against the `brands` and `brandKeywords` values.
- `categories` - `'unlocked' | 'locked' | 'both'`
  Also self-explanatory.
  Whether to scrape the unlocked category, the locked category, or both. If `'both'`, workers scrape unlocked results first, then locked results.
- `numberOfWorkers` - `number`
  Number of active 'workers' (pages) to use for scraping.
  Note that Amazon's servers treat too many requests or workers as unusual traffic and return a captcha page instead of the intended results page.
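To illustrate what `numberOfWorkers` controls, here is a minimal sketch of a fixed-size worker pool. It is not the project's actual implementation: the helper `runWithWorkers` and the stand-in `fakeScrape` are hypothetical, and a real handler would presumably drive one Puppeteer page per worker instead of a timer.

```typescript
// Hypothetical helper: runs `handler` over `items` with at most
// `numberOfWorkers` handlers in flight at any time.
async function runWithWorkers<T, R>(
  items: T[],
  numberOfWorkers: number,
  handler: (item: T, workerId: number) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0; // index of the next unclaimed item, shared by all workers

  async function worker(workerId: number): Promise<void> {
    // Claiming an index is synchronous, so no two workers take the same item.
    while (next < items.length) {
      const index = next++;
      results[index] = await handler(items[index], workerId);
    }
  }

  const count = Math.max(1, Math.min(numberOfWorkers, items.length));
  const workers: Promise<void>[] = [];
  for (let id = 0; id < count; id++) workers.push(worker(id));
  await Promise.all(workers);
  return results; // same order as `items`
}

// Stand-in for navigating a page and extracting data (assumption for the demo).
const fakeScrape = (url: string, workerId: number): Promise<string> =>
  new Promise((resolve) => setTimeout(() => resolve(`worker ${workerId} scraped ${url}`), 10));

runWithWorkers(['page-1', 'page-2', 'page-3'], 2, fakeScrape).then((lines) => console.log(lines));
```

With a pool like this, lowering `numberOfWorkers` reduces the number of simultaneous requests, which is the lever the captcha note above refers to.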
Default Values
{
  brands: [
    'ASUS',
    'Apple',
    'Google',
    'HUAWEI',
    'Motorola',
    'Nokia',
    'OnePlus',
    'Samsung',
    'Sony',
    'Xiaomi',
  ],
  brandKeywords: [
    { brand: 'Apple', keywords: ['iPhone'] },
    { brand: 'Google', keywords: ['Pixel'] },
    { brand: 'HUAWEI', keywords: ['Honor'] },
    { brand: 'Motorola', keywords: ['Moto'] },
    { brand: 'Samsung', keywords: ['Haven'] },
    { brand: 'Sony', keywords: ['Xperia'] },
  ],
  categories: 'both',
  numberOfWorkers: 8,
}
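The brand assignment described for `brandKeywords` could look roughly like the sketch below. The function name `assignBrand`, the case-insensitive substring matching, and the example URL are all assumptions for illustration; the project's actual matching rules may differ.

```typescript
interface BrandKeyword {
  brand: string;
  keywords: string[];
}

// Hypothetical helper: returns the first configured brand whose name or
// keyword appears in the item's title or URL, or undefined if none match.
function assignBrand(
  title: string,
  url: string,
  brands: string[],
  brandKeywords: BrandKeyword[],
): string | undefined {
  const haystack = `${title} ${url}`.toLowerCase();
  for (const brand of brands) {
    const entry = brandKeywords.filter((k) => k.brand === brand)[0];
    // Search for the brand name itself plus any alternative keywords.
    const terms = [brand].concat(entry ? entry.keywords : []);
    for (const term of terms) {
      if (haystack.indexOf(term.toLowerCase()) !== -1) return brand;
    }
  }
  return undefined; // no configured brand matched
}

// A subset of the default values above, with a made-up item title and URL.
const brands = ['Apple', 'Motorola', 'Samsung'];
const brandKeywords: BrandKeyword[] = [
  { brand: 'Apple', keywords: ['iPhone'] },
  { brand: 'Motorola', keywords: ['Moto'] },
];

console.log(assignBrand('Moto G7 Power Unlocked', 'https://example.com/moto-g7-power', brands, brandKeywords)); // Motorola
```

Plain substring matching is simple but can produce false positives when a short keyword appears inside an unrelated word, so a stricter implementation might match on word boundaries instead.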