oscarnevarezleal / ecommerce-crawler

Parallel ecommerce crawler using Docker and Puppeteer on GCP

Home Page: https://oscarnevarezleal.github.io/ecommerce-crawler/

Before you start

  • This is mostly a proof of concept with plenty of room for improvement. It goes without saying that this is not a production-ready project.
  • Although it would be very easy to port to AWS, that is not planned for the near future. You can always fork the project and send a PR.
  • Note: If you don't have gcloud installed, please refer to the gcloud SDK installation guide.
  • Note: After you configure your project in GCP, make sure you have your own service-key.json file.
  • Note: If this is your first time using GCP, you need to authenticate your machine to use GCP services by running gcloud auth login from the command line.

Overview

This project takes a list of URLs and transforms them into Jobs. These Jobs are published to Pub/Sub and wait there until they are read by the Worker. When the Worker becomes aware of a Job, it spawns a new crawler using Puppeteer. After the content has been grabbed, the result is persisted in Datastore and the Worker moves on to the next Job in line. This process repeats as long as there are Jobs.
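The flow above can be sketched in a few lines. This is only an illustration and not the project's actual code: an in-memory array stands in for the message queue, a Map stands in for the data store, and the names publishJobs, runWorker, and fakeCrawl are hypothetical. The real worker would launch Puppeteer (asynchronously) where the stub is called.

```javascript
// In-memory stand-ins (assumptions): `queue` for the message queue, `store` for the data store.
const queue = [];
const store = new Map();

// Step 1: turn each URL into a Job and publish it.
function publishJobs(urls) {
  urls.forEach((url, i) => queue.push({ id: i + 1, url }));
  return queue.length;
}

// Step 2: the worker takes one Job at a time, crawls it, persists the result,
// and moves on until no Jobs are left.
function runWorker(crawl) {
  let processedCount = 0;
  while (queue.length > 0) {
    const job = queue.shift();
    const result = crawl(job.url); // the real worker would drive Puppeteer here
    store.set(job.url, result);
    processedCount += 1;
  }
  return processedCount;
}

// A fake crawler so the sketch runs without a browser.
const fakeCrawl = (url) => ({ url, title: `Title of ${url}` });

publishJobs(['https://example.com/a', 'https://example.com/b']);
const processedCount = runWorker(fakeCrawl);
console.log(`processed ${processedCount} jobs, stored ${store.size} results`);
```

Keeping publishing and consuming as separate steps mirrors the two commands in the Usage section: one run generates the messages, another run drains them.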

Usage

Environment

The following environment variables must be set before testing:

  • GAE_APPLICATION - the name of your GCP application
  • GOOGLE_APPLICATION_CREDENTIALS - the path to your service-key.json
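A quick way to catch a missing variable before running anything is a small pre-flight check. The sketch below is not part of the repository; the helper name missingEnv is hypothetical, and only the two variable names come from the list above.

```javascript
// The two variables listed above.
const REQUIRED_ENV = ['GAE_APPLICATION', 'GOOGLE_APPLICATION_CREDENTIALS'];

// Returns the names of required variables that are unset or empty in `env`.
function missingEnv(env) {
  return REQUIRED_ENV.filter((name) => !env[name]);
}

const missing = missingEnv(process.env);
if (missing.length > 0) {
  console.warn(`Missing environment variables: ${missing.join(', ')}`);
}
```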

Steps

Follow these steps to run your application locally:

  • Clone this repository
  • Run npm install
  • Rename urls.sample.js to urls.js and include the URLs you want to crawl
  • Rename config.sample.js to config.js and edit the gcp and descriptor sections
# generate the messages
node index.js
# when finished run the worker
node src/worker.js

Config

Descriptor

The descriptor object is the backbone of this crawler. Here you specify each of the things you want to grab from the page.

| Property  | Type     | Comments                                                              |
|-----------|----------|-----------------------------------------------------------------------|
| name      | String   |                                                                       |
| primary   | Boolean  | Whether this is a primary attribute (think about the saving process)  |
| required  | Boolean  | If set to true, execution stops when the element is not found         |
| selector  | String   | A valid CSS selector                                                  |
| attribute | String   | Optional attribute to grab from the selected element                  |
| format    | Function | A callback to format the grabbed value                                |
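To make the properties concrete, here is a minimal sketch of a descriptor and of how its fields might be applied. The exact shape expected in config.js is not shown in this README, so the array layout, the field values, and the helpers grabFields and fakeLookup are all assumptions; fakeLookup stands in for the real page query so the example runs without a browser.

```javascript
// Hypothetical descriptor: one entry per value to grab from the page.
const sampleDescriptor = [
  { name: 'sku', primary: true, required: true, selector: '[data-sku]', attribute: 'data-sku' },
  {
    name: 'price',
    required: true,
    selector: '.price',
    format: (value) => parseFloat(value.replace(/[^0-9.]/g, '')), // "$19.99" -> 19.99
  },
  { name: 'rating', required: false, selector: '.rating' },
];

// Applies each field; `lookup(selector, attribute)` returns a string or null.
function grabFields(fields, lookup) {
  const result = {};
  for (const field of fields) {
    const raw = lookup(field.selector, field.attribute);
    if (raw === null || raw === undefined) {
      // A missing required element stops execution; optional ones are skipped.
      if (field.required) throw new Error(`Required element not found: ${field.name}`);
      continue;
    }
    result[field.name] = field.format ? field.format(raw) : raw;
  }
  return result;
}

// Fake page content keyed by "selector|attribute" (assumption for the demo).
const fakePage = { '[data-sku]|data-sku': 'ABC-123', '.price|': '$19.99' };
const fakeLookup = (selector, attribute) => {
  const key = `${selector}|${attribute || ''}`;
  return key in fakePage ? fakePage[key] : null;
};

const product = grabFields(sampleDescriptor, fakeLookup);
console.log(product);
```

In this sketch the optional rating field is simply left out of the result because the fake page has no matching element.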

Puppeteer

Puppeteer configuration

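| Property  | Default | Comments                                                                                                                                                                       |
|-----------|---------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| waitUntil | load    | When to consider navigation succeeded. Given an array of event strings, navigation is considered successful after all events have fired. See the Puppeteer docs.              |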
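As an illustration, the sketch below normalizes a waitUntil setting into an array and rejects unknown events. The event names are the ones Puppeteer accepts for navigation (load, domcontentloaded, networkidle0, networkidle2); the helper normalizeWaitUntil and the puppeteerConfig object are hypothetical and not taken from this repository.

```javascript
// Navigation lifecycle events accepted by Puppeteer's waitUntil option.
const WAIT_UNTIL_EVENTS = ['load', 'domcontentloaded', 'networkidle0', 'networkidle2'];

// Accepts a single event or an array of events; defaults to 'load'.
function normalizeWaitUntil(value = 'load') {
  const events = Array.isArray(value) ? value : [value];
  const unknown = events.filter((event) => !WAIT_UNTIL_EVENTS.includes(event));
  if (unknown.length > 0) {
    throw new Error(`Unknown waitUntil event(s): ${unknown.join(', ')}`);
  }
  return events;
}

// Hypothetical config section: wait for both the load event and an idle network.
const puppeteerConfig = { waitUntil: ['load', 'networkidle0'] };
const waitEvents = normalizeWaitUntil(puppeteerConfig.waitUntil);
console.log(waitEvents);

// With Puppeteer this would typically be passed along as:
//   await page.goto(url, { waitUntil: waitEvents });
```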

Aggregates

Aggregation is a process that occurs after all elements have been grabbed.

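| Property | Type     | Comments                                                                                    |
|----------|----------|---------------------------------------------------------------------------------------------|
| name     | String   |                                                                                             |
| source   | Function | Receives the descriptor object, which contains all elements grabbed in the first step       |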
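A minimal sketch of how aggregates could be evaluated follows. It assumes that each source function is called with the object of grabbed values and that its return value is stored under name; that behavior, the sample fields, and the helper applyAggregates are assumptions, not the project's confirmed implementation.

```javascript
// Hypothetical aggregates: each one derives a new value from the grabbed elements.
const sampleAggregates = [
  { name: 'label', source: (grabbed) => `${grabbed.sku}: ${grabbed.title}` },
  { name: 'inStock', source: (grabbed) => grabbed.stock > 0 },
];

// Runs after grabbing: evaluates every aggregate against the grabbed values
// and returns a new object that contains both the originals and the derived values.
function applyAggregates(grabbed, aggregates) {
  const result = { ...grabbed };
  for (const aggregate of aggregates) {
    result[aggregate.name] = aggregate.source(grabbed);
  }
  return result;
}

const grabbedValues = { sku: 'ABC-123', title: 'Blue mug', stock: 4 };
const enriched = applyAggregates(grabbedValues, sampleAggregates);
console.log(enriched);
```

Each source sees only the originally grabbed values here, so aggregates do not depend on the order in which they are listed.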

Saving strategies

  • Datastore (GCP)
  • Others [pending documentation]

Publish

docker build -t gcr.io/$(gcloud config get-value project)/worker .
gcloud docker -- push gcr.io/$(gcloud config get-value project)/worker

Final notes

  • Shutting down your cluster after the workload has finished is strongly recommended to avoid incurring unnecessary charges.
  • If you think this project suits your needs but requires a little tweak, send me a message and I'd be happy to talk about it.

About

License: MIT License


Languages

JavaScript 80.1%, Dockerfile 19.9%