Quantium / Scraper2

NodeJS based Scraper

Scraper2

How it works

The scraping functionality has a simple structure based on three abstract classes:

  • A Scraping class extended from [src/scrapers/AbstractScrap.js]
  • A Doc class extended from [src/docs/AbstractDoc.js]
  • A Product class extended from [src/products/AbstractProduct.js]

Scraping class

A scraping class opens a set of URLs built from the EANs of the products. For now it is not necessary to implement this functionality using the name or the salt of the product (in case it is a medicine). All the EANs come from the producto-prixz collection in the prixz Mongo database.

To create a new Scrap class, you only need to extend AbstractScrap and set the following protected members:

  • _docClass: The imported class that implements AbstractDoc
  • name_: The name that identifies the calls of that specific scraper
  • _collectionWriteObj(price): Function that creates the object passed to MongoDB to upsert the price for the related EAN

A scraping class emits an end event when all the EANs in the collection have been processed.

The scraper uses a Document class to know how to handle each URL and fetch the price from it.
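As a rough illustration of how these pieces fit together, here is a minimal sketch. The real AbstractScrap and AbstractDoc are not shown in this README, so both base classes below are hypothetical stand-ins, and the price event, the run method, the load method and the shape of the upsert object are assumptions made only for the example. The three protected members and the end event come from the description above.

```javascript
const { EventEmitter } = require('events');

// Hypothetical stand-in for a Doc class: always "finds" the same price.
class FakeDoc extends EventEmitter {
    constructor(ean) {
        super();
        this.ean = ean;
    }
    load() {
        this.emit('ready', 45.5);
    }
}

// Hypothetical stand-in for src/scrapers/AbstractScrap.js. The real class reads
// the EANs from MongoDB; this one takes an array and only emits events.
class AbstractScrap extends EventEmitter {
    run(eans) {
        for (const ean of eans) {
            const doc = new this._docClass(ean);
            doc.once('ready', (price) => {
                this.emit('price', ean, this._collectionWriteObj(price));
            });
            doc.load();
        }
        this.emit('end'); // every EAN has been processed
    }
}

// A concrete scraper only sets the three protected members.
class ExampleStoreScrap extends AbstractScrap {
    constructor() {
        super();
        this._docClass = FakeDoc;
        this.name_ = 'example-store';
    }
    _collectionWriteObj(price) {
        // Assumed update shape: store the price under the scraper's name.
        return { $set: { [this.name_]: price } };
    }
}

const scraper = new ExampleStoreScrap();
const events = [];
scraper.on('price', (ean, update) => events.push({ ean, update }));
scraper.on('end', () => events.push('end'));
scraper.run(['0000000000001', '0000000000002']);
```

Since EventEmitter calls listeners synchronously, events holds two price entries followed by the end marker as soon as run returns. A real scraper would most likely finish asynchronously.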

Document class

A Document class emits a ready event when the content is ready.

To create a Document class, extend AbstractDoc and implement any way to get the price for a given EAN. Right now, all the Doc classes set the file-level constant searchURL to a static value and concatenate it with the EAN. That is because all the scraped sites respond to EAN searches directly and use GET variables to do it, generating a URL of the form http(s)://(website)(some query variable)(the EAN number of the product). These are some examples:

Notice the pattern here. All the searches return Aspirina as the product and show the price on the same page, so the scraping is easy in these examples.

You should implement a different Doc class if you want to scrape a more complex website or if you want to use an API, web service, database, or any other method to obtain the prices.
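The searchURL pattern described above could look roughly like the sketch below. The search endpoint, the class name, the load method and the injected fetchHtml function are all hypothetical. In particular, fetchHtml is there only so the sketch runs without network access, and the real AbstractDoc base class is replaced here by a plain EventEmitter.

```javascript
const { EventEmitter } = require('events');

// Hypothetical search endpoint; each real Doc class has its own static searchURL.
const searchURL = 'https://shop.example.com/search?q=';

class ExampleDoc extends EventEmitter {
    // fetchHtml is an assumed (url) => Promise<string> function, injected so the
    // example does not depend on any HTTP library.
    constructor(ean, fetchHtml) {
        super();
        this.url = searchURL + encodeURIComponent(ean);
        this._fetchHtml = fetchHtml;
    }

    async load() {
        this.content = await this._fetchHtml(this.url);
        this.emit('ready', this.content); // content is ready for a Product class
    }
}

// Fake fetcher that returns a fixed page instead of calling the website.
const fakeFetch = async () => '<span class="price">$45.50</span>';

const doc = new ExampleDoc('0000000000001', fakeFetch);
doc.on('ready', (html) => console.log(`Loaded ${doc.url} (${html.length} chars)`));
doc.load();
```

A Doc class backed by an API or a database would keep the same ready event and replace only the body of load.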

Product class

A Product class is optional for scraping, but it is the core of the actual scraping functionality. It is used only to get the price from the given HTML content. Usually, a Document class passes the raw HTML fetched from the resulting page to the Product class, which extracts the price. This kind of class usually uses cheerio or another jQuery-like query engine and/or regular expressions to get the right price.

The prices must be returned in float format.

To create a Product class, override the price getter to analyze the content and return the price.
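A regular-expression based price getter might look like the sketch below. The AbstractProduct stand-in, the constructor argument, the HTML markup and the choice to return null when no price is found are assumptions made for the example. Only the price getter and the float return type come from the description above.

```javascript
// Hypothetical stand-in for src/products/AbstractProduct.js.
class AbstractProduct {
    constructor(content) {
        this.content = content; // raw HTML handed over by a Document class
    }
    get price() {
        throw new Error('price getter not implemented');
    }
}

class ExampleProduct extends AbstractProduct {
    get price() {
        // Assumed markup: <span class="price">$1,234.50</span>
        const pattern = /class="price"[^>]*>\s*\$?\s*([\d,]+(?:\.\d+)?)/;
        const match = pattern.exec(this.content);
        if (!match) {
            return null; // assumption: no price found on the page
        }
        // Drop thousands separators and return a float, as required above.
        return parseFloat(match[1].replace(/,/g, ''));
    }
}

const product = new ExampleProduct('<span class="price">$1,234.50</span>');
console.log(product.price);
```

For pages with more complex markup, a selector engine such as cheerio is usually less brittle than a regular expression.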

Installation

Just run

npm install

Settings file

There is an example settings file with all you need to know to set up a real settings.js file. For security reasons, never share any settings.js file with anyone; every settings.js file must be kept only inside the server or machine with its related environment.

The settings file is a module that exports a single function that receives no parameters and returns a plain object. This was chosen as the best way to load a config file while avoiding extra loading and parsing time.

module.exports = () => {
    return {
        mongo: {
            // For more information about Mongo connection strings, see
            // https://docs.mongodb.com/manual/reference/connection-string/
            url: 'mongodb://<user>:<password>@<url>:<port>,<replica_url>:<replica_port>/<database>?<options>',
            // The name of the Mongo collection that the product EANs are read from
            productCollection: 'producto-prixz',
            // The name of the collection where the scraping results are stored
            scraperCollection: 'scraper'
        }
    };
};

Running

The best way to run this project is using pm2 with the following command:

pm2 start index.js --name Scraper2 --time --watch

The project is full of process.exit(1) calls that end the process on errors, relying on pm2 to restart it.
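The fail-fast pattern mentioned above can be sketched as follows. The failFast helper is hypothetical and is not taken from the project's code. Its exit parameter exists only so the function can be tried without killing the current process.

```javascript
// Log the error, then exit with a non-zero code so that a process manager
// such as pm2 can start a fresh process.
function failFast(err, exit = process.exit) {
    console.error(`[fatal] ${err.message}`);
    exit(1);
}

// Typical use (not executed here): somePromise.catch(failFast);
```

Exiting with code 1 instead of trying to recover keeps the error handling trivial, at the cost of losing any in-progress work on every failure.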
