aws-lambda chrome dom proxy scraper serverless

Serverless DOM scraper

Scrape the DOM of a website with headless Chrome after the page has fully loaded (including JavaScript). Powered by Serverless.

Requirements

Installation

Create a new Serverless project with ES7 support:

$ sls install --url https://github.com/michalsn/sls-chrome-dom --name my-project

Enter the new directory

$ cd my-project

Install the Node.js packages

$ npm install

Usage

Deploy your project

$ sls deploy

To add another function in a new file, create the file and reference it in serverless.yml. The webpack.config.js automatically handles functions in different files.
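As a rough illustration of adding a function in its own file, a new handler could look like the sketch below. The file name `hello.js`, the export `hello`, and the serverless.yml entry shown in the comment are hypothetical and not taken from this project. The handler shape assumes the standard AWS Lambda proxy integration.

```javascript
// hello.js (hypothetical file name, not part of this project)
//
// Assumed matching entry in serverless.yml:
//   functions:
//     hello:
//       handler: hello.hello
//       events:
//         - http:
//             path: hello
//             method: get

// Minimal Lambda proxy handler: reads an optional "name" query parameter
// and answers with plain text.
const hello = async (event) => {
  // API Gateway passes null when the request has no query string.
  const params = (event && event.queryStringParameters) || {};
  const name = params.name || 'world';
  return {
    statusCode: 200,
    headers: { 'Content-Type': 'text/plain' },
    body: `Hello, ${name}`,
  };
};

module.exports = { hello };
```

After a deploy, a function like this would be reachable under its own path next to the existing endpoints.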

Endpoints

By default, all endpoints return text/plain with "flat" data. Full information is returned only when the request asks for the application/json content type.

dom - GET

Query string parameters:

url - (required) - URL address
logBlocked - (optional) - Display blocked resources. Values: 1 or 0
proxy - (optional) - Proxy server address
proxyUsername - (optional) - Proxy username
proxyPassword - (optional) - Proxy password

curl -X GET 'https://your-lambda-address-here/dom?url=https://google.com&logBlocked=1' -H 'Content-Type: application/json'
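The same request URL can be assembled in code, which also takes care of encoding the target address. This is a minimal sketch: `buildDomUrl` is a hypothetical helper name, and the base address is the same placeholder used in the curl example.

```javascript
// Optional string parameters accepted by the dom endpoint.
const OPTIONAL_KEYS = ['proxy', 'proxyUsername', 'proxyPassword'];

// Build a request URL for the dom endpoint from the documented parameters.
// buildDomUrl is a hypothetical helper; baseUrl is your own deployment address.
function buildDomUrl(baseUrl, options) {
  if (!options || !options.url) {
    throw new Error('The "url" parameter is required');
  }
  // Strip trailing slashes so the path always ends in exactly "/dom".
  const endpoint = new URL(baseUrl.replace(/\/+$/, '') + '/dom');
  endpoint.searchParams.set('url', options.url);
  if (options.logBlocked !== undefined) {
    // The endpoint expects 1 or 0, so booleans and numeric strings are normalized.
    endpoint.searchParams.set('logBlocked', Number(options.logBlocked) ? '1' : '0');
  }
  for (const key of OPTIONAL_KEYS) {
    if (options[key]) {
      endpoint.searchParams.set(key, options[key]);
    }
  }
  return endpoint.toString();
}

// Usage with the placeholder address from the curl example.
const requestUrl = buildDomUrl('https://your-lambda-address-here', {
  url: 'https://google.com',
  logBlocked: true,
});
console.log(requestUrl);
```

Using `URLSearchParams` rather than string concatenation keeps a target URL that has its own query string from being mixed up with the endpoint's parameters.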

Sample result (Code 200):

{
    "status": true,
    "data": "some HTML",
    "url": "https://www.google.com/",
    "blockedContentLog": [
        "https://www.google.com/logos/doodles/2018/doodle-snow-games-day-16-5525914497581056.2-s.png",
        "https://fonts.gstatic.com/s/roboto/v18/CWB0XYA8bzo0kSThX0UTuA.woff2",
        ...
    ]
}
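A result of this shape could be consumed from Node.js roughly as follows. This is a sketch under assumptions: `fetchDom` is a hypothetical helper, the global `fetch` requires Node.js 18 or newer, and treating a missing `blockedContentLog` as an empty list is a guess about responses made without `logBlocked=1`.

```javascript
// Request the dom endpoint and return the fields from the JSON result.
// fetchDom is a hypothetical helper; fetchImpl can be swapped for a stub in tests
// and defaults to the global fetch (Node.js 18+ or a browser).
async function fetchDom(requestUrl, fetchImpl = fetch) {
  const response = await fetchImpl(requestUrl, {
    method: 'GET',
    // Mirrors the curl example: this header asks for the full JSON result.
    headers: { 'Content-Type': 'application/json' },
  });
  if (!response.ok) {
    throw new Error(`Request failed with HTTP ${response.status}`);
  }
  const result = await response.json();
  return {
    ok: result.status === true,
    html: result.data,
    finalUrl: result.url,
    // Assumption: the log may be absent when logBlocked is not requested.
    blocked: result.blockedContentLog || [],
  };
}
```

Note that `finalUrl` can differ from the requested address, as in the sample above where https://google.com resolves to https://www.google.com/.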

version - GET

curl -X GET 'https://your-lambda-address-here/version' -H 'Content-Type: application/json'

Sample result (Code 200):

{
    "version": "HeadlessChrome/69.0.3497.81"
}
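If the Chrome version needs to be checked in code, the string from the sample can be split into its parts. This is a sketch: `parseChromeVersion` is a hypothetical helper, and it assumes the value always follows the `Product/major.minor.build.patch` pattern seen in the sample result.

```javascript
// Split a version string such as the one in the sample result into its parts.
// parseChromeVersion is a hypothetical helper; it returns null when the
// string does not match the assumed "Product/1.2.3.4" pattern.
function parseChromeVersion(versionString) {
  const match = /^([^/]+)\/(\d+(?:\.\d+)*)$/.exec(versionString);
  if (!match) {
    return null;
  }
  const [, product, version] = match;
  return {
    product,
    version,
    major: Number(version.split('.')[0]),
  };
}

const info = parseChromeVersion('HeadlessChrome/69.0.3497.81');
console.log(info.product, info.major);
```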

Thanks

MIT License