Serverless DOM scraper
Scrape DOM from website by Chrome after everything is loaded (include JS) - powered by Serverless.
Requirements
Installation
To create a new Serverless project with ES7 support.
$ sls install --url https://github.com/michalsn/sls-chrome-dom --name my-project
Enter the new directory
$ cd my-project
Install the Node.js packages
$ npm install
Usage
Deploy your project
$ sls deploy
To add another function as a new file to your project, simply add the new file and add the reference to serverless.yml
. The webpack.config.js
automatically handles functions in different files.
Endpoints
By default all endpoints returns text/plain
- with "flat" data. Full informations are returned only if we ask for application/json
content type.
dom - GET
Query string parameters:
- url - (required) - URL address
- logBlocked - (optional) - Display blocked resources. Values:
1
or0
- proxy - (optional) - Proxy server address
- proxyUsername - (optional) - Proxy username
- proxyPassword - (optional) - Proxy password
curl -X GET 'https://your-lambda-address-here/dom?url=https://google.com&logBlocked=1' -H 'Content-Type: application/json'
Sample result (Code 200):
{
"status": true,
"data": "some HTML"
"url": "https://www.google.com/",
"blockedContentLog": [
"https://www.google.com/logos/doodles/2018/doodle-snow-games-day-16-5525914497581056.2-s.png",
"https://fonts.gstatic.com/s/roboto/v18/CWB0XYA8bzo0kSThX0UTuA.woff2",
...
]
}
version - GET
curl -X GET 'https://your-lambda-address-here/version' -H 'Content-Type: application/json'
Sample result (Code 200):
{
"version": "HeadlessChrome/69.0.3497.81"
}