alt-dima / pupproxy

Simple demo project for scraping web pages through requests to an API web server and Puppeteer/Chrome


Simple Demo Web Scraper/Crawling API on NodeJS and Puppeteer

This "Web Scraper/Crawling API" provides a web server (Express) with an API for fetching a web page using a headless Chrome browser (puppeteer-extra) through a proxy specified in the .env file. This is useful when a page's JavaScript must be evaluated before the resulting HTML is fetched, which is not possible with curl/wget. For every configured proxy, the app starts a separate Puppeteer/browser instance and opens requested URLs in pages/tabs within it.
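For orientation, here is a minimal sketch of what such an endpoint can look like. This is an illustration, not the project's actual source: the /chrome route and the url/proxyport parameters come from the example URL below; everything else (port mapping, launch options) is assumed.

    const express = require('express');
    const puppeteer = require('puppeteer-extra');

    const app = express();

    app.get('/chrome', async (req, res) => {
      const { url, proxyport } = req.query;
      // The real app keeps one browser per configured proxy; a fresh instance
      // per request is used here only to keep the sketch short. How proxyport
      // maps to a PROXIES entry from .env is assumed.
      const browser = await puppeteer.launch({
        args: [`--proxy-server=127.0.0.1:${proxyport}`],
      });
      try {
        const page = await browser.newPage();
        await page.goto(url, { waitUntil: 'networkidle2' }); // let page JavaScript run first
        res.send(await page.content()); // HTML after JS evaluation
      } finally {
        await browser.close();
      }
    });

    app.listen(process.env.EXPRESSPORT || 3000);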

How to use:

  1. git clone this project
  2. apt install --no-install-recommends ca-certificates fonts-liberation libappindicator3-1 libasound2 libatk-bridge2.0-0 libatk1.0-0 libc6 libcairo2 libcups2 libdbus-1-3 libexpat1 libfontconfig1 libgbm1 libgcc1 libglib2.0-0 libgtk-3-0 libnspr4 libnss3 libpango-1.0-0 libpangocairo-1.0-0 libstdc++6 libx11-6 libx11-xcb1 libxcb1 libxcomposite1 libxcursor1 libxdamage1 libxext6 libxfixes3 libxi6 libxrandr2 libxrender1 libxss1 libxtst6 lsb-release wget xdg-utils
  3. npm install
  4. mv .env.example .env
  5. change PROXIES and EXPRESSPORT as needed in the .env file
  6. npm start
  7. open http://localhost:EXPRESSPORT/chrome?url=http://leserged.online.fr/phpinfo.php&proxyport=12000&headers=%7B%22Header1%22%3A+%22HeaderValue1%22%2C%22Header2%22%3A+%22HeaderValue2%22%7D&cookies=%7B%22Cookie1%22%3A+%22CookieVal1%22%2C%22Cookie2%22%3A+%22CookieVal2%22%7D in your browser (the headers and cookies parameters are URL-encoded JSON; see the snippet below for how such a URL is built)
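The step-7 URL is hard to read in encoded form. A small Node snippet (a hypothetical helper, not part of the project) that builds an equivalent URL:

    // 3000 stands in for whatever EXPRESSPORT you set in .env.
    const params = new URLSearchParams({
      url: 'http://leserged.online.fr/phpinfo.php',
      proxyport: '12000',
      headers: JSON.stringify({ Header1: 'HeaderValue1', Header2: 'HeaderValue2' }),
      cookies: JSON.stringify({ Cookie1: 'CookieVal1', Cookie2: 'CookieVal2' }),
    });
    console.log(`http://localhost:3000/chrome?${params}`);
    // The exact percent-encoding may differ slightly from the example URL,
    // but it decodes to the same JSON.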

To Do:

  • Add support for passing cookies: specify cookies as JSON in the cookies GET-parameter
  • Add support for passing headers: specify headers as JSON in the headers GET-parameter
  • Add support for retrieving cookies
  • Add support for retrieving headers
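For the first two items, Puppeteer already has the needed primitives (page.setExtraHTTPHeaders and page.setCookie). A sketch of how the JSON GET-parameters could be applied before navigation; the helper name is hypothetical, not the project's code:

    // applyRequestOptions is a hypothetical name for illustration only.
    async function applyRequestOptions(page, query, targetUrl) {
      if (query.headers) {
        await page.setExtraHTTPHeaders(JSON.parse(query.headers));
      }
      if (query.cookies) {
        const cookies = Object.entries(JSON.parse(query.cookies))
          .map(([name, value]) => ({ name, value, url: targetUrl }));
        await page.setCookie(...cookies); // call before page.goto(targetUrl)
      }
    }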

Thanks to https://www.digitalocean.com/community/tutorials/how-to-scrape-a-website-using-node-js-and-puppeteer

pupproxy

Simple Demo Forward HTTP Proxy on NodeJS and Puppeteer

A simple forward HTTP proxy for NodeJS that passes the request and its headers to headless Google Chrome/Puppeteer and returns the page content plus the response status code. For every upstream proxy, the app starts a separate Puppeteer/browser instance and opens requested URLs in pages/tabs within it.
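A minimal sketch of the forward-proxy idea (assumed for illustration, not the actual proxyserv.js; it reuses one shared browser instead of one per upstream proxy):

    const http = require('http');
    const puppeteer = require('puppeteer');

    const browserPromise = puppeteer.launch(); // one shared browser in this sketch

    http.createServer(async (req, res) => {
      const page = await (await browserPromise).newPage();
      try {
        const headers = { ...req.headers };
        delete headers.host;                 // let Chrome set Host itself
        delete headers.connection;           // hop-by-hop headers don't forward
        delete headers['proxy-connection'];
        await page.setExtraHTTPHeaders(headers);
        // For a forward proxy, req.url is the absolute target URL.
        const response = await page.goto(req.url, { waitUntil: 'networkidle2' });
        res.writeHead(response.status(), { 'content-type': 'text/html' });
        res.end(await page.content());
      } catch (err) {
        res.writeHead(502);
        res.end(String(err));
      } finally {
        await page.close();
      }
    }).listen(process.env.EXPRESSPORT || 8080);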

How to use:

  1. git clone this project
  2. apt install --no-install-recommends ca-certificates fonts-liberation libappindicator3-1 libasound2 libatk-bridge2.0-0 libatk1.0-0 libc6 libcairo2 libcups2 libdbus-1-3 libexpat1 libfontconfig1 libgbm1 libgcc1 libglib2.0-0 libgtk-3-0 libnspr4 libnss3 libpango-1.0-0 libpangocairo-1.0-0 libstdc++6 libx11-6 libx11-xcb1 libxcb1 libxcomposite1 libxcursor1 libxdamage1 libxext6 libxfixes3 libxi6 libxrandr2 libxrender1 libxss1 libxtst6 lsb-release wget xdg-utils
  3. npm install
  4. mv .env.example .env
  5. change PROXIES and EXPRESSPORT as needed in the .env file
  6. node proxyserv.js
  7. use curl or a browser: set the proxy to localhost:EXPRESSPORT and navigate to any HTTP web page (HTTPS is not supported yet). Example: curl --proxy localhost:EXPRESSPORT http://example.com -v

Note: HTTPS URLs are not supported; supporting them would require handling CONNECT requests and TLS (see the sketch below).
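For context on why HTTPS is harder: the client sends a CONNECT request and then expects a raw TLS tunnel, so the proxy never sees the URL or HTML. The usual options are a blind TCP tunnel, sketched below (which works for curl but bypasses Puppeteer entirely), or terminating TLS with your own certificate (MITM). This is an assumed sketch, not project code:

    const http = require('http');
    const net = require('net');

    const server = http.createServer((req, res) => {
      res.writeHead(501); // plain-HTTP handling (described above) omitted here
      res.end();
    });

    // CONNECT arrives as "CONNECT example.com:443"; tunnel raw bytes through.
    server.on('connect', (req, clientSocket, head) => {
      const [host, port] = req.url.split(':');
      const upstream = net.connect(Number(port) || 443, host, () => {
        clientSocket.write('HTTP/1.1 200 Connection Established\r\n\r\n');
        upstream.write(head); // bytes already read from the client, if any
        upstream.pipe(clientSocket);
        clientSocket.pipe(upstream);
      });
      upstream.on('error', () => clientSocket.end());
    });

    server.listen(8080);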

Thanks to https://dev.to/nimit95/a-simple-http-https-proxy-in-node-js-3810
