maximebories / regexp-scraper

Advanced use of Puppeteer to scrape web search engine results against a RegExp

regexp-scraper

Advanced use of Puppeteer to crawl Google search results and match them against a RegExp. You provide a search query and a regular expression; the expression is matched against either the Google search results page or the full page content, depending on how thorough you want the search to be.
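The matching step described above can be sketched without a browser. This is a minimal illustration, not the script's actual code: `collectMatches` is a hypothetical helper, and the sample strings stand in for text that Puppeteer would extract from the results page or from each visited page.

```typescript
// Hypothetical helper (not taken from the repository): collect every unique
// match of a pattern across a list of page texts.
function collectMatches(texts: string[], pattern: string): string[] {
  const re = new RegExp(pattern, "g"); // global flag so exec() walks all hits
  const found: string[] = [];
  for (const text of texts) {
    re.lastIndex = 0; // restart the scan for each new text
    let m: RegExpExecArray | null;
    while ((m = re.exec(text)) !== null) {
      if (m[0] === "") {
        re.lastIndex++; // avoid looping forever on zero-length matches
        continue;
      }
      if (found.indexOf(m[0]) === -1) found.push(m[0]); // keep first occurrence only
    }
  }
  return found;
}

// Made-up sample texts standing in for scraped page content.
const samplePages = [
  "Track your parcel at http://abcde.fghij.com today",
  "Reminder: http://abcde.fghij.com and http://klmno.pqrst.net",
];
console.log(collectMatches(samplePages, "http://[a-z]{5}.[a-z]{5}.[a-z]+"));
// prints the two distinct URLs: http://abcde.fghij.com and http://klmno.pqrst.net
```

Passing only the results page text gives the quick mode; passing the text of every result page gives the thorough one.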

To install the Node modules and lock their versions:

$ npm update

To run:

$ node main.ts

The example I used here was to find fraudulent phishing URLs sent through text messages, for further investigation. DO NOT click on any of them unless you know what you are doing.

How to use the script

To run the script, use the following command:

$ node main.ts <query> <regexp> <filter>

Replace <query> with the query you want to use for the search, <regexp> with the regular expression you want to match against the page content, and <filter> with 'true' if you want to filter the search results or 'false' if you don't.
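As a rough sketch of how the three positional arguments could be read and validated: `parseArgs` and `ScraperArgs` are hypothetical names, and the strict 'true'/'false' check is an assumption rather than the script's documented behavior.

```typescript
// Hypothetical shape for the parsed command-line arguments.
interface ScraperArgs {
  query: string;
  pattern: RegExp;
  filter: boolean;
}

// Hypothetical parser: expects [query, regexp, filter] in that order.
function parseArgs(argv: string[]): ScraperArgs {
  const [query, regexp, filter] = argv;
  if (!query || !regexp || (filter !== "true" && filter !== "false")) {
    throw new Error("usage: node main.ts <query> <regexp> <true|false>");
  }
  return {
    query,
    pattern: new RegExp(regexp), // throws a SyntaxError on an invalid expression
    filter: filter === "true",
  };
}

// In a real script the input would come from process.argv.slice(2).
const parsed = parseArgs(["parcel sent", "http://[a-z]+", "false"]);
console.log(parsed.query, parsed.filter); // parcel sent false
```

Failing early on a missing or malformed argument avoids launching a browser session only to discover the input was unusable.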

For example, 'Votre colis a été envoyé. Veuillez le vérifier et le recevoir.' ("Your parcel has been sent. Please check it and receive it.") is a common phishing text message in France. To search for it with Google's filtering of similar results disabled, using the regular expression 'http://[a-z]{5}.[a-z]{5}.[a-z]+' to capture the URLs used in this phishing operation, use the following command:

$ node main.ts 'Votre colis a été envoyé. Veuillez le vérifier et le recevoir.' 'http://[a-z]{5}.[a-z]{5}.[a-z]+' false
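One detail of the expression above is worth knowing: an unescaped `.` in a JavaScript regular expression matches any character, not only a literal dot. The sketch below compares the expression as written with an escaped variant; the sample URLs are made up for illustration.

```typescript
// The expression from the example: "." matches any single character.
const loose = new RegExp("http://[a-z]{5}.[a-z]{5}.[a-z]+");
// Escaped variant: "\\." in a string literal becomes "\." and matches only a dot.
const strict = new RegExp("http://[a-z]{5}\\.[a-z]{5}\\.[a-z]+");

const withDots = "http://abcde.fghij.com";
const withoutDots = "http://abcdexfghijxcom"; // "x" where the dots would be

console.log(loose.test(withDots), strict.test(withDots));       // true true
console.log(loose.test(withoutDots), strict.test(withoutDots)); // true false
```

The looser form still captures the intended URLs and may simply return a few extra candidates. If that matters, the escaped pattern could be passed on the command line as 'http://[a-z]{5}\.[a-z]{5}\.[a-z]+', assuming the script hands the argument to `RegExp` unchanged.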


Languages

Language:TypeScript 100.0%