benwinding / scrape-reduce

A simple way to scrape websites

scrape-reduce

A simple way to scrape websites: download the HTML once, then process it as many times as you want.

Get Started

To get going, clone or fork this repo (or use it as a GitHub template):

git clone git@github.com:benwinding/scrape-reduce.git
cd scrape-reduce
npm install

npm run scrape

Scrapes HTML from the target website
Saves the returned HTML to the scraped directory
Runs scrape.ts in the src directory
- Provide your own fetch method, etc.
- Pages are cached by the ID you provide for each page
By default, at most 3 requests run concurrently
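
The scrape step described above can be sketched roughly as follows. This is a minimal illustration, not the code in src/scrape.ts: the names `Page`, `FetchHtml`, and `scrapeAll`, the `<id>.html` file naming, and the worker-queue approach to the concurrency limit are all assumptions made for the example.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Hypothetical page descriptor; the real scrape.ts may use a different shape.
interface Page {
  id: string;
  url: string;
}

// Hypothetical signature for the user-supplied fetch method.
type FetchHtml = (url: string) => Promise<string>;

// Downloads every page that is not already cached on disk, running at most
// `limit` fetches at once. Returns the number of pages actually downloaded.
async function scrapeAll(
  pages: Page[],
  fetchHtml: FetchHtml,
  outDir = "scraped",
  limit = 3,
): Promise<number> {
  fs.mkdirSync(outDir, { recursive: true });
  const queue = [...pages];
  let downloaded = 0;

  // Each worker pulls the next page off the shared queue until it is empty.
  const worker = async (): Promise<void> => {
    for (let page = queue.shift(); page !== undefined; page = queue.shift()) {
      const file = path.join(outDir, `${page.id}.html`);
      if (fs.existsSync(file)) continue; // cache hit: skip the network call
      fs.writeFileSync(file, await fetchHtml(page.url));
      downloaded += 1;
    }
  };

  const workerCount = Math.min(limit, pages.length);
  await Promise.all(Array.from({ length: workerCount }, worker));
  return downloaded;
}
```

Because each page is written to disk as soon as it arrives, an interrupted run leaves the finished pages in place, and a second call to `scrapeAll` with the same IDs only fetches what is still missing.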

npm run reduce

Transforms the local HTML into whatever you need
Saves the returned text to the reduced directory
Runs reduce.ts in the src directory
- You can read the DOM here, find elements, etc.
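
As a rough sketch of the reduce step, the example below reads each saved HTML file and writes one text file per page. It is not the code in src/reduce.ts: `Reducer`, `reduceAll`, `extractTitle`, and the `<id>.html` / `<id>.txt` naming are hypothetical. The title extraction uses a regular expression only to keep the sketch dependency-free; in practice the reducer function is where a DOM parser of your choice would go.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Hypothetical reducer signature: takes one page's HTML and ID, returns text.
type Reducer = (html: string, id: string) => string;

// Runs `reducer` over every .html file in `inDir` and writes the result to
// `outDir` as <id>.txt. No network access is involved. Returns the paths written.
function reduceAll(
  reducer: Reducer,
  inDir = "scraped",
  outDir = "reduced",
): string[] {
  fs.mkdirSync(outDir, { recursive: true });
  const written: string[] = [];
  for (const name of fs.readdirSync(inDir).sort()) {
    if (!name.endsWith(".html")) continue;
    const id = path.basename(name, ".html");
    const html = fs.readFileSync(path.join(inDir, name), "utf8");
    const outFile = path.join(outDir, `${id}.txt`);
    fs.writeFileSync(outFile, reducer(html, id));
    written.push(outFile);
  }
  return written;
}

// Stand-in reducer: pulls the <title> text with a regular expression.
// A real reducer would more likely query a parsed DOM instead.
const extractTitle: Reducer = (html) => {
  const match = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(html);
  return match?.[1]?.trim() ?? "";
};
```

Calling `reduceAll(extractTitle)` can be repeated as often as needed while you refine the reducer, since it only touches files already on disk.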

Features

Avoids downloading too often: scrape only when you need to
Caching means the scrape can be interrupted and resumed
You can iterate quickly with reduce, without making network calls to the target site

MIT License