tsmith512 / brandyshots

Work in progress script to automate crawling a website for screenshots to archive. Written for Brandy!

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Brandy Shots

A tool to automatically crawl capture screenshots of an entire website. Written for a friend, for a thing we used to have to do a lot. Maybe you will find it useful, too.

Installation

  • I recommend nvm for Node version management
  • nvm use (or nvm install && nvm use) to get/activate the correct Node version.
  • npm install

Usage

node craw.js [options] [home page]

  • Example: node crawl.js -r -l https://www.biocomppharma.com

Options

  • -r Generate a report of all URLs captured in a text file.
  • -l Generate a list of links encountered during the crawl.
  • -h Show a list of options and exit.
  • home page should include a protocol and be the starting point of a crawl, usually a site root.

Output

Script will dump the reports (if requested) and PNGs into a directory called output in the current working directory. The user should manually consolidate these images into a PDF at max image quality for archiving.

Future enhancements:

This is quick-n-dirty in a lot of ways. Things I'd like to do:

  • Automatically combine images and reports into a PDF in output instead of that being a manual step.
  • A lot of sites will need some kind of filter to remove things like text-size adjustment links, lightbox links, social media share/profile or outbound redirector links. There's a place in crawl.js noting where to put these filters, but they should be spec'able on the CLI.
  • Should offload the console.log debug messages to a debugger or better output control.

About

Work in progress script to automate crawling a website for screenshots to archive. Written for Brandy!


Languages

Language:JavaScript 100.0%