Web Crawler

Generates a site map of the unique pages linked within the same domain, along with their MIME types. It may use x-ray to scrape more. Writes the JSON trees to the sitemaps/ folder.
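To illustrate what "unique pages linked within the same domain" can mean in practice, here is a minimal sketch. It is not the code in scraper.js; the helper name sameDomainLinks is hypothetical, and the choices to ignore #fragments and non-HTTP links are assumptions. It also relies on the URL class, which needs a newer Node.js than 6.

```javascript
const { URL } = require('url');

// Hypothetical helper (not part of scraper.js): keep only the unique links
// that point at the same host as the page they were found on.
function sameDomainLinks(pageUrl, hrefs) {
  const base = new URL(pageUrl);
  const seen = new Set();
  for (const href of hrefs) {
    let link;
    try {
      link = new URL(href, base); // resolves relative hrefs against the page
    } catch (err) {
      continue; // skip hrefs that cannot be parsed
    }
    if (link.protocol !== 'http:' && link.protocol !== 'https:') continue;
    if (link.hostname !== base.hostname) continue;
    link.hash = ''; // assumption: #fragments refer to the same page
    seen.add(link.href);
  }
  return Array.from(seen);
}

const found = sameDomainLinks('http://www.example.com/a/', [
  '/about',
  'http://www.example.com/about#team',
  'http://other.com/',
  'mailto:someone@example.com',
  'b.html'
]);
console.log(found);
```

Comparing hostnames exactly means 'example.com' and 'www.example.com' count as different sites; a real crawler may want a looser rule.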

The web crawler code (app.js and scraper.js) is from https://github.com/Casper-Oakley/web-scraper

Favicon Grabber

Started as a scraper; it now uses a Google service and, for now, expects a PNG. Places images in data/fav/.
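The README does not say which Google service is used. As a sketch only, the example below assumes the publicly known https://www.google.com/s2/favicons endpoint and a hypothetical naming scheme of data/fav/<hostname>.png; neither is confirmed by this repository.

```javascript
const { URL } = require('url');

// Assumption: the "google service" is the s2/favicons endpoint.
function faviconUrl(siteUrl) {
  const host = new URL(siteUrl).hostname;
  return 'https://www.google.com/s2/favicons?domain=' + encodeURIComponent(host);
}

// Assumption: one PNG per hostname inside data/fav/ (hypothetical file naming).
function faviconPath(siteUrl) {
  const host = new URL(siteUrl).hostname;
  return 'data/fav/' + host + '.png';
}

console.log(faviconUrl('http://www.lmgtfy.com'));
console.log(faviconPath('http://www.lmgtfy.com'));
```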

Visualization

D3 force-directed graph. Some munging is required to get the data into the correct format: run 'node treeToGraph.js' to generate treeGraph.json, the data set for vis.html. Serve the visualizations with 'node serve.js', then open localhost:8080/force.html, vis.html, etc. Various JSON fixtures for D3 experimentation are in the data/ folder.
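The munging step can be pictured as flattening a nested tree into the nodes-and-links shape that a D3 force layout consumes. The sketch below is not the actual treeToGraph.js; the input shape ({ url, children }) and the output field names are assumptions made for illustration.

```javascript
// Hypothetical tree shape: { url: string, children: [ ...same shape ] }.
// Output: { nodes: [{ url }], links: [{ source, target }] }, where source and
// target are indices into nodes, as D3 force layouts accept by default.
function treeToGraph(tree) {
  const nodes = [];
  const links = [];
  const indexByUrl = new Map();

  function visit(node, parentIndex) {
    const seen = indexByUrl.has(node.url);
    if (!seen) {
      indexByUrl.set(node.url, nodes.length);
      nodes.push({ url: node.url });
    }
    const index = indexByUrl.get(node.url);
    if (parentIndex !== null) {
      links.push({ source: parentIndex, target: index });
    }
    // Only descend the first time a page is seen, so cycles terminate.
    if (!seen) {
      (node.children || []).forEach(child => visit(child, index));
    }
  }

  visit(tree, null);
  return { nodes, links };
}

const graph = treeToGraph({
  url: '/',
  children: [
    { url: '/a', children: [{ url: '/' }] },
    { url: '/b' }
  ]
});
console.log(JSON.stringify(graph, null, 2));
```

Each page becomes one node no matter how often it is linked, so a link back to an earlier page becomes an extra edge rather than a duplicate node.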

Prerequisites

Node.js (6+ recommended) and npm

Installation

Normally, to install: npm install

However, this repository currently includes all of its dependencies in node_modules, so it should work as soon as it is cloned.

Usage

Run 'nodejs app.js' and enter the URL of the website you want to target. Currently only URLs in the format 'http://www.lmgtfy.com' are supported; other formats may fail.

The app scrapes the site, then runs the favicon grabber and the treeToGraph.js script on the sitemap to generate a graph for D3. A simple web server is then launched. You can access the visualizations at:

localhost:8080/gravity?file=filename
localhost:8080/force4?file=filename
localhost:8080/force3?file=filename
localhost:8080/force2?file=filename
localhost:8080/force?file=filename

where filename is the name the script prints to the console.

I haven't been able to exercise strict control over the async processes; I'd like to use queue for that. If the program crashes, try lowering the depth limit from 3 to 2 on line 28 of app.js.
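To show why lowering the depth limit reduces the work the crawler does, here is a small synchronous sketch of a depth-limited, breadth-first crawl. It is not the logic in app.js: the function crawl, the getLinks callback, and the in-memory site object are all hypothetical stand-ins for real HTTP requests.

```javascript
// Hypothetical sketch: visit pages breadth-first, following links at most
// maxDepth hops away from the start page.
function crawl(startUrl, maxDepth, getLinks) {
  const visited = new Set([startUrl]);
  let frontier = [startUrl];
  for (let depth = 0; depth < maxDepth && frontier.length > 0; depth++) {
    const next = [];
    for (const url of frontier) {
      for (const link of getLinks(url)) {
        if (!visited.has(link)) {
          visited.add(link);
          next.push(link);
        }
      }
    }
    frontier = next; // pages discovered at this depth are expanded next
  }
  return Array.from(visited);
}

// Stand-in for a website: each page maps to the pages it links to.
const site = { a: ['b', 'c'], b: ['d'], c: [], d: ['e'], e: [] };
const getLinks = url => site[url] || [];

console.log(crawl('a', 3, getLinks)); // follows links up to three hops
console.log(crawl('a', 2, getLinks)); // stops one hop earlier
```

On a real site the number of pages tends to grow quickly with each extra hop, which is why dropping the limit by one can noticeably reduce the load.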

linguistbreaker / crawler-force-layout