ualbertalib / node-warc-proxy

Simple node.js server to allow navigation of the contents of a WARC file

Requirements

Sample WARC used in testing: drupalib.interoperating.info.warc.gz

To run

  • Copy drupalib.interoperating.info.warc.gz to the directory ../warc (relative to the directory where warcnode.js is installed), or to another location of your choice.
  • Decompress it:
        gunzip drupalib.interoperating.info.warc.gz
  • Generate the CSV index in the same directory as the WARC file:
        warcindex drupalib.interoperating.info.warc > drupalib.interoperating.info.warc.csv
  • In the directory containing warcnode.js, start the server:
        node warcnode.js --warc ../warc/drupalib.interoperating.info.warc

(or substitute the path to your WARC)
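
The index step presumably exists so that a record can be located without scanning the whole WARC. The following is a minimal sketch of that idea, not the code in warcnode.js. It assumes a hypothetical index layout of `url,offset,length` per line; the columns actually written by warcindex may differ, and `loadIndex` and `readRecord` are illustrative names.

```javascript
const fs = require('fs');

// Hypothetical index layout: one "url,offset,length" line per record.
// The columns actually written by warcindex may differ.
function loadIndex(csvText) {
  const index = new Map();
  for (const line of csvText.split('\n')) {
    if (!line.trim()) continue; // skip blank lines
    // Take the two numeric fields from the right so that commas
    // inside the URL itself are preserved.
    const parts = line.trim().split(',');
    const length = Number(parts.pop());
    const offset = Number(parts.pop());
    index.set(parts.join(','), { offset, length });
  }
  return index;
}

// Read up to `length` bytes starting at `offset` from the WARC file.
function readRecord(warcPath, entry) {
  const fd = fs.openSync(warcPath, 'r');
  try {
    const buf = Buffer.alloc(entry.length);
    const bytesRead = fs.readSync(fd, buf, 0, entry.length, entry.offset);
    return buf.subarray(0, bytesRead);
  } finally {
    fs.closeSync(fd);
  }
}
```

Reading by offset only works on the decompressed file, which is one reason the gunzip step comes before indexing in this sketch's model of the workflow.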

Note

  • drupalib.interoperating.info.warc does not contain all the files that are linked in the HTML; notably, the /themes/ directory is absent. Requests for these files return 404 errors.
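
The 404 behaviour described in the note can be sketched as follows. This is an illustration only: it assumes a hypothetical in-memory `Map` from request path to response body, whereas warcnode.js presumably resolves requests against the CSV index. The names `respond` and `createServer` are made up for this example.

```javascript
const http = require('http');

// Hypothetical in-memory view of the archive: request path -> response body.
function respond(records, path) {
  if (!records.has(path)) {
    // The resource was linked from a page but never captured in the WARC.
    return { status: 404, body: 'Not found in WARC: ' + path };
  }
  return { status: 200, body: records.get(path) };
}

// Wrap the lookup in a plain Node HTTP server (not started here).
function createServer(records) {
  return http.createServer((req, res) => {
    const { status, body } = respond(records, req.url);
    res.writeHead(status);
    res.end(body);
  });
}
```

With a map that holds only `/index.html`, a request for `/themes/style.css` gets a 404 while the page itself is served normally.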

TODO

  • Diagnose the problem that sometimes causes truncated HTML
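
The cause of the truncation is not known. One possible first check, sketched below under assumptions, is to compare the Content-Length declared in a stored HTTP response with the number of body bytes actually read from the record. `checkContentLength` is a hypothetical helper, it expects the raw HTTP response as a Buffer, and it does not handle chunked transfer encoding.

```javascript
// Split a raw HTTP response (as stored in a WARC response record) into
// headers and body, then compare the declared Content-Length with the
// number of body bytes actually present.
function checkContentLength(rawResponse) {
  const sep = rawResponse.indexOf('\r\n\r\n');
  if (sep === -1) {
    // No header/body separator found: the headers themselves are incomplete.
    return { declared: null, actual: 0, truncated: true };
  }
  const headerText = rawResponse.subarray(0, sep).toString('latin1');
  const body = rawResponse.subarray(sep + 4);
  const match = /^content-length:[ \t]*(\d+)[ \t]*$/im.exec(headerText);
  const declared = match ? Number(match[1]) : null;
  return {
    declared,
    actual: body.length,
    // Only flag truncation when a length was declared and not reached.
    truncated: declared !== null && body.length < declared,
  };
}
```

If the check reports truncation on bytes read from the WARC, the offset or length used for the read is a likely suspect; if it does not, the problem may lie in how the response is written to the client.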

Languages

Language:JavaScript 100.0%