oduwsdl / archivenow

A Tool To Push Web Resources Into Web Archives

Archive sites in addition to submitting URIs

machawk1 opened this issue

One of the use cases for https://github.com/webrecorder/warcit is to grab a site's contents using wget and then run the tool to create a WARC file from the local files. It would be useful for a tool called "archivenow" to do more than submit URIs and to perform some form of archiving itself.

I would like to propose replicating this model in the archivenow tool, but in a single command. For example, running archivenow --warc=news.warc --agent=wget --ia http://cnn.com would use wget to create a WARC of cnn.com, store it locally at news.warc, and also submit the URI to IA.
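As a rough sketch of what the wget step behind such a command might look like (not archivenow's actual implementation), the Python below builds a wget invocation using wget's own --warc-file, --no-warc-compression, and --page-requisites options. The function names build_wget_warc_command and archive_locally are hypothetical, and the sketch assumes wget is installed and on the PATH.

```python
import subprocess


def build_wget_warc_command(url, warc_path):
    """Return a wget argument list that writes a WARC for `url` (hypothetical helper)."""
    # wget appends ".warc" to the --warc-file value itself, so strip it if present.
    base = warc_path[:-len(".warc")] if warc_path.endswith(".warc") else warc_path
    return [
        "wget",
        "--page-requisites",      # also fetch embedded images, CSS, and scripts
        "--no-warc-compression",  # write news.warc rather than news.warc.gz
        f"--warc-file={base}",
        url,
    ]


def archive_locally(url, warc_path):
    """Run wget and return its exit status (hypothetical helper)."""
    # wget exits non-zero if any single resource fails, so the status is
    # returned to the caller instead of being raised as an exception.
    return subprocess.run(build_wget_warc_command(url, warc_path)).returncode
```

Calling archive_locally("http://cnn.com", "news.warc") would then cover the local half of the proposed command; submitting the URI to IA would remain a separate step.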

It would be really nice to have "archivenow" create WARCs locally, not just push URLs to other archives. It is like pushing URLs into a local archive in addition to the remote ones. I will definitely implement this as soon as I can.
Because this is written in Python, I would suggest using the "requests" module or any other Python module instead of "wget". What do you think?

You will need to chase down all of the embedded resources with requests. wget does this for you and has native support for WARC output. If there were a Python equivalent of @N0taN3rd's https://github.com/n0tan3rd/node-warc, that would work well, too.
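To illustrate what "chasing down" embedded resources would involve on the Python side, here is a minimal standard-library sketch that collects the URLs a page references through img, script, and link tags. The names EmbeddedResourceParser and embedded_resources are hypothetical, and the sketch deliberately ignores resources referenced from CSS or loaded by JavaScript, which is part of why wget is the easier route.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class EmbeddedResourceParser(HTMLParser):
    """Collect absolute URLs of resources embedded in an HTML page (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("img", "script") and attrs.get("src"):
            ref = attrs["src"]
        elif tag == "link" and attrs.get("href"):
            ref = attrs["href"]
        else:
            return
        # Resolve relative references against the page's own URL.
        self.resources.append(urljoin(self.base_url, ref))


def embedded_resources(html, base_url):
    """Return the embedded resource URLs found in `html`, in document order."""
    parser = EmbeddedResourceParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.resources
```

Each returned URL would still have to be fetched (for example with requests) and written out as a WARC record, which this sketch does not attempt.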

For the Python side of controlling Chrome without handling the raw websockets

I am closing this, as we have already added support for creating WARCs with Wget and Squidwarc.