oduwsdl / archivenow

A Tool To Push Web Resources Into Web Archives


Archive Web Site

sirinath opened this issue · comments

Can you add the ability to archive a complete web site, with:

  • spidering from a given directory to any depth or to a specified depth
  • following links outside the site up to a certain depth

Some files may be document files, such as DOC or PDF, that contain links.

Hi Sirinath,

Pushing a web site is kind of tricky because it requires ArchiveNow to:
(1) Download the site locally using some sort of crawler
(2) Extract all URIs of web pages in the site
(3) Push those URIs into archives

Although it is doable, it might result in sending too many requests to the archive.

Initially, what you can do is download the site into local WARC file(s) using Wget or, even better, Squidwarc, which can discover more resources after executing JS. Then, extract all URIs of web pages from the WARC file(s), and finally submit those URIs one by one to archives using ArchiveNow.
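A rough sketch of that pipeline, assuming the crawl has already produced a local `site.warc.gz` (e.g., via Wget's `--warc-file` option or Squidwarc) and using the warcio library to read it; the file name, the text/html filter, and the "ia" (Internet Archive) handler are placeholders:

```python
# Sketch: extract page URIs from a local WARC and push them with ArchiveNow.
# Assumes a WARC produced by Wget/Squidwarc exists at site.warc.gz (placeholder name).
from warcio.archiveiterator import ArchiveIterator  # pip install warcio
from archivenow import archivenow                   # pip install archivenow

uris = set()
with open('site.warc.gz', 'rb') as stream:
    for record in ArchiveIterator(stream):
        # Keep only HTTP response records for HTML pages.
        if record.rec_type != 'response' or record.http_headers is None:
            continue
        if 'text/html' in record.http_headers.get_header('Content-Type', ''):
            uris.add(record.rec_headers.get_header('WARC-Target-URI'))

for uri in sorted(uris):
    # Push each page to the Internet Archive ("ia"); other ArchiveNow
    # handlers can be passed the same way.
    print(archivenow.push(uri, 'ia'))
```

The same loop should work for WARCs written by Squidwarc, since only standard response records are read.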

This idea was mainly suggested by @machawk1.

Best,

Mohamed

I believe this could be done by providing an example integration handler with Scrapy that users can customise. It could even be hosted on Scrapinghub, where a simple job could do the pushing without having to run anything locally.
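Something along these lines could serve as that example: a minimal sketch of a Scrapy spider that crawls a site to a fixed depth and pushes each page it visits through ArchiveNow (the domain, depth limit, delay, and "ia" handler are placeholders):

```python
# Sketch of a Scrapy spider that pushes each crawled page URI to an archive
# via ArchiveNow. example.com, DEPTH_LIMIT, and the "ia" handler are placeholders.
import scrapy
from archivenow import archivenow


class ArchiveSiteSpider(scrapy.Spider):
    name = 'archive_site'
    allowed_domains = ['example.com']      # restrict the crawl to the site
    start_urls = ['http://example.com/']
    custom_settings = {
        'DEPTH_LIMIT': 3,                  # spider to a specified depth
        'DOWNLOAD_DELAY': 1,               # be polite to the origin server
    }

    def parse(self, response):
        # Push the current page to the Internet Archive handler.
        self.logger.info(archivenow.push(response.url, 'ia'))
        # Follow in-site links; Scrapy's offsite filtering drops external ones.
        for href in response.css('a::attr(href)').getall():
            yield response.follow(href, callback=self.parse)
```

Such a spider could be run locally with `scrapy runspider`, or deployed as a Scrapinghub job as suggested.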

As we discussed, @maturban, it may come down to:

  1. Discoverability of URIs that constitute a "complete website"
  2. The ability to surface additional URIs of embedded resources
  3. Mitigating the inevitable throttling that will occur when attempting to submit many URIs to archives at a reasonable pace (a rough pacing sketch follows this list).
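On point 3, whatever does the submitting would need to pace itself and back off when an archive pushes back. A minimal sketch of that pacing, with placeholder delays and URIs, and assuming (not guaranteed) that ArchiveNow reports failures as strings containing "Error":

```python
# Sketch: pace submissions and back off when an archive appears to throttle.
# The delays, URI list, and "ia" handler are placeholders; checking for
# "Error" in ArchiveNow's returned strings is an assumption, not a contract.
import time
from archivenow import archivenow

uris = ['http://example.com/', 'http://example.com/about']  # placeholder list
delay = 10                                  # base seconds between submissions

for uri in uris:
    result = archivenow.push(uri, 'ia')     # push to the Internet Archive
    print(result)
    if result and 'Error' in result[0]:     # assumed error signal: slow down
        delay = min(delay * 2, 300)         # simple capped exponential backoff
    time.sleep(delay)
```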

#2 would benefit from a browser-based system as you referenced with Squidwarc, but the overhead of generating WARCs from this content seems like an unnecessary burden for someone wanting to submit URIs.

I have not used Scrapy (as suggested by @sirinath) in a while, but its limited ability to render pages (with regard to JS) will likely hinder the completeness of the set of URIs, of individual pages, and thus of complete web sites.

I also recall there being policies from some archives as to what sort of content-types they retain, e.g., does IA allow submission of URIs of DOCs and PDFs?