oduwsdl / archivenow

A Tool To Push Web Resources Into Web Archives


Archive Web Site

sirinath opened this issue · comments

Can you add the ability to archive a complete web site, with:

  • spidering from a given directory to any depth or to a specified depth
  • following links outside the site up to a certain depth

Some files may be document files, such as DOC or PDF, that contain links.

Hi Sirinath,

Pushing a web site is kind of tricky because it requires ArchiveNow to:
(1) Download the site locally using some sort of crawler
(2) Extract all URIs of web pages in the site
(3) Push those URIs into archives

Although it is doable, it might result in sending too many requests to the archive.

Initially, what you can do is download the site into local WARC file(s) using Wget or, even better, Squidwarc, which can discover more resources after executing JS. Then, extract all URIs of web pages from the WARC file(s), and finally submit those URIs one by one to archives using ArchiveNow.
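A rough sketch of that pipeline, assuming the crawl has already produced a local `site.warc.gz` (e.g., via Wget's `--warc-file` option or Squidwarc) and using the warcio library to read it; the file name, the text/html filter, and the "ia" (Internet Archive) handler are placeholders:

```python
# Sketch: extract page URIs from a local WARC and push them with ArchiveNow.
# Assumes a WARC produced by Wget/Squidwarc exists at site.warc.gz (placeholder name).
from warcio.archiveiterator import ArchiveIterator  # pip install warcio
from archivenow import archivenow                   # pip install archivenow

uris = set()
with open('site.warc.gz', 'rb') as stream:
    for record in ArchiveIterator(stream):
        # Keep only HTTP response records for HTML pages.
        if record.rec_type != 'response' or record.http_headers is None:
            continue
        if 'text/html' in record.http_headers.get_header('Content-Type', ''):
            uris.add(record.rec_headers.get_header('WARC-Target-URI'))

for uri in sorted(uris):
    # Push each page to the Internet Archive ("ia"); other ArchiveNow
    # handlers can be passed the same way.
    print(archivenow.push(uri, 'ia'))
```

The same loop should work for WARCs written by Squidwarc, since only standard response records are read.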

This idea was mainly suggested by @machawk1.

Best,

Mohamed

I believe this could be done by providing an example integration handler with Scrapy that users can customise. It could even be hosted on Scrapinghub, where a simple job could do the pushing without having to run anything locally.
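Something along these lines could serve as that example: a minimal sketch of a Scrapy spider that crawls a site to a fixed depth and pushes each page it visits through ArchiveNow (the domain, depth limit, delay, and "ia" handler are placeholders):

```python
# Sketch of a Scrapy spider that pushes each crawled page URI to an archive
# via ArchiveNow. example.com, DEPTH_LIMIT, and the "ia" handler are placeholders.
import scrapy
from archivenow import archivenow


class ArchiveSiteSpider(scrapy.Spider):
    name = 'archive_site'
    allowed_domains = ['example.com']      # restrict the crawl to the site
    start_urls = ['http://example.com/']
    custom_settings = {
        'DEPTH_LIMIT': 3,                  # spider to a specified depth
        'DOWNLOAD_DELAY': 1,               # be polite to the origin server
    }

    def parse(self, response):
        # Push the current page to the Internet Archive handler.
        self.logger.info(archivenow.push(response.url, 'ia'))
        # Follow in-site links; Scrapy's offsite filtering drops external ones.
        for href in response.css('a::attr(href)').getall():
            yield response.follow(href, callback=self.parse)
```

Such a spider could be run locally with `scrapy runspider`, or deployed as a Scrapinghub job as suggested.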

As we discussed, @maturban, it may come down to:

  1. Discoverability of URIs that constitute a "complete website"
  2. The ability to surface additional URIs of embedded resources
  3. Mitigating the inevitable throttling that will occur when attempting to submit many URIs to archives at a reasonable pace (a rough pacing sketch follows this list).
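On point 3, whatever does the submitting would need to pace itself and back off when an archive pushes back. A minimal sketch of that pacing, with placeholder delays and URIs, and assuming (not guaranteed) that ArchiveNow reports failures as strings containing "Error":

```python
# Sketch: pace submissions and back off when an archive appears to throttle.
# The delays, URI list, and "ia" handler are placeholders; checking for
# "Error" in ArchiveNow's returned strings is an assumption, not a contract.
import time
from archivenow import archivenow

uris = ['http://example.com/', 'http://example.com/about']  # placeholder list
delay = 10                                  # base seconds between submissions

for uri in uris:
    result = archivenow.push(uri, 'ia')     # push to the Internet Archive
    print(result)
    if result and 'Error' in result[0]:     # assumed error signal: slow down
        delay = min(delay * 2, 300)         # simple capped exponential backoff
    time.sleep(delay)
```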

#2 would benefit from a browser-based system as you referenced with Squidwarc, but the overhead of generating WARCs from this content seems like an unnecessary burden for someone wanting to submit URIs.

I have not used Scrapy (as suggested by @sirinath) in a while, but its limited ability to render pages (with regard to JS) will likely hinder the completeness of the set of URIs, of individual pages, and thus of complete web sites.

I also recall there being policies from some archives as to what sort of content-types they retain, e.g., does IA allow submission of URIs of DOCs and PDFs?