oduwsdl / archivenow

A Tool To Push Web Resources Into Web Archives

Archiving resources with relative Content-Location

csarven opened this issue:

archivenow --ia https://www.w3.org/TR/webarch/
https://web.archive.orgOverview.html

See also the curl output for a resource that returns a Content-Location header:

curl -I https://www.w3.org/TR/webarch/
content-location: Overview.html

compared with a resource that doesn't:

curl -I http://csarven.ca/

So, when I do something like:

curl -ki 'https://web.archive.org/save/https://www.w3.org/TR/webarch/' -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:69.0) Gecko/20100101 Firefox/69.0' -H 'Accept: */*' -H 'Accept-Language: en-CA,en;q=0.7,en-US;q=0.3' --compressed -H 'Referer: https://localhost:8443/' -H 'Origin: https://localhost:8443' -H 'Connection: keep-alive' -H 'DNT: 1' -H 'Pragma: no-cache' -H 'Cache-Control: no-cache'

I get:

content-location: Overview.html

And that kind of breaks things for me, because I can't figure out the actual snapshot location from the headers. It's fine when a JS-enabled agent makes the request, because it eventually redirects, but that's not what I want: I'm making this call from a client-side application and only want to work with the headers (or whatever proper structured data is available), as opposed to scraping the page.

Compare that with, say:

curl -ki 'https://web.archive.org/save/http://csarven.ca/' -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:69.0) Gecko/20100101 Firefox/69.0' -H 'Accept: */*' -H 'Accept-Language: en-CA,en;q=0.7,en-US;q=0.3' --compressed -H 'Referer: https://localhost:8443/' -H 'Origin: https://localhost:8443' -H 'Connection: keep-alive' -H 'DNT: 1' -H 'Pragma: no-cache' -H 'Cache-Control: no-cache'

which gives a nice workable:

content-location: /web/20190708123256/http://csarven.ca/

Ideas?
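
The difference between the two responses above can be sketched in Python. This is only an illustration of the problem, not archivenow's actual code: the function name `snapshot_from_content_location` and the constant `WAYBACK_BASE` are hypothetical, and the 14-digit timestamp pattern is an assumption based on the examples in this thread. It accepts a Content-Location value only when it looks like a Wayback snapshot path, and it shows one plausible way the malformed `https://web.archive.orgOverview.html` output could arise from naive concatenation.

```python
import re
from typing import Optional

# Hypothetical constant: the Wayback Machine origin used in this thread.
WAYBACK_BASE = "https://web.archive.org"

# Assumption: snapshot paths look like /web/<14-digit timestamp>/<original URI>.
SNAPSHOT_PATH_RE = re.compile(r"/web/\d{14}/")


def snapshot_from_content_location(value: Optional[str]) -> Optional[str]:
    """Return an absolute snapshot URL, or None if the header is unusable."""
    if not value:
        return None
    # A relative value such as "Overview.html" comes from the archived
    # resource itself and says nothing about where the snapshot lives.
    if SNAPSHOT_PATH_RE.match(value) is None:
        return None
    return WAYBACK_BASE + value


# Naive concatenation reproduces the malformed URL from the first example.
naive = WAYBACK_BASE + "Overview.html"
print(naive)

print(snapshot_from_content_location("/web/20190708123256/http://csarven.ca/"))
print(snapshot_from_content_location("Overview.html"))
```

Returning `None` for the relative case lets the caller fall back to another strategy instead of emitting a broken URL.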

Hi Sarven,

This should be fixed now:

archivenow --ia https://www.w3.org/TR/webarch/
https://web.archive.org/web/20190708193152/https://www.w3.org/TR/webarch/

Also for the other URI:

archivenow --ia http://csarven.ca/
https://web.archive.org/web/20190708194236/http://csarven.ca/

You are right, JavaScript makes the request redirect, but I was able to extract the location of the snapshot from the returned HTML.

The GitHub and PyPI repos have been updated.

archivenow --version
ArchiveNow 2019.7.8.4.6.30

Please let me know if the issue is still occurring.

I can confirm that the update provides the snapshot URL.

I was hoping that scraping the HTML wouldn't be required. Relying on the redirUrl JavaScript line is fragile, but I guess that's the only way it will work until IA updates their Content-Location for this particular case.
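
For reference, a minimal sketch of what extracting the snapshot location from the `redirUrl` line might look like. The exact markup of the Internet Archive's response is not shown in this thread, so the line shape (`var redirUrl = "/web/<timestamp>/<uri>";`) is an assumption, and `snapshot_from_html`, `REDIR_URL_RE`, and `WAYBACK_BASE` are hypothetical names. This is not necessarily how archivenow implements it.

```python
import re
from typing import Optional

# Hypothetical constant: the Wayback Machine origin used in this thread.
WAYBACK_BASE = "https://web.archive.org"

# Assumed shape of the JavaScript line: redirUrl = "<path or URL>"
REDIR_URL_RE = re.compile(r"""redirUrl\s*=\s*["']([^"']+)["']""")


def snapshot_from_html(html: str) -> Optional[str]:
    """Pull the snapshot URL out of the returned HTML, or return None."""
    match = REDIR_URL_RE.search(html)
    if match is None:
        return None
    target = match.group(1)
    if target.startswith("/web/"):
        # Root-relative snapshot path: prefix the archive origin.
        return WAYBACK_BASE + target
    if target.startswith(WAYBACK_BASE + "/web/"):
        # Already an absolute snapshot URL.
        return target
    # Anything else is not recognizable as a snapshot location.
    return None


# Made-up sample input that follows the assumed line shape.
sample = 'var redirUrl = "/web/20190708193152/https://www.w3.org/TR/webarch/";'
print(snapshot_from_html(sample))
```

Because this depends on page markup rather than a header, any change to the page on the archive's side would make the function return `None`, which is the fragility noted above.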