Archiving resources with relative Content-Location
csarven opened this issue
archivenow --ia https://www.w3.org/TR/webarch/
https://web.archive.orgOverview.html
See also from curl, where a resource returns Content-Location:
curl -I https://www.w3.org/TR/webarch/
content-location: Overview.html
in comparison to the ones that don't:
curl -I http://csarven.ca/
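For reference, a relative Content-Location value such as Overview.html is meant to be resolved against the URL that was requested, not glued onto a host name. A minimal sketch of that resolution, using only the standard library (the function name `resolve_content_location` is made up for illustration):

```python
from urllib.parse import urljoin


def resolve_content_location(request_url: str, content_location: str) -> str:
    """Resolve a Content-Location value against the URL that was requested.

    Relative values are resolved against request_url; absolute values
    are returned unchanged by urljoin.
    """
    return urljoin(request_url, content_location)


if __name__ == "__main__":
    # The W3C resource from the curl -I call above.
    print(resolve_content_location("https://www.w3.org/TR/webarch/", "Overview.html"))
```

For the W3C example this gives https://www.w3.org/TR/webarch/Overview.html, which is the origin's own URL for the document and says nothing about where a snapshot lives.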
So, when I do something like:
curl -ki 'https://web.archive.org/save/https://www.w3.org/TR/webarch/' -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:69.0) Gecko/20100101 Firefox/69.0' -H 'Accept: */*' -H 'Accept-Language: en-CA,en;q=0.7,en-US;q=0.3' --compressed -H 'Referer: https://localhost:8443/' -H 'Origin: https://localhost:8443' -H 'Connection: keep-alive' -H 'DNT: 1' -H 'Pragma: no-cache' -H 'Cache-Control: no-cache'
I get:
content-location: Overview.html
And that kind of screws things up for me because I can't figure out the actual snapshot location from the headers. That's okay if a JS-enabled agent is making the request, because it eventually redirects, but that's not what I want: I'm making this call from a client-side application and only want to work with headers (or whatever proper structured data is available), as opposed to scraping stuff.
This is in comparison to say:
curl -ki 'https://web.archive.org/save/http://csarven.ca/' -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:69.0) Gecko/20100101 Firefox/69.0' -H 'Accept: */*' -H 'Accept-Language: en-CA,en;q=0.7,en-US;q=0.3' --compressed -H 'Referer: https://localhost:8443/' -H 'Origin: https://localhost:8443' -H 'Connection: keep-alive' -H 'DNT: 1' -H 'Pragma: no-cache' -H 'Cache-Control: no-cache'
which gives a nice workable:
content-location: /web/20190708123256/http://csarven.ca/
Ideas?
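To make the difference between the two responses concrete, here is a rough sketch of a header-only check. It assumes, based only on the second response above, that a usable value looks like /web/ followed by a 14-digit timestamp and the original URL; the names `WAYBACK_ORIGIN`, `SNAPSHOT_PATH` and `snapshot_url_from_header` are hypothetical and not part of archivenow or any Internet Archive API.

```python
import re
from typing import Optional

# Assumption: snapshots are served from this origin, as in the examples above.
WAYBACK_ORIGIN = "https://web.archive.org"

# Assumption: a usable header value is "/web/<14-digit timestamp>/<original URL>".
SNAPSHOT_PATH = re.compile(r"^/web/\d{14}/.+")


def snapshot_url_from_header(content_location: Optional[str]) -> Optional[str]:
    """Return the full snapshot URL, or None if the header is not a snapshot path."""
    if content_location and SNAPSHOT_PATH.match(content_location):
        return WAYBACK_ORIGIN + content_location
    # Values such as "Overview.html" carry no snapshot information.
    return None


if __name__ == "__main__":
    print(snapshot_url_from_header("/web/20190708123256/http://csarven.ca/"))
    print(snapshot_url_from_header("Overview.html"))
```

Returning None for the relative value avoids building a broken URL like the https://web.archive.orgOverview.html shown at the top, and lets the caller decide on a fallback.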
Hi Sarven,
This should be fixed now:
archivenow --ia https://www.w3.org/TR/webarch/
https://web.archive.org/web/20190708193152/https://www.w3.org/TR/webarch/
Also for the other URI:
archivenow --ia http://csarven.ca/
https://web.archive.org/web/20190708194236/http://csarven.ca/
You are right that JavaScript makes the request redirect, but I was able to extract the location of the snapshot from the returned HTML.
The GitHub and PyPI repos have been updated.
archivenow --version
ArchiveNow 2019.7.8.4.6.30
Please, let me know if the issue is still occurring.
I can confirm that the update provides the snapshot URL.
I was hoping that scraping the HTML wouldn't be required. Using the redirUrl JavaScript line is fragile but I guess that's the only way it will work until IA updates their Content-Location for this particular case.
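For anyone landing here with the same problem, a rough sketch of the HTML fallback follows. It is not archivenow's actual code, and it does not rely on the exact shape of the redirUrl line (which is not shown in this thread). It only assumes that the response body contains the snapshot path as a literal /web/<14-digit timestamp>/<original URL> string; the function name `find_snapshot_in_html` is hypothetical.

```python
import re
from typing import Optional

# Assumption: snapshots are served from this origin, as in the examples above.
WAYBACK_ORIGIN = "https://web.archive.org"


def find_snapshot_in_html(html: str, original_url: str) -> Optional[str]:
    """Search a save-page response body for a snapshot path of original_url.

    Assumes the body embeds "/web/<14-digit timestamp>/<original_url>"
    somewhere as plain text; returns None when no such string is found.
    """
    pattern = re.compile(r"/web/\d{14}/" + re.escape(original_url))
    match = pattern.search(html)
    if match is None:
        return None
    return WAYBACK_ORIGIN + match.group(0)
```

A caller could try the Content-Location header first and only fall back to this when the header is not a /web/... path, so the scraping stays confined to the one case that needs it.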