oduwsdl / archivenow

A Tool To Push Web Resources Into Web Archives

Archive images in IA

jc86035 opened this issue

It would be nice if the tool would also archive embedded content for Internet Archive requests. This could be done by downloading the archived page and searching for any /save/_embed/[^"'<>\(\)]* URLs in the page source.
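For illustration, here is a minimal Python sketch of that search, assuming the body of the save response is already available as a string. The function name `extract_embed_urls` and the `base` parameter are hypothetical and not part of ArchiveNow. The pattern is the one proposed above; unescaping `&amp;` is an added assumption about how the URLs appear inside HTML attributes.

```python
import html
import re

# Pattern from the proposal above: everything after /save/_embed/ up to a
# quote, angle bracket, or parenthesis.
EMBED_RE = re.compile(r"""/save/_embed/[^"'<>()]*""")


def extract_embed_urls(page_source, base="https://web.archive.org"):
    """Return absolute, de-duplicated /save/_embed/ URLs in document order."""
    seen = set()
    urls = []
    for match in EMBED_RE.findall(page_source):
        # HTML source carries &amp; where the real URL has a bare &.
        url = base + html.unescape(match)
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


# Made-up sample input for the demo, not a real save response.
sample = (
    '<link href="/save/_embed/https://en.wikipedia.org/w/load.php?a=1&amp;b=2">'
    '<img src="/save/_embed/https://en.wikipedia.org/static/favicon/wikipedia.ico">'
    '<img src="/save/_embed/https://en.wikipedia.org/static/favicon/wikipedia.ico">'
)
for url in extract_embed_urls(sample):
    print(url)
```

Each returned URL could then be requested once to trigger the capture; whether that is enough for every kind of embedded resource is not established here.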

(It would also be nice if the tool could download lazy-loaded files and/or any linked media files, although even the Wayback Machine can't really do that in many cases.)

I have seen those links to embedded resources before. The Internet Archive archives those resources when the archived page is reloaded. In other cases, even though the links appear in the returned HTML, the Internet Archive has already queued them for crawling, so the next time the page is reloaded or revisited, those resources will already be captured and served from the archive. So I don't think there is a need to add this feature to ArchiveNow right now.

The other issue is that some links to web resources (e.g., images) are generated by JavaScript and are totally different each time you reload the archived page. For example, each time you reload http://web.archive.org/web/20190107013706/http://ws-dl.blogspot.com in the browser, you will get unique links that have not been archived yet, again, because those links contain unique random values (e.g., the link that ends with slideshare.net/fizzy/admin?...).

@maturban, why did you close this?

The issue with not downloading the images is that by the time someone actually opens the page (which may not happen if the user just assumes the archive is fine), the embedded content may already have been modified or have disappeared. Some images may change daily or even more frequently.

I have personally archived more than 107 pages to IA using similar (but less sophisticated) methods, but have had to either download every page and then separately archive all the images, or not archive the images at all. I've written a script which saves embedded content, but it's basically just cd tmp; cat $1 | xargs -P 5 wget --spider --retry-connrefused ; grep -hro [...] tmp | awk '!seen[$0]++' | xargs -P 5 wget [...]. I also wrote a different script for YouTube because of its lazy loading – in some cases (e.g. Buzzfeed) it could be beneficial to save images that Wayback doesn't know how to archive or how to display.

Thanks for the information. Could you please give one or more examples of pages with images for which the response returned by the Internet Archive contains URIs with .../save/_embed/...?

One solution to this issue would be to load the returned URI-M in a headless web browser, which will automatically trigger requests to archive all the embedded resources.
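A rough Python sketch of the headless-browser idea, assuming the third-party Selenium package and a local Chrome installation. The helper names `load_in_browser` and `make_headless_chrome` and the fixed `settle_seconds` pause are hypothetical choices for illustration, not ArchiveNow code. It relies on the behaviour described above, namely that loading the URI-M makes the browser request the embedded resources.

```python
import time


def load_in_browser(driver, urim, settle_seconds=10.0):
    """Open the URI-M so the browser requests the page's embedded resources."""
    driver.get(urim)
    # Assumption: a fixed pause gives images, styles, and scripts time to load.
    time.sleep(settle_seconds)
    return driver.title


def make_headless_chrome():
    """Build a headless Chrome driver (needs selenium and Chrome installed)."""
    from selenium import webdriver  # third-party, imported only when needed

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    return webdriver.Chrome(options=options)
```

Typical use would be to call `make_headless_chrome()`, pass the driver and the URI-M returned by the archive to `load_in_browser`, and then call `driver.quit()`. Passing the driver in as an argument keeps the loading step easy to exercise with a stand-in object when no browser is available.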

the returned response may have URIs with .../save/_embed/...

I had thought this was always the case for pages being saved (.../save/, but not .../web/) with embedded content from another address used through the src HTML attribute.

$ curl -s "https://web.archive.org/save/https://en.wikipedia.org/wiki/Main_Page" | grep -o '/save/_embed/[^"<>()]*'

/save/_embed/https://en.wikipedia.org/w/load.php?debug=false&amp;lang=en&amp;modules=ext.3d.styles%7Cext.uls.interlanguage%7Cext.visualEditor.desktopArticleTarget.noscript%7Cext.wikimediaBadges%7Cmediawiki.legacy.commonPrint%2Cshared%7Cmediawiki.skinning.interface%7Cskins.vector.styles&amp;only=styles&amp;skin=vector
/save/_embed/https://en.wikipedia.org/w/load.php?debug=false&amp;lang=en&amp;modules=startup&amp;only=scripts&amp;skin=vector
/save/_embed/https://en.wikipedia.org/w/load.php?debug=false&amp;lang=en&amp;modules=ext.gadget.charinsert-styles&amp;only=styles&amp;skin=vector
/save/_embed/https://en.wikipedia.org/w/load.php?debug=false&amp;lang=en&amp;modules=site.styles&amp;only=styles&amp;skin=vector
/save/_embed/https://en.wikipedia.org/static/favicon/wikipedia.ico
/save/_embed/https://upload.wikimedia.org/wikipedia/commons/thumb/c/cf/Corporation_of_London_Records_Office%2C_Plea_and_Memoranda_Roll_A34%2C_m.2_%281395%29.jpg/120px-Corporation_of_London_Records_Office%2C_Plea_and_Memoranda_Roll_A34%2C_m.2_%281395%29.jpg
/save/_embed/https://upload.wikimedia.org/wikipedia/commons/thumb/7/79/Sunset_Parade_-_US_Marin_Corps.jpg/180px-Sunset_Parade_-_US_Marin_Corps.jpg
/save/_embed/https://upload.wikimedia.org/wikipedia/commons/thumb/0/0b/Batholomew_handing_tomos_to_Epiphanius.jpg/162px-Batholomew_handing_tomos_to_Epiphanius.jpg
/save/_embed/https://upload.wikimedia.org/wikipedia/commons/thumb/c/c8/Constructing_the_Metropolitan_Railway.png/174px-Constructing_the_Metropolitan_Railway.png
/save/_embed/https://upload.wikimedia.org/wikipedia/commons/thumb/b/b8/John_Quidor_-_The_Headless_Horseman_Pursuing_Ichabod_Crane_-_Google_Art_Project.jpg/550px-John_Quidor_-_The_Headless_Horseman_Pursuing_Ichabod_Crane_-_Google_Art_Project.jpg
/save/_embed/https://upload.wikimedia.org/wikipedia/en/thumb/4/4a/Commons-logo.svg/31px-Commons-logo.svg.png
/save/_embed/https://upload.wikimedia.org/wikipedia/commons/thumb/3/3d/Mediawiki-logo.png/35px-Mediawiki-logo.png
/save/_embed/https://upload.wikimedia.org/wikipedia/commons/thumb/7/75/Wikimedia_Community_Logo.svg/35px-Wikimedia_Community_Logo.svg.png
/save/_embed/https://upload.wikimedia.org/wikipedia/commons/thumb/f/fa/Wikibooks-logo.svg/35px-Wikibooks-logo.svg.png
/save/_embed/https://upload.wikimedia.org/wikipedia/commons/thumb/f/ff/Wikidata-logo.svg/47px-Wikidata-logo.svg.png
/save/_embed/https://upload.wikimedia.org/wikipedia/commons/thumb/2/24/Wikinews-logo.svg/51px-Wikinews-logo.svg.png
/save/_embed/https://upload.wikimedia.org/wikipedia/commons/thumb/f/fa/Wikiquote-logo.svg/35px-Wikiquote-logo.svg.png
/save/_embed/https://upload.wikimedia.org/wikipedia/commons/thumb/4/4c/Wikisource-logo.svg/35px-Wikisource-logo.svg.png
/save/_embed/https://upload.wikimedia.org/wikipedia/commons/thumb/d/df/Wikispecies-logo.svg/35px-Wikispecies-logo.svg.png
/save/_embed/https://upload.wikimedia.org/wikipedia/commons/thumb/0/0b/Wikiversity_logo_2017.svg/41px-Wikiversity_logo_2017.svg.png
/save/_embed/https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Wikivoyage-Logo-v3-icon.svg/35px-Wikivoyage-Logo-v3-icon.svg.png
/save/_embed/https://upload.wikimedia.org/wikipedia/en/thumb/0/06/Wiktionary-logo-v2.svg/35px-Wiktionary-logo-v2.svg.png
/save/_embed/https://en.wikipedia.org/wiki/Special:CentralAutoLogin/start?type=1x1
/save/_embed/https://en.wikipedia.org/static/images/wikimedia-button.png
/save/_embed/https://en.wikipedia.org/static/images/poweredby_mediawiki_88x31.png

I think this is not an issue. These /save/_embed URIs are temporary and go away in the next few seconds/minutes. I archived a Wikipedia page and saved the immediate response into a file locally. This file contained a handful of embed URIs. Then, within a minute, I downloaded the recently archived memento using cURL (not a web browser, to avoid any implicit save requests) and found no embed URIs in it. This means that while other resources are still in the frontier queue at the time of the immediate response, the server rewrites those links differently, but in the next few minutes those queued resources should be archived.

but in the next few minutes those queued resources should be archived

Does it work like that? I don't think the server does that. I thought it just did redirect magic with /web/ URLs so that all of the links work.

curl -s "https://web.archive.org/web/20190110133207/https://en.wikipedia.org/wiki/Main_Page" | grep -o '/web/[^"<>()]*\.jpg'

/web/20190110133207im_/https://upload.wikimedia.org/wikipedia/commons/c/cf/Corporation_of_London_Records_Office%2C_Plea_and_Memoranda_Roll_A34%2C_m.2_%281395%29.jpg
/web/20190110133207im_/https://upload.wikimedia.org/wikipedia/commons/thumb/c/cf/Corporation_of_London_Records_Office%2C_Plea_and_Memoranda_Roll_A34%2C_m.2_%281395%29.jpg/120px-Corporation_of_London_Records_Office%2C_Plea_and_Memoranda_Roll_A34%2C_m.2_%281395%29.jpg
/web/20190110133207im_/https://upload.wikimedia.org/wikipedia/commons/thumb/c/cf/Corporation_of_London_Records_Office%2C_Plea_and_Memoranda_Roll_A34%2C_m.2_%281395%29.jpg/180px-Corporation_of_London_Records_Office%2C_Plea_and_Memoranda_Roll_A34%2C_m.2_%281395%29.jpg 1.5x, //web.archive.org/web/20190110133207im_/https://upload.wikimedia.org/wikipedia/commons/thumb/c/cf/Corporation_of_London_Records_Office%2C_Plea_and_Memoranda_Roll_A34%2C_m.2_%281395%29.jpg/240px-Corporation_of_London_Records_Office%2C_Plea_and_Memoranda_Roll_A34%2C_m.2_%281395%29.jpg
/web/20190110133207/https://en.wikipedia.org/wiki/File:Sunset_Parade_-_US_Marin_Corps.jpg
/web/20190110133207im_/https://upload.wikimedia.org/wikipedia/commons/thumb/7/79/Sunset_Parade_-_US_Marin_Corps.jpg/180px-Sunset_Parade_-_US_Marin_Corps.jpg
/web/20190110133207im_/https://upload.wikimedia.org/wikipedia/commons/thumb/7/79/Sunset_Parade_-_US_Marin_Corps.jpg/270px-Sunset_Parade_-_US_Marin_Corps.jpg 1.5x, //web.archive.org/web/20190110133207im_/https://upload.wikimedia.org/wikipedia/commons/thumb/7/79/Sunset_Parade_-_US_Marin_Corps.jpg/360px-Sunset_Parade_-_US_Marin_Corps.jpg
/web/20190110133207/https://en.wikipedia.org/wiki/File:Batholomew_handing_tomos_to_Epiphanius.jpg
/web/20190110133207im_/https://upload.wikimedia.org/wikipedia/commons/thumb/0/0b/Batholomew_handing_tomos_to_Epiphanius.jpg/162px-Batholomew_handing_tomos_to_Epiphanius.jpg
/web/20190110133207im_/https://upload.wikimedia.org/wikipedia/commons/thumb/0/0b/Batholomew_handing_tomos_to_Epiphanius.jpg/243px-Batholomew_handing_tomos_to_Epiphanius.jpg 1.5x, //web.archive.org/web/20190110133207im_/https://upload.wikimedia.org/wikipedia/commons/thumb/0/0b/Batholomew_handing_tomos_to_Epiphanius.jpg/324px-Batholomew_handing_tomos_to_Epiphanius.jpg
/web/20190110133207/https://en.wikipedia.org/wiki/File:John_Quidor_-_The_Headless_Horseman_Pursuing_Ichabod_Crane_-_Google_Art_Project.jpg
/web/20190110133207im_/https://upload.wikimedia.org/wikipedia/commons/thumb/b/b8/John_Quidor_-_The_Headless_Horseman_Pursuing_Ichabod_Crane_-_Google_Art_Project.jpg/550px-John_Quidor_-_The_Headless_Horseman_Pursuing_Ichabod_Crane_-_Google_Art_Project.jpg
/web/20190110133207im_/https://upload.wikimedia.org/wikipedia/commons/thumb/b/b8/John_Quidor_-_The_Headless_Horseman_Pursuing_Ichabod_Crane_-_Google_Art_Project.jpg/825px-John_Quidor_-_The_Headless_Horseman_Pursuing_Ichabod_Crane_-_Google_Art_Project.jpg 1.5x, //web.archive.org/web/20190110133207im_/https://upload.wikimedia.org/wikipedia/commons/thumb/b/b8/John_Quidor_-_The_Headless_Horseman_Pursuing_Ichabod_Crane_-_Google_Art_Project.jpg/1100px-John_Quidor_-_The_Headless_Horseman_Pursuing_Ichabod_Crane_-_Google_Art_Project.jpg

Most of the images shown on the page were archived 35 minutes before that capture.

I've tried using an odd image size for an example image in my Wikipedia sandbox. MediaWiki generates scaled thumbnails from images originally uploaded to the server by users, so it's very likely that the image was never rendered until a few minutes ago.

curl -s "https://web.archive.org/save/https://en.wikipedia.org/wiki/User:Jc86035/sandbox3"
curl -s "https://web.archive.org/web/20190110140927/https://en.wikipedia.org/wiki/User:Jc86035/sandbox3" | grep -o '/web/[^"<>()]*\.png'

/web/20190110140927im_/https://en.wikipedia.org/static/apple-touch/wikipedia.png
/web/20190110140927im_/https://upload.wikimedia.org/wikipedia/commons/thumb/8/84/Example.svg/277px-Example.svg.png
/web/20190110140927im_/https://upload.wikimedia.org/wikipedia/commons/thumb/8/84/Example.svg/416px-Example.svg.png 1.5x, //web.archive.org/web/2/https://upload.wikimedia.org/wikipedia/commons/thumb/8/84/Example.svg/554px-Example.svg.png
/web/20190110140927im_/https://en.wikipedia.org/static/images/wikimedia-button.png
/web/20190110140927im_/https://en.wikipedia.org/static/images/wikimedia-button-1.5x.png 1.5x, /web/20190110140927im_/https://en.wikipedia.org/static/images/wikimedia-button-2x.png
/web/20190110140927im_/https://en.wikipedia.org/static/images/poweredby_mediawiki_88x31.png
/web/20190110140927im_/https://en.wikipedia.org/static/images/poweredby_mediawiki_132x47.png 1.5x, /web/20190110140927im_/https://en.wikipedia.org/static/images/poweredby_mediawiki_176x62.png

The Example.svg.png image links have not been saved yet (277px · 416px · 554px); thus the absence of _embed URLs does not indicate that the Internet Archive has saved the linked embedded content.
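One way to make this check repeatable is sketched below in Python, under stated assumptions. The helper names are hypothetical. `split_memento_link` takes a single rewritten link in either of the two forms seen in the output above (`/web/<timestamp>im_/...` or `//web.archive.org/web/<timestamp>/...`). `availability_query` builds a request URL for the Wayback Machine availability API, and `is_captured` reads the JSON that API returns. The two JSON strings in the demo are made-up samples, not real responses.

```python
import json
import re
from urllib.parse import urlencode

# /web/<timestamp><optional two-letter modifier such as im_>/<original URL>
MEMENTO_RE = re.compile(
    r"^(?:https?:)?(?://web\.archive\.org)?/web/(\d+)(?:[a-z]{2}_)?/(.+)$"
)


def split_memento_link(link):
    """Return (timestamp, original_url) for a rewritten Wayback link."""
    m = MEMENTO_RE.match(link)
    if m is None:
        raise ValueError("not a /web/ link: %r" % link)
    return m.group(1), m.group(2)


def availability_query(original_url, timestamp):
    """Build an availability API request URL for one embedded resource."""
    params = urlencode({"url": original_url, "timestamp": timestamp})
    return "https://archive.org/wayback/available?" + params


def is_captured(response_text):
    """True if an availability API response reports a closest snapshot."""
    snapshots = json.loads(response_text).get("archived_snapshots", {})
    closest = snapshots.get("closest")
    return bool(closest and closest.get("available"))


link = ("/web/20190110140927im_/https://upload.wikimedia.org/wikipedia/"
        "commons/thumb/8/84/Example.svg/277px-Example.svg.png")
timestamp, original = split_memento_link(link)
print(availability_query(original, timestamp))

# Made-up sample responses: no snapshot at all, and one available snapshot.
print(is_captured('{"archived_snapshots": {}}'))
print(is_captured('{"archived_snapshots": {"closest": {"available": true}}}'))
```

Note that the API reports the closest snapshot to the requested timestamp, so a positive answer may refer to a capture made at a different time than the page itself.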

Fair enough! If that's the approach they are taking, then a headless browser seems to be the way to go.