ipfs-inactive / archives

[ARCHIVED] Repo to coordinate archival efforts with IPFS

Home Page: https://awesome.ipfs.io/datasets


Xkcd

fazo96 opened this issue · comments

I plan to archive all the comics from http://xkcd.com/

I think I'll use (comicnumber)-(comictitle).png for the image and figure out how to save the alt text in the PNG metadata

Please post if you want to keep a copy of the archive or you manage to create it before I do :)

@fazo96 how are you going to manage the comics that are dynamic or contain multiple sequential images? Or the map ones that have a larger version available on click?

I would loooooove to have this. But we should make sure that Randall is okay with it first, I'm not sure if there are any sort of copyrights involved here. (too bad he doesn't use GitHub, or we could just ping him)

I would loooooove to have this. But we should make sure that Randall is okay with it first, I'm not sure if there are any sort of copyrights involved here.

absolutely. thanks for saying this.

looks like everything is released under CC BY-NC

(too bad he doesn't use GitHub, or we could just ping him)

i'm sure he has an account. just have to find it \o/

@fazo96

  • would be great to include a web viewer with the archive.
  • maybe make a dir for every comic
  • put the image in both image.png and <original-img-filename> so that we respect his filenames too, but also make them predictably linked
  • put the alt text in a file, like alt.txt

(Alternatively, mirror the RSS feed?)

The title and the alt text could be stored in the PNG metadata. You can use ImageMagick for this.
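For illustration, here's what that metadata write looks like at the byte level — a sketch in Node rather than the tool the comment suggests. PNG metadata lives in tEXt chunks (big-endian length, 4-byte type, data, then a CRC-32 of type plus data), and a new chunk can be spliced in just before the trailing IEND chunk:

```javascript
// Sketch (not from the thread): embed alt text as a PNG tEXt chunk in
// pure Node, as an alternative to shelling out to ImageMagick.

// Standard CRC-32 (polynomial 0xEDB88320), as required by the PNG spec.
function crc32(buf) {
  let crc = 0xffffffff;
  for (const byte of buf) {
    crc ^= byte;
    for (let i = 0; i < 8; i++) {
      crc = (crc >>> 1) ^ (0xedb88320 & -(crc & 1));
    }
  }
  return (crc ^ 0xffffffff) >>> 0;
}

// Return a copy of `png` (a Buffer) with a tEXt chunk added before IEND.
function addTextChunk(png, keyword, text) {
  const data = Buffer.from(`${keyword}\0${text}`, "latin1");
  const type = Buffer.from("tEXt", "latin1");
  const len = Buffer.alloc(4);
  len.writeUInt32BE(data.length);
  const crc = Buffer.alloc(4);
  crc.writeUInt32BE(crc32(Buffer.concat([type, data])));
  const chunk = Buffer.concat([len, type, data, crc]);
  const iend = png.length - 12; // the IEND chunk is always the final 12 bytes
  return Buffer.concat([png.subarray(0, iend), chunk, png.subarray(iend)]);
}

// Usage (real run):
// const fs = require("fs");
// fs.writeFileSync("614.png",
//   addTextChunk(fs.readFileSync("614.png"), "alt", altText));
```

ImageMagick or exiftool accomplish the same thing without custom code; this just shows the chunk format is simple enough to handle directly.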

Looks like the license is not an issue as long as we credit Randall and include a copy of the license.

Also:

  • Storing title and alt-text in the png metadata looks like the way to go!
  • @whyrusleeping As for the unconventional comics, we'll figure out a solution for each one
  • @jbenet a viewer would be great, but at this point, what do you guys think about including the entire website?

Uhm, I just found this on the About page of xkcd.com:

Is there an interface for automated systems to access comics and metadata?
Yes. You can get comics through the JSON interface, at URLs like http://xkcd.com/info.0.json (current comic) and http://xkcd.com/614/info.0.json (comic #614).

Getting the data will be a lot easier this way (no HTML parsing involved)
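A minimal sketch of consuming that interface (the field names num, safe_title, and alt come from the JSON the endpoint actually serves; the sample object is inlined and abbreviated so the example works offline):

```javascript
// Build the "(comicnumber)-(comictitle).png" filename from the JSON
// interface's fields, no HTML parsing needed.
function comicFilename(info) {
  // safe_title avoids characters that are awkward in filenames
  return `${info.num}-${info.safe_title}.png`;
}

// In a real run this object would come from fetching
// https://xkcd.com/614/info.0.json; abbreviated sample shown here.
const sample = {
  num: 614,
  safe_title: "Woodpecker",
  alt: "(hover text goes here)", // placeholder, not the real alt text
};

console.log(comicFilename(sample)); // 614-Woodpecker.png
```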

EDIT:

I wrote a node script that downloads and organizes data from xkcd.com and it worked!

I created a partial copy of xkcd.com to see if you like the setup (so that we can create a full copy later). I included Randall's about and license pages and my script in the folder 👍

You can check it out here: QmSeYATNaa2fSR3eMqRD8uXwujVLT2JU9wQvSjCd1Rf8pZ

I'm thinking about writing a simple index.html to include in every comic's folder so that the alt text, image (and transcript) can all be viewed comfortably in the same browser tab

👍

a viewer would be great, but at this point, what do you guys think about including the entire website?

I think #7 is also quite relevant here.

I completed the archive (every image file and more is now available via IPFS); it just needs a viewer and probably a better folder structure.

Here you go: QmPVP4sDre9rtYahGvcjv3Fqet3oQyqrH5xS33d4YBVFme


@fazo96 👍

Can we zero pad the numbers on the next pass? :)

@cryptix yeah I figured it was necessary :) if you'd like a try, the script I used to generate the directory tree is included in the directory. It's named xkcd-downloader.js

If I have time I'll implement it
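For reference, the padding itself is a one-liner in the script's language (a sketch, not the script's actual code; the width of 4 is an assumption, chosen to cover comics up to #9999):

```javascript
// Zero-pad comic numbers so directory listings sort correctly.
function padComic(num, width = 4) {
  return String(num).padStart(width, "0");
}

console.log(padComic(1));    // 0001
console.log(padComic(1719)); // 1719
```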

I have scraped the entirety of xkcd.com and some of its subdomains (apparently cross-subdomain interlinking didn't work); the result is a well-functioning copy, available at the end of this comment.
EDIT:
Instructions for updating the archive:

  1. Download and install HTTrack (Windows/Linux/OSX).
  2. run httrack xkcd.com -d -%F "" -%N1 -n +*.css +*.js +*.png +*.jpg +*.jpeg +*.gif -*.pdf -O $mirror,$cache (or httrack xkcd.com what-if.xkcd.com ... to archive what-if as well)
  3. The command should be done within a few hours on a decent link.
  4. There may be some .delayed files in imgs.xkcd.com/comics; they contain proper data but have an invalid name. I have no clean solution, so use this command to fix it up:
    cd $mirror/imgs.xkcd.com/comics && ls -1 | awk -F. '/delayed/ {print $0 " " $1".png"}' | xargs -n 2 mv
  5. ipfs add -r $mirror
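The awk/xargs rename in step 4 boils down to the following logic — a sketch mirroring the command, which (like the original) assumes the delayed images are PNGs:

```javascript
// HTTrack leaves files like "woodpecker.png.delayed"; the fix keeps
// everything before the first dot and appends ".png", matching the
// awk -F. '$1".png"' behavior of the command above.
function fixDelayedName(name) {
  return name.includes("delayed") ? name.split(".")[0] + ".png" : name;
}

console.log(fixDelayedName("woodpecker.png.delayed")); // woodpecker.png
console.log(fixDelayedName("woodpecker.png"));         // woodpecker.png (unchanged)
```

In a real run you'd fs.readdirSync the comics directory and fs.renameSync each name the function changes.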

Switch explanation:

  • -d - Allow mirroring of subdomains (edit: Doesn't seem to work for some reason.)
  • -%F "" - Disable footer text (by default including timestamp), allowing deduplication of HTML across updates.
  • -%N1 - Untested, but should fix the 'delayed' files for known file extensions.
  • -n - Archive resources "near" an HTML file (scripts, CSS, images)
  • +*.css +*.js +*.png +*.jpg +*.jpeg +*.gif - Also archive all css, js, and images seen outside of HTML (included from JS or CSS, for example)
  • -*.pdf - Don't download external PDFs (when archiving what-if.xkcd.com)
  • -O $mirror,$cache - The resulting webpage is put into $mirror, while httrack runtime info, logs, caches are put into $cache.
  • (optional) -%v2 - Add a progress and statistics display during crawl.

Archiving notes:

  • While HTTrack supports an --update switch, it's broken if the -%F option has an empty argument, so we need to re-crawl the site completely to update.
  • I don't recommend archiving what-if.xkcd.com using the command above, as for some reason, the crawler enters Wikipedia and downloads way too much.
  • TODO: Check how well m.xkcd.com archives
  • TODO: Archive "Hoverboard" game/comic (+ other interactive, if sensible)

Archive links (newest to oldest): (My IPNS entry might be more up to date)

Date Last Comic Size Hash Notes
2016-09-30 1740 169MB QmNogExCdnMJwWE1bpEweMUQyo3X2LP6tuWVvmLYJxUc6o
2016-09-28 1739 169MB QmTGXzCqJNRpKVWmt84oQFmHLiSPQ43JLMpshY95Xkfy1N
2016-09-26 1738 169MB Qmam87KnuC93dVF2PidnDh1KpH8U1V3osWx41tkYCQfont Includes uncorrected 1738: Moon Shapes image
2016-09-21 1736 169MB QmZR6JT1nnNdcBcPjnA4GfT3uqRsHmYrg2fKWrT2BEiTmk
2016-09-21 1735 168MB QmRtXAxyXHWA5krxXMrRJHKJ5qFYXpsz48htquiHp9KbUs
2016-09-15 1733 167MB QmTfagPa7QTtpcZVLYSsKBNMSZz4ytSwStAsNX5mJXhyEF
2016-09-12 1732 167MB QmWoJ5aLozwkNuPQh7RSX7RCn5eXLRwSQezxNBNKsWxsc2
2016-09-09 1731 166MB QmZLdQQJHMCZFZ8jVSSwpmeGfHw6fF2V2QBzo2SJVerWHN
2016-09-05 1729 166MB QmXzDGjRT7McpuLHfRP42ST6bZbjX2KvDGxAZ68gXFdbBz
2016-09-02 1728 166MB Qmc5MG1kL2rR5PNVqr7uqZKAi7g7FcRgvf8mPxtWhb3tNp
2016-09-01 1727 166MB QmTz7tvjVCYz5GPN3YZYNQHadbrSmSzCLp4RWKrh664pJL
2016-08-29 1726 166MB QmRZGA4dMVn13acXQsQL32c8ANcpLSsuTMkx5t98y5oeJL
2016-08-26 1725 165MB QmbLgCaps5oiEh1KSBcnAXAado3tVmpocqsgshVrU2jLoR
2016-08-24 1724 166MB QmNvQQwupNbfUkkTvGSxSyoDjC1WVbEob6NhzVh9qFydCR
2016-08-23 1723 165MB QmRrbEHYyDLSF3d7ghVSpRS2TqrLRhkFxXCsJSFBjuSaCs
2016-08-19 1722 165MB QmdDtTn5W1cQyKDQjubVwEACazjhhP2f7VaNew5bZaBsk7
2016-08-17 1721 164MB QmPMrtopMKBmsW2AMtNNkvdYu9VoHAEwAYgWTFKZgNNqe2
2016-08-15 1720 165MB QmXfn9kftq3DNHPoEbwonYdmEdwyH6BRENMCHscGXaymRm
2016-08-13 1719 165MB QmZJHTHXjGnZN4FtrxzZprNtSyFRg8x9t2pLpuE2jjrzad Fix .delayed files causing some comics to be broken.
2016-08-13 1719 165MB QmauMY4ux6jQVkGmphhzWwULasZ4RPMYxvWkppGv1ZpAL3
2016-08-11 1718 172MB QmbcvivamWCKUuQjdTbCHNBy74qehU6uWTCdyBw3sN8X6b this archive unfortunately includes the cache folder.

Looks like the currently referenced version on the website isn't fully available.

@fazo96 do you have the original archive that's currently linked to on the archives.ipfs.io site? https://ipfs.io/ipfs/QmPVP4sDre9rtYahGvcjv3Fqet3oQyqrH5xS33d4YBVFme

It doesn't currently seem to be fully available, but if you still have it I can pin it to my ipfs node. I'd try to reproduce the archive using the script in the archive, but I could only guess what the exact text was in the about and license files.

FWIW I just generated a new version of fazo96's archive that's linked to from the site and pinned it to my IPFS node, so the comics that I couldn't access through the gateway before (in the archive linked from archives.ipfs.io) now seem to be accessible. The about and license files still seem to be unavailable, so I added the relevant pages from the website to the version of the archive I just created.

Qmb8wsGZNXt5VXZh1pEmYynjB6Euqpq3HYyeAdw2vScTkQ

Awesome, gonna pull that onto one of our storage nodes too. @leerspace wanna make a PR to update the site?

Cool thanks, I just updated https://archives.ipfs.io

@leerspace sorry for replying late, looks like I lost my copy of the original archive. Thanks for updating it! 👍

Hello, I've updated the archive using the xkcd-downloader.js script offered in the repo, and it now has all comics up to the latest today (1862). It is currently pinned on my laptop, but I will pin it to my server when I get home so it will be available at all times.

QmdmQXB2mzChmMeKY47C43LxUdg1NDJ5MWcKMKxDu7RgQm

Awesome, thanks @chosenken -- also pinned it on nihal.i.ipfs.io

Updated again to 1864, but this time attached it to an ipns: QmTaW8vRj4SkM6JhqVhAsibQE9PdJb5PQ2FMwPPc6gBi2h. I might work on a script that pulls new comics down and updates the ipns when it changes.


I'd like to update this one again, but to facilitate programmatic access, I'd like to change the structure slightly to something more like:

/ipfs/Qmahash/1/1 - Barrel - Part 1.png
...
/ipfs/Qmbhash/2003/2003 - Presidential Succession.png

where the comic files are contained within a 'folder' defined by the number rather than number and name. Any issues with this? I can host on our server, but I'd also be happy to submit a PR to update the archives.

@carsonfarmer that'd be rad. I've no objection to simplifying the folder structure.

I plan to feature this data set on the start page of the new IPLD Explorer page in the ipfs-webui.

@carsonfarmer could we get some zero padding on those indexes?

/ipfs/QmHash/0001/0001 - Barrel - Part 1.png
...
/ipfs/Qmbhash/2003/2003 - Presidential Succession.png
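A sketch of building that padded layout (the helper name is illustrative, and the pad width of 4 matches the example paths):

```javascript
// Build the "NNNN/NNNN - Title.png" layout requested above.
function comicPath(num, title) {
  const id = String(num).padStart(4, "0");
  return `${id}/${id} - ${title}.png`;
}

console.log(comicPath(1, "Barrel - Part 1")); // 0001/0001 - Barrel - Part 1.png
```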

Ah sorry, was on vacation. Yes I'll update the indexes and post here when ready.


I've written a new program in Go that creates an archive such as the following: /ipfs/QmdAChzF2JQCx9icrmYHZhFdRSv9TpRjq5q1v5b3ANpxRf. It also includes a CSV with an index of post titles, publish dates, and post numbers. I have submitted a PR: ipfs/awesome-ipfs#193

If I were someone who wanted to start pinning content like XKCD on their own node(s) to help network redundancy, which hash would I use? There are many different hashes presented in this thread and it's not clear to me which one is the most relevant or up to date.

This seems to be the most up-to-date: Qmb8wsGZNXt5VXZh1pEmYynjB6Euqpq3HYyeAdw2vScTkQ

This one seems to be an exact replica of the first but it has a different hash, perhaps due to the lack of .html on "about" and "license": QmPVP4sDre9rtYahGvcjv3Fqet3oQyqrH5xS33d4YBVFme

https://archives.ipfs.io/ seems to favor the first one, TkQ.

And this seems to be the same content as everything above but in a different structure: https://ipfs.io/ipfs/QmdAChzF2JQCx9icrmYHZhFdRSv9TpRjq5q1v5b3ANpxRf

So, which hash do I pin?

I think it would be a good idea to zero-pad the numbers: right now it's impossible to browse.

I added padding to the downloader: /ipfs/QmX4pR3KKdivwY9Pn5mHNYi5FRhtTqapMfFmW4SYesstxU/xkcd-downloader.js
Whole: https://ipfs.io/ipfs/QmX4pR3KKdivwY9Pn5mHNYi5FRhtTqapMfFmW4SYesstxU
I'm not sure why it's not sorted though

I've just created an XKCD archive at /ipns/xkcd.hacdias.com. It is updated every day 😄 Please see the repository for more info: https://github.com/hacdias/xkcd.hacdias.com

@Stebalien should we update the index to /ipns/xkcd.hacdias.com?

I take it your comment is moot now that this has all moved to awesome.ipfs.io?

Is it possible to have two links? Ideally, we'd link to an immutable version as well.

@Stebalien we could also add an immutable version, but that would be a snapshot somewhere in the past. But yes, we could add it as a description perhaps.

Yeah, I know. It's just that archives that rely on DNS sketch me out a bit.