cdnjs

victorb opened this issue 9 years ago · comments

Would love to have an IPFS-compatible fork of https://github.com/cdnjs/cdnjs serving files via IPFS. It's a super large repository, but I will try to develop the integration locally.

Juan Benet · Answer 1 · Tue Oct 27 2015 17:20:41 GMT+0800 (China Standard Time)

cc @lgierth

David A Roberts · Answer 2 · Wed Oct 28 2015 08:51:51 GMT+0800 (China Standard Time)

On hold until ipfs add performance improves, too many small files

Andrew Chin · Answer 3 · Sat Dec 12 2015 01:51:26 GMT+0800 (China Standard Time)

Ok well I bit the bullet and added this to IPFS. Here is the result: http://ipfs.io/ipfs/QmZJKsLpebYqHApRLeoLhj2NsXZ2JoXqhSWkxafxaBXYu7

A few observations. Note that for all of the comments below, the daemon was not running, and I was using IPFS version 0.3.11-dev.

Performance wasn't as terrible as I thought it would be, but it was pretty darn slow. It took about 15 or 16 hours to add everything.
When I first added things, I did it one directory at a time:
$ ls ajax/libs | xargs -n1 ipfs add -r
As expected, this kept memory usage in check (mostly).
Later, after all the data was in IPFS, I tried adding the entire ajax/libs directory at once. This went quite quickly, but memory usage ballooned to about 75% of my machine's available memory (about 24 GB used out of a total of 32 GB). I killed ipfs at that point, so it never ran to completion.
Since I wanted to have a single tree containing everything, and I wasn't able to add the whole directory at once, I had to resort to building up a tree with ipfs object patch, by manually calling add-link for each directory. This was very easy to do, but it was very slow! It appears that this large repo is causing some serious delays when running IPFS commands:

$ time ipfs object new unixfs-dir
QmUNLLsPACCz1vLxQVkXqqLX5R1X345qqfHbsf67hvA3Nn

real    0m14.855s
user    0m19.136s
sys     0m0.632s

This means that it took 6 hours to call object patch add-link for every top-level directory. I believe this 15 second delay is just startup overhead, so if I had had the daemon running at this point, the process would probably have been much faster (since there would be no startup delay).

IPFS deduplication appears to be helping a lot here! Compare the size of the cdnjs/ajax/libs directory to ~/.ipfs:

achin@diax:~/devel$ du -sh cdnjs/ajax/libs/ ~/.ipfs/
22G     cdnjs/ajax/libs/
13G     /home/achin/.ipfs/

I took a look at every single hash to see which ones were duplicated. In total, about 180,000 different hashes were deduped at least once in this tree. The most deduped hash was a small 1x1 PNG file with 12376 links. The second most deduped file is an empty file with 696 links.

I've uploaded a file that lists all dedups. I haven't done the analysis to see if it would account for the savings reported by du
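
A count like the one above can be approximated with standard tools. This is a minimal sketch, not the script used to produce the uploaded list: it assumes the input has one hash per line with repeats preserved (for example the output of ipfs refs -r without the unique option), and count_dups is a hypothetical helper name.

```shell
#!/bin/sh
# count_dups: read one hash per line on stdin and print "count hash" for
# every hash that occurs more than once, most frequent first.
count_dups() {
  sort | uniq -c | sort -rn | awk '$1 > 1 { print $1, $2 }'
}

# Example usage (assumes a running node and a root hash in $ROOT):
#   ipfs refs -r "$ROOT" | count_dups | head
```

The first output line is the most frequently linked hash, which for this tree would be expected to be the 1x1 PNG mentioned above.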

Time to pin everything. Not terrible:

$ time ipfs pin add -r QmZJKsLpebYqHApRLeoLhj2NsXZ2JoXqhSWkxafxaBXYu7
pinned QmZJKsLpebYqHApRLeoLhj2NsXZ2JoXqhSWkxafxaBXYu7 recursively

real    7m24.915s
user    5m16.676s
sys     0m32.560s

I've uploaded a copy of the output of ipfs diag sys for your info

David A Roberts · Answer 4 · Sat Dec 12 2015 14:10:45 GMT+0800 (China Standard Time)

@eminence awesome, thanks for tackling this and writing up the details :)

Cc @whyrusleeping @rht @diasdavid

Andrew Chin · Answer 5 · Sun Dec 13 2015 14:51:20 GMT+0800 (China Standard Time)

Second day follow up notes:

Updating this repo is pretty straight-forward. After fetching the cdnjs repo, and fast-forwarding to the latest master, I can use git diff --dirstat to see just the libraries that changed. This makes it easy to ipfs add -r -q ajax/libs/library_that_was_updated | tail -1 and then ipfs object patch add-link. I've confirmed that the object patch is nice and quick when I leave the daemon running.
Note that I'd love a mode in ipfs add that just gives me the top-level hash of the thing that I'm adding, so I can dispense with the tail -1 stuff.
This manual tree management is very doable, but also annoying. I admit I've not been following the Files API stuff -- would it help here?
To make use of these hosted libraries, I've added them to IPNS here: http://ipfs.io/ipns/em32.net/archives/cdnjs

But IPNS is still too slow. Or rather, it is not consistently fast. I wrote a script that requests the same file several times and records how long each request takes. The data for 60 requests is here. The summary is that most requests are quick (about 0.5 seconds or less). But some requests take 10 seconds, others up to 60 seconds. When your website is loading many different assets (like cdnjs libs), this can result in very noticeable delays.
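
A measurement of this kind can be sketched as follows. This is not the original script; it is a minimal sketch assuming curl is available, and time_requests and slow_count are hypothetical helper names.

```shell
#!/bin/sh
# time_requests URL COUNT: fetch URL COUNT times and print the total time of
# each request in seconds, one per line.
time_requests() {
  url=$1
  count=$2
  i=0
  while [ "$i" -lt "$count" ]; do
    curl -s -o /dev/null -w '%{time_total}\n' "$url"
    i=$((i + 1))
  done
}

# slow_count THRESHOLD: read timings on stdin and print how many of them
# exceed THRESHOLD seconds.
slow_count() {
  awk -v limit="$1" '$1 + 0 > limit + 0 { n++ } END { print n + 0 }'
}

# Example usage (the path after the IPNS name is a placeholder):
#   time_requests "http://ipfs.io/ipns/em32.net/archives/cdnjs/<lib>/<file>" 60 | slow_count 1
```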

I have (anecdotal) evidence that even trying to ipfs get some of these large cdnjs directories can cause run-away memory leaks. This may make it difficult for this repo to be effectively mirrored/pinned.
I've committed to @whyrusleeping to re-run these tests with the 0.4.0 branch. I'll report back in the coming days

David A Roberts · Answer 6 · Sun Dec 13 2015 16:55:38 GMT+0800 (China Standard Time)

Note that I'd love a mode in ipfs add that just gives me the top-level hash of the thing that I'm adding, so I can dispense with the tail -1 stuff.

I think ipfs add -q does that

This manual tree management is very doable, but also annoying. I admit I've not been following the Files API stuff -- would it help here?

Probably, yes.

Juan Benet · Answer 7 · Mon Dec 14 2015 05:44:12 GMT+0800 (China Standard Time)

I think ipfs add -q does that

ipfs add -q outputs only hashes. still need tail -n1. but yeah, we should add a flag that outputs only the last root. maybe --only-root ? or -Q, --root-quiet? idk.
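
Until such a flag exists, the tail step can be wrapped once and reused. A minimal sketch, where add_root is a hypothetical helper name and the only assumption is the behaviour described above (one hash per line, root last). Later go-ipfs releases are understood to offer -Q/--quieter for this purpose; check ipfs add --help on your version before relying on it.

```shell
#!/bin/sh
# add_root DIR: recursively add DIR and print only the hash of its root.
# `ipfs add -q` prints one hash per added object, with the root last.
add_root() {
  ipfs add -r -q "$1" | tail -n 1
}

# Example usage:
#   LIB_HASH=$(add_root ajax/libs/library_that_was_updated)
```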

Juan Benet · Answer 8 · Mon Dec 14 2015 05:44:32 GMT+0800 (China Standard Time)

This manual tree management is very doable, but also annoying. I admit I've not been following the Files API stuff -- would it help here?

yes, very much so.

Juan Benet · Answer 9 · Mon Dec 14 2015 05:45:10 GMT+0800 (China Standard Time)

But IPNS is still too slow. Or rather, it is not consistently fast. I wrote a script that requests the same file several times and records how long each request takes. The data for 60 requests is here. The summary is that most requests are quick (about 0.5 seconds or less). But some requests take 10 seconds, others up to 60 seconds. When your website is loading many different assets (like cdnjs libs), this can result in very noticeable delays.

yes, ipns is still very slow. there's caching (the fast results), but we need to fix this at the dht query level

Andrew Chin · Answer 10 · Tue Dec 15 2015 22:08:48 GMT+0800 (China Standard Time)

After testing with 0.4.0, the experience was much better!

Memory usage was totally under control, and mostly stable the entire time.
I was able to add the entire libs directory at once. This took about 9 hours, which is much quicker than before. This isn't a totally fair comparison, since I used different hardware for each test (0.4.0 versus 0.3.8), but even trying to take into account the hardware differences, I believe 0.4.0 was faster
Rescanning the entire libs directory, after the initial add, took about 50 minutes. This would represent the time needed to do an incremental update. Obviously using the git diff trick from above would improve this, but 50 minutes seems pretty acceptable to me.

If anyone is running a 0.4.0 node, the result is here: QmRnvPSCNmYHdYQAo6JUWJPW8uVQv7z6D9nSQmw5qbHVWy

Juan Benet · Answer 11 · Wed Dec 16 2015 00:23:39 GMT+0800 (China Standard Time)

Good news. Though still waaaay too slow for my liking.

Adding concurrent add will help here. I believe we have not added this.

Andrew Chin · Answer 12 · Wed Dec 16 2015 04:04:07 GMT+0800 (China Standard Time)

For me, the gold standard of merkledags is probably git :) So I timed how long it takes to add this directory tree into a new git repo -- 41 minutes! About an order of magnitude faster than IPFS.

Whyrusleeping · Answer 13 · Wed Dec 16 2015 09:15:30 GMT+0800 (China Standard Time)

@eminence yeah, we've got a little ways to go still, but keep in mind that git uses a faster hashing algorithm, and doesnt chunk objects. I have one more changeset to apply that should get close to leveling the playing field. Just have to polish it a bit.

rht · Answer 14 · Wed Dec 16 2015 12:51:41 GMT+0800 (China Standard Time)

In some cases, e.g. lots of 1MB files instead of 1KB files, ipfs is way faster than git ipfs/kubo#1973 (comment)

rht · Answer 15 · Wed Dec 16 2015 12:54:48 GMT+0800 (China Standard Time)

git uses a faster hashing algorithm, and doesnt chunk objects.

The chunking equivalent in git would be git gc, but only after several revisions?

Juan Benet · Answer 16 · Wed Dec 16 2015 18:17:36 GMT+0800 (China Standard Time)

we should consider moving the default to https://blake2.net/ it's designed for this.

rht · Answer 17 · Wed Dec 16 2015 22:34:02 GMT+0800 (China Standard Time)

That would explain rsync's speed, since it uses blake2. Perhaps it is possible to go ~O(rsync) with this.

It would be more effective, though, to get blake2 into 0.4.0 before its release (combining all the incompatible changes into one release?).

Juan Benet · Answer 18 · Thu Dec 17 2015 03:36:45 GMT+0800 (China Standard Time)

It would be more effective, though, to get blake2 into 0.4.0 before its release (combining all the incompatible changes into one release?).

blake2 isnt incompatible. https://github.com/jbenet/multihash :)

Juan Benet · Answer 19 · Thu Dec 17 2015 03:37:00 GMT+0800 (China Standard Time)

oh i guess it is, because it's not included in the 0.3.x codebase, right.

Juan Benet · Answer 20 · Thu Dec 17 2015 03:37:42 GMT+0800 (China Standard Time)

yeah it would be nice, but idk if we can land it in time. @whyrusleeping wants to ship 0.4.0 soon. would be nice to add blake2 and ipld support, but not sure if we'll get there in time

Andrew Chin · Answer 21 · Thu Dec 17 2015 04:59:17 GMT+0800 (China Standard Time)

Even though git is super fast here, ipfs isn't unusably slow. As an end user, I have so many more things on my wishlist that are more important to me than speed.

(edit: but i do indeed understand the desire to put all breaking changes into 1 release)

David A Roberts · Answer 22 · Thu Dec 17 2015 09:39:08 GMT+0800 (China Standard Time)

@rht What percentage of the total runtime is currently consumed by the hash function?

rht · Answer 23 · Fri Dec 18 2015 10:50:23 GMT+0800 (China Standard Time)

...I instead tested dev0.4.0+blake2b. There is more CPU consumption and a speedup, but the speedup is not noticeable because of jitter like that in ipfs/kubo#2039 (comment). Perhaps it could become significant once dev0.4.0 gets to ~O(git) or ~O(rsync). Though I could have just checked the runtime percentage of the hashing.

Currently, both ipfs and git are slow for adding large things.
edit: s/moving/adding/

Jakub Sztandera · Answer 24 · Mon Feb 22 2016 14:36:49 GMT+0800 (China Standard Time)

I re-archived it on my own. Available under fs:/ipns/cdnjs.ipfs.ovh

Notes: It took about 6h, I had one crash during that time.

Christian Ferrier · Answer 25 · Fri Jun 10 2016 02:52:37 GMT+0800 (China Standard Time)

Is anyone interested in a deterministic, efficient node script that wraps the command line to do the manual git pull / git diff / ipfs add / ipfs object patch work, applying the changes to the existing index you last put into the IPFS network? It could also handle the memory issue (restarting the daemon, or sets of potentially offending processes, when they get near a threshold).

About four months ago I did this, and I'll need to dig up my work sitting on one of my VPSes; a simple, dumb 'ipfs add -r' every time I git pulled was quite expensive. It might be reusable for other related archive targets (other big git repos that are more archive repositories than single codebases). It just requires 'git', 'ipfs' and 'node' on the path of the indexing machine. Maybe someone has already done this, or perhaps the ipfs internals have improved enough to remove the need for it?

I just found out about this "IPFS Archives" project at the Decentralized Web Summit, pretty exciting. @whyrusleeping still doing the workshop on IPFS Archives and Versioning, so tagging him.

Jakub Sztandera · Answer 26 · Fri Jun 10 2016 02:58:33 GMT+0800 (China Standard Time)

Me and @magik6k are working on adding the cdnjs currently.

I added whole cdnjs some time ago, it is available under fs:/ipns/cdnjs.ipfs.ovh but I think most of that data is currently gone (due to problems with local IPFS repo).

Adding cdnjs is nice stress test for IPFS: ipfs/kubo#2823 ipfs/kubo#2828

Jakub Sztandera · Answer 27 · Mon Jun 13 2016 18:20:56 GMT+0800 (China Standard Time)

Note about publishing the cdnjs:

We shouldn't use /ipns/ to reference the cdnjs in HTML. IPNS currently doesn't work with browser level caching (no HTTP 304 response code if path is IPNS path).
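
One way to follow this advice is to generate asset URLs from a fixed root hash, so every link is an immutable /ipfs/ path. A minimal sketch: ipfs_asset_url is a hypothetical helper name, the ipfs.io gateway prefix is the one used elsewhere in this thread, and the library path in the usage line is a made-up placeholder.

```shell
#!/bin/sh
# ipfs_asset_url ROOT PATH: print an immutable gateway URL for PATH under the
# directory whose root hash is ROOT. One leading slash on PATH is tolerated.
ipfs_asset_url() {
  printf 'https://ipfs.io/ipfs/%s/%s\n' "$1" "${2#/}"
}

# Example usage (hypothetical library path):
#   ipfs_asset_url "$ROOT" somelib/1.0.0/somelib.min.js
```

The trade-off is that the root hash in the HTML has to be regenerated whenever the archive is updated.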

Sean Lang · Answer 28 · Tue Jun 14 2016 02:31:54 GMT+0800 (China Standard Time)

Updated CDNJS hash: https://ipfs.io/ipfs/QmPJnEf5933cXteZmaMJkphCW1CtpcMMVx7N6rUr8cZAok
And a little script for generating it, in case anyone else wants to try:

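#!/bin/bash
# Start from an empty unixfs directory and link each library into it.
HASH=$(ipfs object new unixfs-dir)
for FILE in cdnjs/ajax/libs/*; do
  LIB=$(basename "$FILE")
  echo "adding $LIB"
  # -H includes hidden files; the last line of the -q output is the root hash
  LIB_HASH=$(ipfs add -r -H -q "$FILE" | tail -n 1)
  HASH=$(ipfs object patch "$HASH" add-link "$LIB" "$LIB_HASH")
done
echo "final hash: $HASH"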

...However, this doesn't handle symlinks, which do exist in CDNJS. I've gotta decide how to deal with those.

Andrew Chin · Answer 29 · Tue Jun 14 2016 02:36:30 GMT+0800 (China Standard Time)

What needs handling exactly (with regard to symlinks)?

Sean Lang · Answer 30 · Tue Jun 14 2016 02:48:21 GMT+0800 (China Standard Time)

If you click on one right now (like https://ipfs.io/ipfs/QmPJnEf5933cXteZmaMJkphCW1CtpcMMVx7N6rUr8cZAok/zocial) it's broken, whereas https://cdnjs.cloudflare.com/ajax/libs/zocial/1.2.0/css/zocial.css works.

Andrew Chin · Answer 31 · Tue Jun 14 2016 02:55:33 GMT+0800 (China Standard Time)

Ahh. I see, yes. I would just pretend the symlink doesn't exist and just add the contents of the directory. Let IPFS's intrinsic de-dup handle it from there
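
One way to do that is to make a copy in which each symlink is replaced by the content it points to, and add the copy. A minimal sketch, assuming the symlinks are not dangling and that there is disk space for the copy; copy_without_symlinks is a hypothetical helper name.

```shell
#!/bin/sh
# copy_without_symlinks SRC DST: copy SRC to DST, replacing every symlink
# with a copy of whatever it points to (cp -L follows symlinks).
copy_without_symlinks() {
  cp -RL "$1" "$2"
}

# Example usage (the /tmp destination is arbitrary):
#   copy_without_symlinks cdnjs/ajax/libs/zocial /tmp/zocial
#   ipfs add -r -q /tmp/zocial | tail -n 1
```

The duplicated files cost space only in the temporary copy; once added, identical content should map to the same blocks.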

Andrew Chin · Answer 32 · Tue Jun 14 2016 21:19:27 GMT+0800 (China Standard Time)

BTW, I am trying to pin this hash, but I can't download everything. Are you still seeding it, @slang800 ?

Jakub Sztandera · Answer 33 · Tue Jun 14 2016 21:44:57 GMT+0800 (China Standard Time)

@eminence instead of pinning it right away, you are better off running ipfs refs -r $HASH first; this way you will see progress. Just make sure to set a high GC limit (cdnjs will be about 20GB), and pin it afterwards.
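
The two steps can be wrapped together. A minimal sketch: fetch_then_pin is a hypothetical helper name, and the progress interval of 1000 refs is an arbitrary choice.

```shell
#!/bin/sh
# fetch_then_pin ROOT: walk the whole DAG first so progress is visible,
# then pin it recursively.
fetch_then_pin() {
  root=$1
  # Each line printed by `ipfs refs -r` is a reference the node has resolved;
  # print a running count every 1000 lines and a total at the end.
  ipfs refs -r "$root" |
    awk 'NR % 1000 == 0 { print NR " refs so far" } END { print NR " refs total" }'
  # Note: the pin runs even if the walk above was interrupted or failed.
  ipfs pin add -r "$root"
}

# Example usage:
#   fetch_then_pin QmPJnEf5933cXteZmaMJkphCW1CtpcMMVx7N6rUr8cZAok
```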

Andrew Chin · Answer 34 · Tue Jun 14 2016 21:54:19 GMT+0800 (China Standard Time)

Actually, I am already doing that (using ipfs refs). I was trying to run ipfs refs QmWLDMm1CC2E5f7M5dcbi9hW9K9XuckFUfGc39GRgGYtQC, but after about 30 hours of waiting nothing came back. But I just restarted the command just now and it returned instantly. Not sure what that means... ipfs bug?

Andrew Chin · Answer 35 · Tue Jun 14 2016 22:01:08 GMT+0800 (China Standard Time)

Or maybe I just can't connect to @slang800's node?

When I run ipfs dht findprovs QmeQYktmYVqAbRju1boAyimC9J1dAACYY7h5KNqfEEEamb, nothing at all is returned. But if I request that hash via the ipfs.io public gateways, it loads almost immediately. Then ipfs dht findprovs returns something:

> ipfs dht findprovs QmeQYktmYVqAbRju1boAyimC9J1dAACYY7h5KNqfEEEamb
QmVyqFjQJTqVmKRBk4sL9F9Af7fCRdA9YNK845NSHRD8zJ
QmSoLer265NRgSp2LA3dPaeykiS1J6DifTC88f5uVQKNAd

> ipfs swarm peers |wc -l
117

So I'm a little confused about what's actually happening here

Jakub Sztandera · Answer 36 · Tue Jun 14 2016 22:07:28 GMT+0800 (China Standard Time)

It is possible that you can't connect to @slang800's node directly. Maybe his node failed to penetrate NAT and is only connected to SolarNet nodes, or can't accept connections. That would be why you can see things work via ipfs.io but not through your own instance.

Sean Lang · Answer 37 · Tue Jun 14 2016 22:14:18 GMT+0800 (China Standard Time)

Sorry - I was hosting this on my desktop and turned it off this morning to move my desk to the opposite side of the room. It should be up now. :D

Jakub Sztandera · Answer 38 · Tue Jun 14 2016 22:32:20 GMT+0800 (China Standard Time)

@slang800 can you share your peerID (result of ipfs id -f"<id>\n") for debugging?

Sean Lang · Answer 39 · Tue Jun 14 2016 23:08:58 GMT+0800 (China Standard Time)

Sure, it's QmVKZAcpqrrQoBsc7eyEFnqFjHTDAUdJWMTBDJ34jR7ueU

Victor Bjelkholm · Answer 40 · Wed Jan 11 2017 18:49:52 GMT+0800 (China Standard Time)

Submitted an updated build based on github.com/cdnjs/cdnjs commit 4fabd85c986d57a61e0fbd8504cf15d67f60ada6 here: #82

New hash would be: QmRrnfFUgx81KZR9ibEcxDXgevoj9e5DydB5v168yembnX - https://ipfs.io/ipfs/QmRrnfFUgx81KZR9ibEcxDXgevoj9e5DydB5v168yembnX

It's stored at Pollux right now.

Peter Dave Hello · Answer 41 · Sat Jul 01 2017 17:04:41 GMT+0800 (China Standard Time)

@cdnjs maintainer here, is there anything I can help with here?

Victor Bjelkholm · Answer 42 · Sat Jul 01 2017 19:12:09 GMT+0800 (China Standard Time)

@PeterDaveHello Thanks for checking in here! I think something that would be really useful is adding "ipfs.io" as one of the CDN providers, however, I'm not sure if you have support for adding more, currently it's just Cloudflare there without the ability to change.

If we did that, we would need to set up the updating/adding to be more automatic; right now it's me doing a manual git pull and adding it to IPFS, then making a PR here. The flow would be something like:

Once a day, update from git
Add to IPFS
Submit updated hash as a PR to this repository
Submit updated hash as a PR to cdnjs/cdnjs
You make a new deploy with updated hash on the website

What do you think?

Peter Dave Hello · Answer 43 · Sat Jul 01 2017 19:42:28 GMT+0800 (China Standard Time)

I'm afraid that we can't do that, as we update the library and website every 5~10 mins since the 3k+ libraries update very frequently. This is all automatic, without manual review and merge.

I also wonder whether it's a good idea for me to push ipfs when I don't understand the project well enough. Once we provide a service officially, we are responsible for it, especially when anything goes wrong. So, sorry, that might not be something I can do right now. Maybe I can help update the files from my side if you want; currently, the files look dated on ipfs.

Victor Bjelkholm · Answer 44 · Sat Jul 01 2017 19:48:21 GMT+0800 (China Standard Time)

I'm afraid that we can't do that, as we update the library and website every 5~10 mins since the 3k+ libraries update very frequently. This is all automatic, without manual review and merge.

Yeah, understandable, and making it all automatic is a much better way to go from the get-go so makes sense.

Something else we can do from our side is having the same interface you run on cdnjs.com but slightly modified to hook-up into our version of cdnjs, and deployed on cdnjs.ipfs.io or something like that. Would need to make sure it's always up-to-date, which will take some effort but not be super hard.

Maybe I can help update the files from my side if you want; currently, the files look dated on ipfs.

Yeah, as I mentioned, the process is right now manual but in reality, should be fully automated. Will have some more thoughts about this at a later point.

Thanks for jumping in here and sharing your thoughts 👍

Peter Dave Hello · Answer 45 · Sat Jul 01 2017 20:08:07 GMT+0800 (China Standard Time)

@victorbjelkholm thanks! Let me know if I can help keep cdnjs on ipfs more up to date and updated more frequently :)

Łukasz Magiera · Answer 46 · Sat Jul 01 2017 20:15:28 GMT+0800 (China Standard Time)

IPFS stores files much like git does, so updating it 'live' shouldn't really be a problem. This updating could be done using the ipfs files or ipfs object api quite easily.

The best way I can see it done in case of cdnjs is to have a tool that would apply updates based on which files changed in git commits.

Only thing I'm not sure about is how IPNS would react to that frequency of updates.
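
The first half of such a tool, working out which libraries a range of commits touched, can be sketched with plain git and awk. This is a minimal sketch, not an existing tool: changed_libs is a hypothetical helper name, and it assumes the cdnjs layout of ajax/libs/<library>/... seen earlier in this thread.

```shell
#!/bin/sh
# changed_libs: read changed file paths (one per line) on stdin and print the
# unique library names found under ajax/libs/.
changed_libs() {
  awk -F/ '$1 == "ajax" && $2 == "libs" && NF >= 3 { print $3 }' | sort -u
}

# Example usage (OLD and NEW are commit ids):
#   git diff --name-only "$OLD" "$NEW" | changed_libs
```

Each name printed would then be re-added with ipfs add -r and re-linked into the root with ipfs object patch add-link, as in the script posted earlier in this thread.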

Peter Dave Hello · Answer 47 · Sat Jul 01 2017 23:41:27 GMT+0800 (China Standard Time)

Yeah we can try to integrate that in our buildScript
