ipfs-inactive / archives

[ARCHIVED] Repo to coordinate archival efforts with IPFS

Home Page: https://awesome.ipfs.io/datasets

Ensure that Rabin fingerprinting works with large datasets

flyingzumwalt opened this issue

From https://botbot.me/freenode/ipfs/2017-01-29/?msg=80105342&page=1

[ani 10:03 pm] Rabin seems to be failing at larger jobs. Will try with more memory and CPU but it's stalling around 5GB/49

@whyrusleeping Could you re-add the test dataset from #126 using Rabin fingerprinting to make sure it doesn't choke?

From what I understand, Rabin chunking isn't cheap, and that is probably why it chokes on this dataset.

Before we invest time in it, I would recommend checking whether it gives any benefit over normal chunking in each of three areas: within a file, across files (in a directory), and across datasets.

From what I understand, it might also prevent some other deduplication from happening.
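One way to run the comparison suggested above is to count how many bytes a given chunker would actually need to store. The following is a minimal sketch, not IPFS code: `dedup_savings` and `fixed_chunks` are hypothetical helper names, and SHA-256 over raw chunks stands in for real block hashing.

```python
import hashlib
from typing import Callable, Iterable, List

# A chunker is any function that splits a byte string into chunks.
Chunker = Callable[[bytes], List[bytes]]


def fixed_chunks(data: bytes, size: int = 256 * 1024) -> List[bytes]:
    """Fixed-size chunking: cut every `size` bytes, regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def dedup_savings(blobs: Iterable[bytes], chunker: Chunker) -> float:
    """Fraction of input bytes that fall in chunks already seen earlier.

    0.0 means nothing was deduplicated; 0.5 means half the bytes were
    duplicates of chunks stored before.
    """
    seen = set()
    total = stored = 0
    for blob in blobs:
        for chunk in chunker(blob):
            total += len(chunk)
            digest = hashlib.sha256(chunk).digest()
            if digest not in seen:
                seen.add(digest)
                stored += len(chunk)
    return 0.0 if total == 0 else 1 - stored / total
```

Passing a single file's bytes measures in-file savings, passing every file in a directory measures cross-file savings, and passing the files of two datasets together measures cross-dataset savings. Running the same inputs through two different chunkers gives the comparison against normal chunking.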

I appreciate your caution. In theory, Rabin fingerprinting should be beneficial for exactly this case, where many people have downloaded the same datasets from the same sources but might have slight variations in their copies. Our default chunking algorithm (fixed-size 256 KiB chunks) gives such near-identical files little chance to deduplicate, because a single inserted or removed byte shifts every chunk boundary after it. People like @20zinnm are motivated to test how the code performs for this use case. I want to make sure that the code is ready for them to proceed.

Keep in mind:

  • We can't do the tests you've suggested if the chunking function fails to even process the input files.
  • Though Rabin fingerprinting is not the default chunking algorithm, it is an officially supported one. It should work. If it's not working, we should at least record and diagnose the bug.
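To illustrate why slightly different copies matter, here is a small sketch comparing fixed-size chunking with content-defined chunking when one byte is inserted at the front of a file. It is an assumption-laden toy, not the go-ipfs implementation: the boundary test uses a BLAKE2 hash recomputed over a trailing window as a stand-in for a true rolling Rabin fingerprint, and the function names, window size, mask, and chunk sizes are all made up for the example.

```python
import hashlib
import random


def fixed_chunks(data: bytes, size: int = 64) -> list:
    """Cut every `size` bytes, regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def content_defined_chunks(data: bytes, window: int = 16, mask: int = 0x3F,
                           min_size: int = 16, max_size: int = 256) -> list:
    """Cut wherever a hash of the trailing `window` bytes has its low bits zero.

    Stand-in for Rabin fingerprinting: the window hash is recomputed at each
    position instead of being rolled, which is slow but keeps the code short.
    Requires min_size >= window so the window never reaches before byte 0.
    """
    chunks, start = [], 0
    for i in range(len(data)):
        length = i - start + 1
        if length < min_size:
            continue
        digest = hashlib.blake2b(data[i - window + 1:i + 1], digest_size=4).digest()
        at_boundary = (int.from_bytes(digest, "big") & mask) == 0
        if at_boundary or length >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


def shared_chunks(a: list, b: list) -> int:
    """Number of distinct chunks that appear in both lists."""
    return len(set(a) & set(b))


if __name__ == "__main__":
    rng = random.Random(0)
    original = bytes(rng.getrandbits(8) for _ in range(20_000))
    modified = b"X" + original  # same data with one byte inserted at the front
    for name, chunker in [("fixed", fixed_chunks),
                          ("content-defined", content_defined_chunks)]:
        a, b = chunker(original), chunker(modified)
        print(f"{name}: {shared_chunks(a, b)} of {len(set(a))} chunks shared")
```

With fixed-size chunks the inserted byte shifts every later boundary, so the two copies share no chunks. With content-defined boundaries the chunking typically resynchronizes after the first cut point, so most chunks are shared. The per-byte hashing is also a hint at why this kind of chunking costs more CPU than fixed-size splitting.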

I'll take #136 as confirmation that Rabin fingerprinting works for large datasets. Great work @20zinnm. I'll open a new issue for the tests that @Kubuxu suggested.

@flyingzumwalt It's still very heavy in terms of performance and needs high specs for anything larger than a few gigabytes. But yes, in principle it should work.