is it ok to suggest a "custom" codec based on AWS S3 ETag?

Question

yarikoptic opened this issue 4 years ago

Yaroslav Halchenko commented 4 years ago

Can we "reserve" a tag in table.csv for a custom codec based on the AWS S3 ETag (md5 of chunk md5s) + chunk size + total size?

Background: For our project (DANDI archive; see dandi/dandi-archive#146 for more info) we need to (1) ensure the validity of uploads to the S3 bucket and (2) de-duplicate uploads. Since any out-of-band hash computation (e.g. of sha256) poses additional logistical and computational challenges, we would like to use an S3 ETag-based hash. That gives us "free", immediate assurance that the file's hash corresponds to the upload, and we can then use that hash in the DB/keystore to avoid expensive re-uploads of the same data blob. Since an S3 upload can be chunked arbitrarily, we cannot simply use the ETag: it would not be recomputable given the payload/file alone. That is why we would like to establish our own "algorithm" for deciding on the chunk size (so that the chunking, and thus the ETag, stays consistent), and then complement the computed ETag with that chunk size (to facilitate recomputation even in the absence of our chunk-size decision algorithm) and the total size (it is useful information anyway, and could help avoid an unlikely md5 hash collision).
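A minimal sketch of the idea above, for illustration only. It assumes the commonly observed multipart ETag layout (hex md5 of the concatenated binary md5 digests of the parts, followed by `-` and the part count) and a fixed chunk size applied to every part. The function names and the `etag:chunk_size:total_size` key format are hypothetical, not something DANDI or AWS defines; single-part uploads and some server-side encryption modes produce ETags of a different form.

```python
import hashlib
import io


def s3_multipart_etag(stream, chunk_size):
    """Return (etag, total_size) for a binary stream split into fixed-size parts.

    Assumes the multipart layout: md5 over the concatenated binary md5
    digests of each part, then "-<number of parts>".
    """
    part_digests = []
    total_size = 0
    while True:
        # Buffered streams return a full chunk unless the end is reached.
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        part_digests.append(hashlib.md5(chunk).digest())
        total_size += len(chunk)
    combined = hashlib.md5(b"".join(part_digests)).hexdigest()
    return f"{combined}-{len(part_digests)}", total_size


def etag_based_key(stream, chunk_size):
    """Hypothetical key format: ETag plus chunk size plus total size."""
    etag, total_size = s3_multipart_etag(stream, chunk_size)
    return f"{etag}:{chunk_size}:{total_size}"


if __name__ == "__main__":
    data = b"a" * 10
    print(etag_based_key(io.BytesIO(data), chunk_size=4))
```

Because the chunk size is carried in the key, anyone holding the file can recompute and check the ETag part without knowing how the chunk size was originally chosen.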

We are also considering using multihash as a codec-independent, data-dependent, non-random (unlike UUID) "identity" key. For that purpose I would like to reserve a tag/codec in the table (named e.g. dandi-s3-etag). Would that be possible, or is it not advised/recommended?
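For readers unfamiliar with the format, here is a rough sketch of how a digest is wrapped as a multihash: an unsigned varint code, an unsigned varint digest length, then the digest bytes. The code value `0x300001` is an arbitrary placeholder, not an assigned entry for dandi-s3-etag, and using the 16-byte md5 part of the ETag as the digest is only an assumption about what such a codec might carry.

```python
import hashlib


def encode_varint(n):
    """Encode a non-negative integer as an unsigned LEB128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def to_multihash(code, digest):
    """Prefix a digest with its varint code and varint length."""
    return encode_varint(code) + encode_varint(len(digest)) + digest


# Placeholder only: NOT a code assigned in the multicodec table.
HYPOTHETICAL_S3_ETAG_CODE = 0x300001

if __name__ == "__main__":
    digest = hashlib.md5(b"example").digest()  # stand-in for the ETag md5
    print(to_multihash(HYPOTHETICAL_S3_ETAG_CODE, digest).hex())
```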

Rod Vagg · Answer 1 · Thu Mar 11 2021 13:07:58 GMT+0800 (China Standard Time)

I think if I'm understanding you correctly that you're essentially defining a hash algorithm here, even if it's kind of novel, is that right? There's no dictate that a multihash has to be a quality, one-way hash function (we have identity after all), but it does imply such a thing. It might depend more on how you're intending to use the multicodec value. Are you going to be making CIDs for these things, or something else?

Yaroslav Halchenko · Answer 2 · Thu Mar 11 2021 23:39:01 GMT+0800 (China Standard Time)

> I think if I'm understanding you correctly that you're essentially defining a hash algorithm here, even if it's kind of novel, is that right?

Yes, it could be said that we are "defining a hash algorithm" (even if it is just a specification on top of the AWS S3 ETag, which is in turn based on md5).

> There's no dictate that a multihash has to be a quality, one-way hash function (we have identity after all), but it does imply such a thing.

AFAIK its "quality" should be on par with that of md5 itself (which is not cryptographically secure at this point in human evolution).

> Are you going to be making CIDs for these things, or something else?

The primary target is having a self-describing multihash.

Rod Vagg · Answer 3 · Sat Mar 13 2021 08:05:45 GMT+0800 (China Standard Time)

I think this should probably be fine. If you want to open a PR, it can be discussed further there (reference this issue). Choose a higher number; if you can find a collection of roughly similar entries, then that's a bonus.

Yaroslav Halchenko · Answer 4 · Sat Mar 13 2021 08:53:05 GMT+0800 (China Standard Time)

Great, thank you! I will therefore consider this issue/question answered and close it. I will submit a PR if/when we decide to go the multihash route.