cboettig / contentid

:package: R package for working with Content Identifiers

Home Page: http://cboettig.github.io/contentid

Should contentid support register() and resolve() of 'cloud-native' addresses?

cboettig opened this issue

Currently, register() assumes the data can be accessed as either a local file or a (public) HTTP(S) address. (Note that register_tsv() can actually work with public FTP addresses, since curl handles both protocols directly, even though the hash archive registry (register_ha()) cannot.)
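For illustration, a minimal sketch of the current registration workflow (the URL below is purely hypothetical, not a real registered dataset):

```r
library(contentid)

# register a public URL; returns the content identifier (a hash URI)
# (this URL is illustrative only)
id <- register("https://example.com/data/observations.csv")
id
#> [1] "hash://sha256/..."
```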

Similarly, resolve() resolves an identifier to a local path; if it finds only remote URL sources, it automatically downloads one of those sources (optionally adding it to the local content store, otherwise leaving it in a temp file) and checks the hash. Arguably, we would like resolve() (and hence sources()) to return an s3:// address if a file has been 'registered' in an S3 bucket (including a public bucket). This would allow cloud-native applications (like arrow or duckdb, for instance) to query the dataset directly over S3, without ever attempting to download the whole thing.
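For example, the current behavior looks roughly like this (the identifier is hypothetical and truncated):

```r
library(contentid)

# a previously registered identifier (hypothetical, truncated here)
id <- "hash://sha256/9412325831dab22aeebdd674b6eb53ba..."

# resolve() returns a local path; if only remote URLs are known, it downloads
# the content, verifies the hash, and (with store = TRUE) caches it in the
# local content store
path <- resolve(id, store = TRUE)
df <- read.csv(path)
```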

Unfortunately, it's not clear how to implement this. Ideally, this would be built around an abstraction of a filesystem that encompassed curl-accessible URLs, local POSIX filesystems, and "cloud" filesystems. Such an abstraction class would have methods that could return the appropriate 'path' to an object, its content hash, etc. fsspec in Python may be close to this.

(This could be framed more generally as supporting alternative protocols beyond HTTP and POSIX filesystems, as illustrated by fsspec.)
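To make the idea concrete, here is a rough sketch of what such an abstraction might look like in R. These generics and classes are hypothetical; nothing like them exists in contentid today:

```r
# hypothetical filesystem abstraction: each backend knows how to produce a
# 'path' for an object and how to obtain its content hash
fs_path <- function(fs, key) UseMethod("fs_path")
fs_hash <- function(fs, key) UseMethod("fs_hash")

# local POSIX backend: paths are files on disk, hashes are computed locally
fs_path.posix_fs <- function(fs, key) file.path(fs$root, key)
fs_hash.posix_fs <- function(fs, key) contentid::content_id(fs_path(fs, key))

# s3 backend: paths are s3:// URIs; the hash might instead be requested from
# the S3 API (e.g. an md5 ETag) rather than computed locally
fs_path.s3_fs <- function(fs, key) paste0("s3://", fs$bucket, "/", key)
```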

Actually I think this is more practical than I had originally anticipated. A few further thoughts.

We can already accomplish such workflows with a bit of curation from sources() and https addresses, which can be used with public buckets or other web addresses. Tools like the httpfs extension in duckdb allow nice range-request queries against static tabular files (tsv, csv, parquet, etc.) without downloading the whole object; hence we can look up an object by its content identifier using sources(), extract an https URL, and run range-request subset queries against it without ever downloading the file. A helper function could potentially streamline this a bit (by guessing the 'best' URL from sources() with a few heuristics), as sketched below.
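A rough sketch of that workflow, assuming a csv registered at a public https source (the identifier is hypothetical, and taking the first https hit stands in for a smarter 'best URL' heuristic):

```r
library(contentid)
library(DBI)
library(duckdb)

id <- "hash://sha256/9412325831dab22aeebdd674b6eb53ba..."  # hypothetical

# look up candidate locations and take the first https source
srcs <- sources(id)
url <- srcs$source[grepl("^https://", srcs$source)][1]

# duckdb's httpfs extension issues range requests, so only the bytes
# needed to answer the query are fetched from the remote file
con <- dbConnect(duckdb())
dbExecute(con, "INSTALL httpfs; LOAD httpfs;")
res <- dbGetQuery(con, sprintf(
  "SELECT * FROM read_csv_auto('%s') LIMIT 10", url))
dbDisconnect(con, shutdown = TRUE)
```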

This could be extended in a few ways, such as supporting alias-like heuristic identifiers (e.g. hashes of the first/last 100 bytes, a la #86) as the identifier, though note that such heuristic ids are not required here.
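For instance, such a heuristic identifier over the first 100 bytes might look something like the following sketch (not part of contentid):

```r
# hypothetical heuristic id: sha256 of just the first 100 bytes of a file
con <- file("observations.csv", "rb")
head_bytes <- readBin(con, what = "raw", n = 100)
close(con)
digest::digest(head_bytes, algo = "sha256", serialize = FALSE)
```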

While the focus is probably on public data, so concentrating on HTTP protocols makes sense, it would be nice for this to work with S3 as well: to support authenticated access, globbing/partitions, and arrow (which doesn't yet have HTTP protocol support). Moreover, S3 systems can serve as a natural registry + store, in that we can (typically) request md5 sums from the S3 API, even though sha256 does not seem to be an option. contentid is now a bit more agnostic about the hash algorithm than it was initially. A more md5-focused approach that plays nicely with both Zenodo and S3 could be compelling.
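As a sketch of the S3 side: for single-part uploads, an object's ETag is typically its md5 hex digest. Assuming the aws.s3 package and hypothetical bucket/key names:

```r
library(aws.s3)

# HEAD request for object metadata (bucket and key are hypothetical)
obj <- head_object("observations.csv", bucket = "my-bucket")

# for single-part uploads the ETag is the md5 hex digest; multipart
# uploads get a composite ETag instead, so treat this as a heuristic
etag <- gsub('"', "", attr(obj, "etag"))
paste0("hash://md5/", etag)
```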

(Meanwhile, rate limiting on the sha256-based Software Heritage API makes it a less ideal option.)