w3c / rdf-canon

RDF Dataset Canonicalization (deliverable of the RCH working group)

Home Page:https://w3c.github.io/rdf-canon/spec/

Issues with parameterized hashing algorithms used internally

philarcher opened this issue

See Sebastian's email of 2023-09-12 and subsequent discussion

The thread has become fairly long on the mailing list; as, in the past, all the major discussions happened on the GitHub issues, would it be possible to move the discussion here?

Looking at the discussion on the thread starting with Sebastian's mail, I believe there is a consensus emerging whereby the rdf-canon spec should at least provide a way to generate not only a hash value for the canonical graph but also the metadata identifying its parameter(s). Although PA's mail referred to defining URLs that identify the various combinations of hash functions in use, I must admit I am closer to Dave Longley's option, namely:

I don't think it's a good idea to invent a new hash metadata expression mechanism in this group. These things exist elsewhere (such as multihash, or SRI, or RFC 6920)

Also, although in theory there is the option of using a different hash function for, on the one hand, the RDFC algorithm proper and for, on the other hand, the hashing of the canonical serialization, I am not sure there will be many cases when these two will be really different. And if they are, these can be handled by the specific applications or contexts, which may have their own means of indicating these details (see Manu's mail on the DI example). Those contexts, as Manu indicated, are not interested in all this.

My personal conclusion is that it is perfectly enough if the rdf-canon spec explicitly defines the canonical hash for a graph, defined as the hash of the canonical serialization computed with the same hash function that RDFC proper uses, and acknowledges that there are other applications out there that do not need this.

To be a bit more specific, what I propose to do is:

  1. Create a new section after the current §5 "Serialization" and before §6 "Privacy Consideration", entitled something like "Canonical hash". This section would:
    1. Define the notion of a canonical hash, defined as the result of hashing the serialization of the canonical graph resulting from RDFC 1.0, using the same hash function used in the canonicalization algorithm proper
    2. The canonical hash would be expressed using the SRI syntax, i.e., {hash function name}-{base64 encoding of the hashing result}. We can refer to the Integrity attribute from the Subresource Integrity Spec for the definition (and I presume the specref [[SRI]] should work for the respec reference).
    3. The section would also include an editorial note saying that a specific application may use a different combination of hash functions depending on the application's context; in that case the canonical hash could be ignored and the application may define its own way of expressing the hash value of the graph. However, this specification does not define how that would be expressed.
  2. We should extend the test suite in one of two ways:
    1. we add 2 tests that test the canonical hash using both sha256 and sha384; or
    2. we expand all tests to include, in the manifest, the expected canonical hash whose equality to the calculated canonical hash should also be checked for the test to pass
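
To make the proposal concrete, here is a minimal sketch of how a canonical hash in the SRI-style syntax could be computed. It assumes the canonical N-Quads serialization is already available as a string; the function name `canonical_hash`, the `SUPPORTED` table, and the sample quad are illustrative only and are not taken from the spec.

```python
import base64
import hashlib

# Hash functions the spec currently refers to, keyed by their SRI-style prefix.
SUPPORTED = {"sha256": hashlib.sha256, "sha384": hashlib.sha384}


def canonical_hash(canonical_nquads: str, hash_name: str = "sha256") -> str:
    """Return '{hash function name}-{base64 digest}' for a canonical N-Quads document.

    Assumption: `hash_name` is the same function that was used inside the
    canonicalization algorithm itself, as the proposal requires.
    """
    if hash_name not in SUPPORTED:
        raise ValueError(f"unsupported hash function: {hash_name}")
    digest = SUPPORTED[hash_name](canonical_nquads.encode("utf-8")).digest()
    return f"{hash_name}-{base64.b64encode(digest).decode('ascii')}"


# Hypothetical canonical serialization: a single quad with a canonical blank node label.
doc = '_:c14n0 <http://example.org/p> "v" .\n'
print(canonical_hash(doc))            # sha256-... (44 base64 characters)
print(canonical_hash(doc, "sha384"))  # sha384-... (64 base64 characters)
```

An application that uses a different combination of hash functions would simply not call this and express its hash in its own way, as the editorial note in item 1.3 suggests.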

I believe the use cases described by Sebastian would work with that approach without further ado by using the canonical hash of the graphs. On the other hand, the situation described by Manu could still be handled, possibly by bypassing the canonical hash value in favor of something else (if needed).

WDYT?


Why SRI and not multihash or the RFC? Here is my reasoning:

  • Multihash is not a published standard yet, and we would have to struggle through the W3C process for a normative reference. Besides, our spec refers, at this moment, only to sha256 and sha384, so multihash may be a sledgehammer for what we want.
  • RFC 6920 Naming Things with Hashes defines a URI using the hash, so one could have something like ni:///sha-256;DFVBHJ... to name a canonical graph. Which may be an interesting thing to do as well, but this goes, I believe, beyond what we are aiming at. (Let alone the fact that introducing a URL for an RDF Dataset, which is what this thing would mean, may raise all kinds of questions in RDF land that I am not sure we want to open here. Leave that to other groups.)
  • SRI is the simplest among the three listed. It is simple to generate through something like ${hashFunctionName}-${base64Hash} and just as easy to interpret. And it does the job for us.
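
As a rough illustration of the "easy to interpret" point, the sketch below parses a value in that syntax and checks it against a canonical serialization, which is roughly what a test runner would do if the manifest carried an expected canonical hash. The names `parse_canonical_hash`, `matches`, and `DIGEST_SIZES` are hypothetical, and restricting the prefix to sha256 and sha384 is an assumption based on the functions the spec mentions today.

```python
import base64
import hashlib
import hmac

# Expected digest length in bytes for each accepted prefix (assumed set).
DIGEST_SIZES = {"sha256": 32, "sha384": 48}


def parse_canonical_hash(value: str) -> tuple[str, bytes]:
    """Split '{name}-{base64}' into the hash function name and the raw digest."""
    name, sep, encoded = value.partition("-")
    if not sep or name not in DIGEST_SIZES:
        raise ValueError(f"unrecognized canonical hash: {value!r}")
    # validate=True rejects characters outside the standard base64 alphabet.
    digest = base64.b64decode(encoded, validate=True)
    if len(digest) != DIGEST_SIZES[name]:
        raise ValueError(f"wrong digest length for {name}")
    return name, digest


def matches(canonical_nquads: str, expected: str) -> bool:
    """Check a canonical serialization against an expected canonical hash."""
    name, digest = parse_canonical_hash(expected)
    actual = hashlib.new(name, canonical_nquads.encode("utf-8")).digest()
    return hmac.compare_digest(actual, digest)


# Hypothetical expected value, computed here only so the example is self-checking.
doc = '_:c14n0 <http://example.org/p> "v" .\n'
expected = "sha256-" + base64.b64encode(hashlib.sha256(doc.encode("utf-8")).digest()).decode("ascii")
print(matches(doc, expected))        # True
print(matches(doc + "\n", expected)) # False
```

Because the standard base64 alphabet contains no hyphen, splitting on the first `-` is enough to separate the function name from the digest.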

Just for reference in this thread, here's the resolution we had at TPAC about introducing a parameter for the hash function: https://www.w3.org/2023/09/11-rch-minutes.html#r01