w3c / rdf-canon

RDF Dataset Canonicalization (deliverable of the RCH working group)

Home Page: https://w3c.github.io/rdf-canon/spec/

N-Quads Serialization

gkellogg opened this issue · comments

In partial resolution to #4, the issue proposes to create a "Serialization" section which describes the N-Quads serialization of the Normalized Dataset by serializing each quad in the dataset using N-Quads Canonicalization with each quad output in code point order.

Informative text may use an intermediate step to describe the Normalized Dataset as an (Infra) array of tuples where each tuple is an ordered array made from the canonicalized terms of the quad, and the outermost array placed in code point order based on the concatenation of the terms in each tuple.

Informative text can also describe creating a hash from the result of the N-Quads serialization.

In partial resolution to #4, the issue proposes to create a "Serialization" section which describes the N-Quads serialization of the Normalized Dataset by serializing each quad in the dataset using N-Quads Canonicalization with each quad output in code point order.

+1. But we should also say something about sorting (or not).

To avoid over-complication, I would propose to say that the algorithm produces a sorted n-quads serialization. Sorting may be unnecessary for some applications, but those applications can safely ignore the fact that the output is sorted. The act of sorting is probably not significant, time-wise, compared to the rest of the algorithm, so it is not worth making it optional as an optimization.

Also: by "serialization" via n-quads, do we mean:

  1. A single string with each quad separated by a single \n character; or
  2. An array of individual canonical n-quads.

I would opt for the latter.
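To make the two candidate output shapes concrete, here is a minimal sketch; the variable names and quad data are illustrative, not from the spec:

```javascript
// Option 2: an array of individual canonical N-Quads (made-up data).
const canonicalQuads = [
  '_:c14n0 <http://example.org/p> "a" .',
  '_:c14n1 <http://example.org/p> "b" .'
];

// Option 1: a single string, one quad per line, each terminated by '\n'.
const serialized = canonicalQuads.map(q => q + '\n').join('');
console.log(serialized);
```

Returning the array (option 2) loses nothing, since a caller can always join it into the single-string form as shown.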

Informative text may use an intermediate step to describe the Normalized Dataset as an (Infra) array of tuples where each tuple is an ordered array made from the canonicalized terms of the quad, and the outermost array placed in code point order based on the concatenation of the terms in each tuple.

+1

Informative text can also describe creating a hash from the result of the N-Quads serialization.

Don't we have to normatively say that hashing, for the sake of this deliverable, uses the same algorithm that is used in the spec elsewhere, ie, SHA-256? That would be a normative statement.

This is all the more important because, at some point, we said that we may want to "parametrize" the hash algorithm used, and we would probably say that the same hash is used at this point, too. Otherwise we would introduce a possible source of errors.

(Because we would return the n-quads separately anyway, a user may choose a different hashing algorithm. That is fine; it is a deliberate choice made for good reasons. But the 'standard' hashing can still be counted on to be SHA-256.)

To avoid over-complication, I would propose to say that the algorithm produces a sorted n-quads serialization.

I agree. I think it's better to have the algorithm perform the sorting for a couple of reasons:

  1. Sorting by code point order may not be the default sorting behavior in a given language (it isn't in JavaScript, for example), so it's helpful to have that done automatically.
  2. It's starting to become clear over the last week, for selective disclosure use cases, that we should allow an optional output of a mapping of the original input quad order indices to the new sorted quad indices. I understand we've discussed how a mapping of input blank node labels to output blank node labels might be challenging / not possible, but this seems like it could be done fairly easily.
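On point 1: JavaScript's default `Array.prototype.sort` compares strings by UTF-16 code unit, which disagrees with code point order for characters outside the Basic Multilingual Plane. A hypothetical comparator (not from the spec) that sorts by code point might look like:

```javascript
// Compare two strings by Unicode code point. This differs from
// JavaScript's default UTF-16 code-unit comparison: surrogate pairs
// (code units 0xD800-0xDFFF) sort *before* BMP characters in the
// 0xE000-0xFFFF range, even though their code points are larger.
function codePointCompare(a, b) {
  const as = Array.from(a); // splits into code points, not code units
  const bs = Array.from(b);
  const len = Math.min(as.length, bs.length);
  for (let i = 0; i < len; i++) {
    const ca = as[i].codePointAt(0);
    const cb = bs[i].codePointAt(0);
    if (ca !== cb) return ca - cb;
  }
  return as.length - bs.length;
}

// U+FFFD precedes U+1F600 in code point order, but the default sort
// reverses them because U+1F600's first code unit is 0xD83D < 0xFFFD.
console.log(['\u{1F600}', '\uFFFD'].sort(codePointCompare)); // ['\uFFFD', '\u{1F600}']
console.log(['\u{1F600}', '\uFFFD'].sort());                 // ['\u{1F600}', '\uFFFD']
```

This illustrates why it helps to have the algorithm itself specify and perform the code point sort rather than leave it to each implementation's defaults.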

@iherman,

Informative text can also describe creating a hash from the result of the N-Quads serialization.

Don't we have to normatively say that hashing, for the sake of this deliverable, uses the same algorithm that is used in the spec elsewhere, ie, SHA-256? That would be a normative statement.

I think we should have the spec parameterize the hash function used internally (such that we only talk about "running the hash function" wherever it is used) and that we should normatively state that the hash function used internally is SHA-256 for this version of the algorithm. I think both of these things are more or less already done -- but maybe there's a tweak to improve this here or there.

I agree that informative text can be used to describe creating a hash on the N-Quads output result.

To avoid over-complication, I would propose to say that the algorithm produces a sorted n-quads serialization.

I agree. I think it's better to have the algorithm perform the sorting for a couple of reasons:

  1. Sorting by code point order may not be the default sorting behavior in a given language (it isn't in JavaScript, for example), so it's helpful to have that done automatically.

When I said "using N-Quads Canonicalization with each quad output in code point order", that's what I meant. The text will likely be more algorithmic, as in other sections, since what's meant by the array can be confusing given the recursive nature of the description.

  2. It's starting to become clear over the last week, for selective disclosure use cases, that we should allow an optional output of a mapping of the original input quad order indices to the new sorted quad indices. I understand we've discussed how a mapping of input blank node labels to output blank node labels might be challenging / not possible, but this seems like it could be done fairly easily.

I think the use of the output blank nodes as represented in the normalized dataset should be straightforward. The described internal representation would be an array where each element is a tuple composed of the terms from the quad and each entry in this array ordered in code point order of the concatenated serialized terms of the tuple. Each term is serialized as described in N-Quads Canonicalization.
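A sketch of that internal representation, with made-up data; note the comment on string comparison, since JavaScript's `<` compares UTF-16 code units rather than code points:

```javascript
// Each element is a tuple of the quad's canonicalized terms; the
// outer array is ordered by the concatenation of each tuple's
// serialized terms (illustrative data only).
const tuples = [
  ['_:c14n1', '<http://example.org/p>', '"b"', ''],
  ['_:c14n0', '<http://example.org/p>', '"a"', '']
];
const ordered = [...tuples].sort((t1, t2) => {
  // NOTE: '<' on JS strings compares UTF-16 code units; a full
  // implementation would compare by Unicode code point instead.
  const a = t1.join(' '), b = t2.join(' ');
  return a < b ? -1 : a > b ? 1 : 0;
});
console.log(ordered[0][0]); // '_:c14n0'
```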

@iherman,

Informative text can also describe creating a hash from the result of the N-Quads serialization.

Don't we have to normatively say that hashing, for the sake of this deliverable, uses the same algorithm that is used in the spec elsewhere, ie, SHA-256? That would be a normative statement.

I think we should have the spec parameterize the hash function used internally (such that we only talk about "running the hash function" wherever it is used) and that we should normatively state that the hash function used internally is SHA-256 for this version of the algorithm. I think both of these things are more or less already done -- but maybe there's a tweak to improve this here or there.

The spec currently uses the term hash algorithm, which defines that it uses SHA-256, and is localized to URDNA2015. I'd use the same term when informatively describing how to create a hash for the resulting N-Quads document. Do you think this requires some further parameterization?

@gkellogg,

I think the use of the output blank nodes as represented in the normalized dataset should be straightforward. The described internal representation would be an array where each element is a tuple composed of the terms from the quad and each entry in this array ordered in code point order of the concatenated serialized terms of the tuple. Each term is serialized as described in N-Quads Canonicalization.

Sure -- but I think there's an additional need for a mapping of the indices of the input quads to their positions in the output. Perhaps I'm misunderstanding you.

To give some background on the use case I'm talking about, this kind of mapping is needed to perform matching for selective disclosure use cases. As an example of how those use cases work:

Suppose you have T total quads. These quads are canonized, producing a code-point-ordered list of N-Quads with canonical bnode labels. Each one of the N-Quads is hashed and signed using cryptography that supports selective disclosure.

Later, some number of these quads, D, where D < T, are to be disclosed. These D quads have blank nodes in them and are to be disclosed using some syntax that does not preserve blank node labels. When these D quads are canonized, the labels produced will not match what was signed. It must be possible to match the input D quads to the output canonized quads so that the discloser can build and include a mapping from the new canonical bnode labels to the original (and signed) canonical bnode labels so that signature verification will pass.

Enabling the caller of the canonicalization algorithm to request a mapping from original quad indices to newly canonized and sorted quad indices (in the output) enables the above.
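A hypothetical sketch (not part of the spec) of that optional output, mapping each input quad's original index to its position in the sorted result; the function name and data are illustrative:

```javascript
// Sort canonical N-Quads and also return a mapping from each quad's
// original input index to its index in the sorted output.
function sortWithIndexMap(canonicalQuads) {
  const indexed = canonicalQuads.map((quad, inputIndex) => ({ quad, inputIndex }));
  indexed.sort((a, b) => (a.quad < b.quad ? -1 : a.quad > b.quad ? 1 : 0));
  const inputToOutput = new Array(canonicalQuads.length);
  indexed.forEach(({ inputIndex }, outputIndex) => {
    inputToOutput[inputIndex] = outputIndex;
  });
  return { sorted: indexed.map(({ quad }) => quad), inputToOutput };
}

const { sorted, inputToOutput } = sortWithIndexMap([
  '_:c14n1 <http://example.org/p> "b" .',
  '_:c14n0 <http://example.org/p> "a" .'
]);
console.log(inputToOutput); // [1, 0]: the first input quad sorts second
```

The mapping is cheap to produce as a by-product of the sort, which supports the point that this could be done fairly easily.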

The spec currently uses the term hash algorithm, which defines that it uses SHA-256, and is localized to URDNA2015. I'd use the same term when informatively describing how to create a hash for the resulting N-Quads document. Do you think this requires some further parameterization?

I don't think we need to further parameterize the algorithm itself, but we should not assume that the same hash algorithm used in the canonicalization algorithm will be used when hashing the N-Quads. In fact, I think that could be highly problematic. We should say that URDNA2015 only uses SHA-256 internally. It's mainly important that we keep the parameterization in case we need to do a new version in the future. Using something else is non-standard.

We do want to say, informally, that another hash function can be used on the output. We should highlight that it can be any other hash function and doesn't need to be the same one. And if another is used, that does not change what was used internally in URDNA2015. So these are different constructs and it's important to keep them separated.

So, I think we're getting off topic: a Serialization section is not the place to discuss in detail what a structure supporting selective disclosure would look like, and this should probably be discussed in its own issue. The key thing for us is to describe the terminology that would enable selective disclosure mechanisms to be defined elsewhere.

If we define a quad list ordered based on the canonical n-quads form of each quad in the normalized dataset, where each element is a list composed of the canonicalized terms of its corresponding quad, this provides a stable structure in which the index of any item represents a quad that could be referenced from D. I don't think we need to define a mapping from each canonicalized blank node identifier to the indices of entries in quad list which use it, although that might be done by another specification.
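A hypothetical sketch of that second mapping, which another specification might define on top of the quad list; the data and names are made up:

```javascript
// Map each canonical blank node identifier to the indices of the
// quad list entries that use it (illustrative data only).
const quadList = [
  ['_:c14n0', '<http://example.org/p>', '_:c14n1', ''],
  ['_:c14n1', '<http://example.org/q>', '"x"', '']
];
const bnodeToQuadIndices = new Map();
quadList.forEach((terms, index) => {
  for (const term of terms) {
    if (!term.startsWith('_:')) continue;
    if (!bnodeToQuadIndices.has(term)) bnodeToQuadIndices.set(term, []);
    const indices = bnodeToQuadIndices.get(term);
    if (!indices.includes(index)) indices.push(index);
  }
});
console.log(bnodeToQuadIndices.get('_:c14n1')); // [0, 1]
```

Because the quad list is stable once sorted, this mapping can be derived by any consumer and need not be part of the canonicalization output itself.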

Enabling the caller of the canonicalization algorithm to request a mapping from original quad indices to newly canonized and sorted quad indices (in the output) enables the above.

Just to be clear on what we're talking about: the input dataset is unordered, and no blank node labels are persistent or possibly even present. The only order we can describe is for the quads of the normalized dataset. Any subset of these, where the blank node identifiers may be regenerated, is outside the scope of this specification.

Best to do a PR on serialization and either discuss changes to the PR, or open a new issue with a follow-on PR, to make sure we enable the selective disclosure use cases.

@gkellogg, I moved the subdiscussion over here: #89

The spec currently uses the term hash algorithm, which defines that it uses SHA-256, and is localized to URDNA2015. I'd use the same term when informatively describing how to create a hash for the resulting N-Quads document. Do you think this requires some further parameterization?

That should be enough.