w3c-ccg / rdf-dataset-canonicalization

RDF Dataset Canonicalization

Home Page: https://www.w3.org/TR/rdf-canon/

Serialization

gkellogg opened this issue · comments

The spec probably needs a section on the serialization format and an IANA section. We talk about quads, and the presumption is that the output is N-Quads in canonical form, but it probably deserves its own media sub-type. Perhaps application/canonicalized+n-quads would be an appropriate sub-type.

Agreed. But I'd like to make another plea for the term "canonical" instead of "normalized" before defining the media sub-type, for the reasons described in issue #2.

+1, @dlongley and I are opposed to use of 'normalize' as it's led to a number of problems when describing these specs to people. "Canonical" and "Canonicalized" are the more accurate computer-sciencey terms.

So, +1 to application/canonicalized+n-quads

So edited.

I was just looking into some issues with this in our implementations.

  • spec: Not sure if it existed before, but we should refer to https://www.w3.org/TR/n-quads/. Or at least I think so. I know one of the other JSON-LD specs referred to the Turtle spec since that's all that existed at the time.
  • parsing: Based on https://www.w3.org/TR/n-quads/#sec-parsing, the N-Quads parser needs to do an unescaping step to turn literals into Unicode strings.
  • literal canonicalization: https://www.w3.org/TR/n-quads/#sec-grammar only lists 4 characters that are required to be escaped: [#x22#x5C#xA#xD], that is, double quote, backslash, line feed, and carriage return. Despite the oddness of binary in N-Quads, I suppose that means a canonicalization step could just escape those, and only those, characters. (Nulls are ok in literals, which seems like asking for trouble!) We'd have to say which escapes are used. Probably the ECHAR short sequences, though UCHAR ones would technically work too.
  • tests: We have no tests for this behavior. I started to add some, but I'm not sure what the output format is, so it's hard to write them! We might need some suggested code-based tests for the edge cases that are difficult to write in .nq files, like crazy nulls and backspaces and such.
  • performance: Doing some of this escaping/unescaping could be a performance issue. Implementations may need flags to avoid extra work when it's known to not be needed.
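To make the "escape those, and only those" idea concrete, here is a minimal Python sketch. It assumes the ECHAR short forms are the ones chosen, which is only the guess made in the list above, and the names escape_literal and serialize_literal are hypothetical helpers, not anything taken from the spec or an existing implementation.

```python
# Hypothetical helpers: the function names and the choice of ECHAR short
# forms are assumptions for illustration, not text from the spec.

# The four characters the thread identifies as requiring escaping in a
# quoted N-Quads literal, mapped to their ECHAR short sequences.
_REQUIRED_ESCAPES = {
    '\\': '\\\\',  # U+005C backslash
    '"': '\\"',    # U+0022 double quote
    '\n': '\\n',   # U+000A line feed
    '\r': '\\r',   # U+000D carriage return
}


def escape_literal(value: str) -> str:
    """Escape only the four required characters; leave everything else as-is."""
    return ''.join(_REQUIRED_ESCAPES.get(ch, ch) for ch in value)


def serialize_literal(value: str) -> str:
    """Wrap an escaped lexical form in double quotes (no datatype or language tag)."""
    return '"' + escape_literal(value) + '"'
```

Under this sketch, tabs, backspaces, and even nulls pass through unescaped, which is exactly the edge-case behavior the tests bullet says is hard to express in .nq files.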

@gkellogg I looked at your ruby code and it was hard to tell which code in particular would be used for this. I think it is in the ntriples writer? The escaping code there is quite involved. That seems like a good idea for readable non-canonical output. But in this case I was thinking maybe the easiest route is to only escape the 4 required chars in the n-quads spec.

Yes, the writer is what does URI and literal escaping. It was escaping too much.

I found a reader unescaping issue for \', and there seems to be something remaining for the \\u0039 test.
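For anyone else chasing reader bugs like these, a rough Python sketch of a single-pass unescaper follows. The name unescape_literal is hypothetical, and this is not the Ruby reader's code. It assumes the input is the text between the quotes of a literal and that it only contains the ECHAR and UCHAR forms from the N-Quads grammar. Decoding in one left-to-right pass is a design choice here: an escaped backslash followed by u0039 stays a backslash plus "u0039" instead of being decoded a second time.

```python
import re

# ECHAR short escapes from the N-Quads grammar, keyed by the character
# that follows the backslash.
_ECHAR_DECODE = {
    't': '\t', 'b': '\b', 'n': '\n', 'r': '\r', 'f': '\f',
    '"': '"', "'": "'", '\\': '\\',
}

# One pattern covering \uXXXX, \UXXXXXXXX, and the ECHAR short forms, so
# that every escape is consumed exactly once, left to right.
_ESCAPE_RE = re.compile(
    r'\\(?:u([0-9A-Fa-f]{4})|U([0-9A-Fa-f]{8})|([tbnrf"\'\\]))'
)


def unescape_literal(text: str) -> str:
    """Decode ECHAR and UCHAR escapes in a single pass (hypothetical helper)."""
    def decode(match: re.Match) -> str:
        hex_digits = match.group(1) or match.group(2)
        if hex_digits:
            return chr(int(hex_digits, 16))
        return _ECHAR_DECODE[match.group(3)]

    return _ESCAPE_RE.sub(decode, text)
```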

Easiest is to always output canonical form; it could optionally do more escaping, but it's obviously not really necessary.

Is the "canonical form" defined somewhere? I figured just escaping minimum required chars was a reasonable guess at such a form.

N-Quads canonical form comes from N-Triples canonical form. This is what must be produced in output for consistent hashing.

Perfect. Looks like I guessed the right format. ;-)

We also need to specify that canonical N-Quads must be lexicographically sorted so that, for example, any two pieces of software that hash a concrete serialization of the canonicalized dataset will produce the same hash.
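A small Python sketch of why the ordering matters for hashing. The function name hash_canonical_nquads is hypothetical, SHA-256 is an assumed hash choice, and "lexicographic" is read here as Unicode code point order, which is how Python compares strings. Each input string is assumed to be one already-canonical N-Quads statement without its trailing newline.

```python
import hashlib
from typing import Iterable


def hash_canonical_nquads(quads: Iterable[str]) -> str:
    """Sort canonical N-Quads lines, join them, and hash the result.

    Hypothetical helper: SHA-256 and the one-statement-per-line layout
    with a trailing newline are assumptions for illustration.
    """
    # A dataset is a set of quads, so duplicates are dropped before sorting.
    lines = sorted(set(quads))  # Python sorts str by code point
    document = ''.join(line + '\n' for line in lines)
    return hashlib.sha256(document.encode('utf-8')).hexdigest()


# The same two quads, supplied in different orders by two implementations.
first = [
    '<http://example.org/s> <http://example.org/p> "b" .',
    '<http://example.org/s> <http://example.org/p> "a" .',
]
second = list(reversed(first))
same_hash = hash_canonical_nquads(first) == hash_canonical_nquads(second)
```

Without the sort, the two inputs above would serialize to different byte sequences and therefore hash differently, even though they describe the same dataset.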

See: https://github.com/w3c-dvcg/ld-signatures/issues/28