Serialization

Question

Serialization

gkellogg opened this issue 7 years ago · comments

The spec probably needs a section on the serialization format and an IANA section. We talk about quads, and the presumption is that the output is N-Quads in canonical form, but it probably deserves it's own media sub-type. Perhaps application/canoncalized+n-quads would be an appropriate sub-type..

David Booth · Answer 1 · Sat Jun 24 2017 00:44:35 GMT+0800 (China Standard Time)

Agreed. But I'd like to make another plea for the term "canonical" instead of "normalized" before defining the media sub-type, for the reasons described in issue #2 .

Manu Sporny · Answer 2 · Sat Jun 24 2017 01:09:48 GMT+0800 (China Standard Time)

+1, @dlongley and I are opposed to use of 'normalize' as it's led to a number of problems when describing these specs to people. "Canonical" and "Canonicalized" is the more accurate computer-sciencey term.

So, +1 to application/canonicalized+n-quads

Gregg Kellogg · Answer 3 · Sat Jun 24 2017 01:17:31 GMT+0800 (China Standard Time)

Se edited.

David I. Lehn · Answer 4 · Sat Jan 19 2019 05:38:27 GMT+0800 (China Standard Time)

I was just looking into some issues with this in our implementations.

spec: Not sure if it existed before, but we should refer to https://www.w3.org/TR/n-quads/. Or at least I think so. I know one of the other json-ld specs referred to turtle spec since that's all that existed at the time.
parsing: Based on https://www.w3.org/TR/n-quads/#sec-parsing, the n-quads parser needs to do a unescaping step for literals into unicode strings.
literal canonicalization: https://www.w3.org/TR/n-quads/#sec-grammar only lists 4 values that are required to be escaped: [#x22#x5C#xA#xD]. Despite the oddness of binary in n-quads, I suppose that means a canonicalization step could just escape those, and only those, values. (Nulls are ok in literals which seems like asking for trouble!) We'd have to say which escapes are used. Probably the ECHAR short sequences, though UCHAR ones would technically work too.
tests: We have no tests for this behavior. I started to add some but I'm not sure what the output format is so hard to write them! Might need to have some suggested code based tests for the edge cases that are difficult to write in .nq files like crazy nulls and backspaces and such.
performance: Doing some of this escaping/unescaping could be a performance issue. Implementations may need flags to avoid extra work when it's known to not be needed.

David I. Lehn · Answer 5 · Sun Jan 20 2019 08:52:32 GMT+0800 (China Standard Time)

@gkellogg I looked at your ruby code and it was hard to tell which code in particular would be used for this. I think it is in the ntriples writer? The escaping code there is quite involved. That seems like a good idea for readable non-canonical output. But in this case I was thinking maybe the easiest route is to only escape the 4 required chars in the n-quads spec.

Gregg Kellogg · Answer 6 · Sun Jan 20 2019 10:44:17 GMT+0800 (China Standard Time)

Yes, the writer is what does URI and literal escaping. It was escaping too much.

I found a reader unescaping issue for \', and there seems to be something remaining for the \\u0039 test.

Easiest is to always output canonical form; It could optionally do more escaping, but it's obviously not really necessary.

David I. Lehn · Answer 7 · Mon Jan 21 2019 05:11:55 GMT+0800 (China Standard Time)

Is the "canonical form" defined somewhere? I figured just escaping minimum required chars was a reasonable guess at such a form.

Gregg Kellogg · Answer 8 · Mon Jan 21 2019 07:43:24 GMT+0800 (China Standard Time)

N-Quads canonical form comes from N-Triples canonical form. This is what must be produced in output for consistent hashing.

David I. Lehn · Answer 9 · Mon Jan 21 2019 08:06:43 GMT+0800 (China Standard Time)

Perfect. Looks like I guessed the right format. ;-)

Dave Longley · Answer 10 · Tue Nov 26 2019 23:01:06 GMT+0800 (China Standard Time)

We also need to specify that canonical N-Quads must be lexicographically sorted so that, for example, any two pieces of software that hash a concrete serialization of the canonicalized dataset will produce the same hash.

See: https://github.com/w3c-dvcg/ld-signatures/issues/28