w3c / rdf-canon

RDF Dataset Canonicalization (deliverable of the RCH working group)

Home Page: https://w3c.github.io/rdf-canon/spec/

ambiguity about canonical N-Triples / N-Quads

pchampin opened this issue · comments

The specification of canonical N-Triples is silent about the datatype of xsd:string literals. More specifically:

    "hello world"

and

    "hello world"^^<http://www.w3.org/2001/XMLSchema#string>

are equivalent terms in N-Triples and N-Quads, and the spec does not say which one should be used as the canonical form.

Given that this is lacking from the N-Triples spec, the rdf-canon spec should choose one and be explicit about it.

This should also be fed to the rdf-star WG, who can also update the N-Triples and N-Quads specs accordingly.

Other than for Canonicalization, RDF serialization formats are typically restricted to parsing, not serializing; JSON-LD being the main exception.

RDF Concepts discusses this with MAY language:

Please note that concrete syntaxes may support simple literals consisting of only a lexical form without any datatype IRI or language tag. Simple literals are syntactic sugar for abstract syntax literals with the datatype IRI http://www.w3.org/2001/XMLSchema#string. Similarly, most concrete syntaxes represent language-tagged strings without the datatype IRI because it always equals http://www.w3.org/1999/02/22-rdf-syntax-ns#langString.

Making this a MUST for canonical forms is indeed something that needs to go into the updated N-Triples and N-Quads specs in their canonicalization sections. Similarly, rdf:langString MUST NOT be used for language-tagged literals, although the grammar doesn't support this in any case.

This, and the previous note on the need for Canonicalization in N-Triples, should be in cross-referenced issues for those specs, but it is best to wait until after their repositories have been set up, which should happen before too much longer.
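To make the equivalence of the two literal forms concrete, here is a minimal Python sketch that rewrites an explicitly typed xsd:string literal to the simple form. It assumes the simple form (no datatype IRI) is the one picked as canonical, as the comment above suggests; the helper name `to_simple_literal` and the regular expression are hypothetical, not taken from any spec or library, and language-tagged literals are not handled.

```python
import re

XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"

# Hypothetical pattern: a quoted lexical form (with backslash escapes)
# followed by an explicit ^^<...#string> datatype IRI.
_TYPED_STRING = re.compile(
    r'^("(?:[^"\\]|\\.)*")\^\^<' + re.escape(XSD_STRING) + r'>$'
)

def to_simple_literal(term: str) -> str:
    """Drop an explicit xsd:string datatype from an N-Triples literal term.

    Assumption: the simple form is the canonical one. Any other term
    (other datatypes, IRIs, blank nodes) is returned unchanged.
    """
    match = _TYPED_STRING.match(term)
    return match.group(1) if match else term

typed = '"hello world"^^<http://www.w3.org/2001/XMLSchema#string>'
simple = '"hello world"'

# Both spellings collapse to the same term under this assumption.
assert to_simple_literal(typed) == to_simple_literal(simple) == simple
```

Literals with any other datatype pass through untouched, so the helper only affects the one case the issue is about.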

[@gkellogg] RDF serialization formats are typically restricted to parsing, not serializing

I'm not at all sure what you mean by that... "serialization formats" are not for "serializing"?

Does sound like an oxymoron :) But there are typically no normative statements on how to serialize RDF graphs or datasets, other than for N-Triples canonical form, which has its own problems and restricts itself to serializing a single triple, not a graph. The specs describe the syntax and how to parse it, but not how to serialize it. Another exception is JSON-LD, which _does_ describe how to serialize datasets to JSON-LD.

there are typically no normative statements on how to serialize RDF graphs or datasets

Well, that seems like a horrendous oversight and, dare I say, a bug in each document with such lack. It's no wonder there are nonstop issues with interop and uptake, slowly growing interest in RDF/LD notwithstanding!

Well, that seems like a horrendous oversight

Well, the implicit contract of any serializer is to serialize your data to whatever parses back to the same data.

But granted, this could be made explicit, probably with a more specific definition of what we consider to be the "same" data (in RDF, this means "isomorphism", because blank nodes... well, you know!).

Well, that seems like a horrendous oversight

I don't think the state of RDF uptake can be blamed on the lack of specs that explicitly define how to serialize an RDF Graph/Dataset, nor should it be, IMHO. At most, there might be a statement that serialized graph/dataset representations MUST be valid according to the associated grammar rules. If you think in terms of computer languages, the abstract RDF syntax is closer to a machine language, with N-Triples and N-Quads like assembly languages, and Turtle/TriG/RDFa/JSON-LD like high-level languages targeting that machine language. An argument can be made that there is a normative way to represent the abstract syntax in N-Triples and N-Quads (notwithstanding Blank Node identifiers), but not for the others. JSON-LD provides _a_ way to transform a dataset into JSON-LD, but not _the_ way to do so.

Looking elsewhere, SPARQL describes an algebra that is targeted by the syntax. There are systems that will re-serialize the algebra into the SPARQL Grammar, but no normative statements about doing so.

We provide a number of examples for representing data in the various concrete syntaxes, and define how to parse those representations to transform them into the underlying representation. Trying to codify how to re-create that serialization from the underlying representation is certainly outside our charter, and not something we should get into in any case, IMHO.

But granted, this could be made explicit, probably with a more specific definition of what we consider to be the "same" data (in RDF, this means "isomorphism", because blank nodes... well, you know!).

We do define graph/dataset isomorphism. Conceivably, a statement could be made that a serialization of a graph or dataset, when re-parsed, MUST be isomorphic to that graph or dataset.
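As an illustration of what such a round-trip requirement would check, here is a minimal brute-force sketch in Python. It assumes triples are plain tuples of term strings and that blank nodes are the terms starting with `_:`; the function names are hypothetical. It tries every bijection between blank node labels, so it is only usable on very small graphs and is not the rdf-canon algorithm.

```python
from itertools import permutations

def is_blank(term: str) -> bool:
    # Assumption: blank nodes are written with the N-Triples "_:" prefix.
    return term.startswith("_:")

def blank_nodes(graph: set) -> list:
    """Sorted list of the distinct blank node labels used in a graph."""
    return sorted({t for triple in graph for t in triple if is_blank(t)})

def isomorphic(g1: set, g2: set) -> bool:
    """True if some bijection of blank node labels maps g1 exactly onto g2."""
    if len(g1) != len(g2):
        return False
    b1, b2 = blank_nodes(g1), blank_nodes(g2)
    if len(b1) != len(b2):
        return False
    # Try every one-to-one relabeling of g1's blank nodes with g2's labels.
    for perm in permutations(b2):
        mapping = dict(zip(b1, perm))
        relabeled = {tuple(mapping.get(t, t) for t in triple) for triple in g1}
        if relabeled == g2:
            return True
    return False

original = {
    ("_:a", "<http://example.org/p>", '"hello world"'),
    ("_:a", "<http://example.org/q>", "_:b"),
}
# Same data after a hypothetical serialize/parse round trip:
# different blank node labels, same structure.
reparsed = {
    ("_:x", "<http://example.org/q>", "_:y"),
    ("_:x", "<http://example.org/p>", '"hello world"'),
}
assert isomorphic(original, reparsed)
```

Note that this comparison is on parsed terms, so it is insensitive to whether the serializer wrote the literal with or without an explicit xsd:string datatype; that choice only matters for the canonical text form.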

Has this been solved by merging #96 ?

Yes, I believe it has.

On the 10 May 2023 call, the WG decided to close this issue.