w3c / rdf-canon

RDF Dataset Canonicalization (deliverable of the RCH working group)

Home Page: https://w3c.github.io/rdf-canon/spec/


What is the way to go with duplicate triples?

iherman opened this issue · comments

This came up in an issue raised against my implementation (iherman/rdfjs-c14n#10). The question boils down to the following.

Say the input is the following:

@prefix ex: <http://example.com/#> .
ex:one ex:two _:a .
ex:one ex:two _:a .               

This is definitely valid Turtle. When I run it through my canonicalization, the canonical quads are:

<http://example.com/#one> <http://example.com/#two> _:c14n0 .
<http://example.com/#one> <http://example.com/#two> _:c14n0 .

Is this the correct output? Or should it be

<http://example.com/#one> <http://example.com/#two> _:c14n0 .

Clearly, the resulting hashes will be different, so we need a clear answer.
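
To make the difference concrete, here is a minimal sketch (TypeScript, using Node's built-in crypto module; hashCanonicalNQuads is my own name, not part of the spec or any implementation) of one common way to derive a dataset hash from the canonical N-Quads serialization. The duplicate line changes the digest:

import { createHash } from 'node:crypto';

// Hypothetical helper: derive a dataset hash from canonical N-Quads
// lines by sorting, concatenating, and hashing the serialization.
function hashCanonicalNQuads(lines: string[]): string {
  return createHash('sha256').update(lines.slice().sort().join('')).digest('hex');
}

const withDuplicate = [
  '<http://example.com/#one> <http://example.com/#two> _:c14n0 .\n',
  '<http://example.com/#one> <http://example.com/#two> _:c14n0 .\n',
];
const withoutDuplicate = withDuplicate.slice(0, 1);

// Different serializations, therefore different digests:
console.log(hashCanonicalNQuads(withDuplicate) !== hashCanonicalNQuads(withoutDuplicate)); // true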

I am tempted to say that the output is correct. Otherwise, we would have to do some preprocessing on the N-Quads that the algorithm receives to filter out duplicates, but that does not seem like something that should be part of RDFC...
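
Such a preprocessing step, if we wanted one, would amount to something like the sketch below (TypeScript against the RDF/JS data model, which is what rdfjs-c14n consumes; toQuadSet is a hypothetical name, not an existing API):

import type * as RDF from '@rdfjs/types';

// Hypothetical preprocessing: collapse duplicates from an input quad
// array using Quad.equals() from the RDF/JS data model. Quadratic,
// but enough to show what "filter out duplicates" would mean here.
function toQuadSet(quads: RDF.Quad[]): RDF.Quad[] {
  const result: RDF.Quad[] = [];
  for (const quad of quads) {
    if (!result.some((seen) => seen.equals(quad))) {
      result.push(quad);
    }
  }
  return result;
}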

cc @jeswr @pchampin @dlongley @gkellogg

I have tried a slight modification as follows:

Input:

@prefix ex: <http://example.com/#> .
ex:one ex:two _:a .
ex:one ex:two _:b . 

The output is

<http://example.com/#one> <http://example.com/#two> _:c14n0 .
<http://example.com/#one> <http://example.com/#two> _:c14n1 .

which, again, looks correct to me...

My 2 cents:

We should be respecting set semantics

In particular

@prefix ex: <http://example.com/#> .
ex:one ex:two ex:three .
ex:one ex:two ex:three . 

Describes exactly the same dataset as

@prefix ex: <http://example.com/#> .
ex:one ex:two ex:three . 

and therefore should receive the same hash.

Equally

@prefix ex: <http://example.com/#> .
ex:one ex:two _:a .
ex:one ex:two _:a .   

Describes the same dataset as

@prefix ex: <http://example.com/#> .
ex:one ex:two _:a .

and therefore should receive the same hash.

On the other hand, #191 (comment) describes two distinct triples, and therefore that example should remain as written in that comment.

@jeswr I sympathize with your opinion, but the question is whether it is the responsibility of the RDFC algorithm to enforce set semantics or not. That is not clear-cut (and definitely not stated explicitly in the specification, so we may end up needing to add something to it). And we would also need a test case (unless there is one already, @gkellogg?)

FWIW, I agree with you on #191 (comment).

Many RDF parsers will emit a stream of Triples/Quads, which can include duplicates. Once those go into a Dataset, duplicates should go away because of the semantics of datasets. In the case of the Ruby implementation, the default Dataset implementation uses a three-level hash structure, so duplicates are impossible if they hash to the same value.
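
A rough TypeScript analogue of that three-level idea (a sketch only, not the Ruby code; the names are mine): nested maps keyed by subject, predicate, and object make a second insertion of the same triple a no-op.

// Sketch of a three-level index: subject -> predicate -> set of objects.
// Re-inserting an existing triple leaves the structure unchanged, so
// duplicates cannot exist by construction.
type TripleIndex = Map<string, Map<string, Set<string>>>;

function insert(index: TripleIndex, s: string, p: string, o: string): void {
  let byPredicate = index.get(s);
  if (!byPredicate) index.set(s, (byPredicate = new Map()));
  let objects = byPredicate.get(p);
  if (!objects) byPredicate.set(p, (objects = new Set()));
  objects.add(o); // set semantics: adding the same object twice is a no-op
}

const index: TripleIndex = new Map();
insert(index, 'ex:one', 'ex:two', '_:a');
insert(index, 'ex:one', 'ex:two', '_:a');
console.log(index.get('ex:one')?.get('ex:two')?.size); // 1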

The RDF Canonicalization algorithm specifically operates over an input dataset, which being an abstract RDF dataset can contain no duplicates. I don't think the spec needs to say anything more.

I am not sure I agree, @gkellogg. The specific question is: is the algorithm required to check whether the input collection (e.g., an array) of quads is really a set or not? If it is not a set, what should the algorithm do? Turn it into a real set before processing?

Alternatively, do we say that the responsibility of avoiding duplicates is the caller's and that the result is otherwise not defined?

The input is a dataset, which has its own semantics. It should not be up to canonicalization to ensure that a dataset has only unique quads; by definition, a dataset contains unique triples/quads. If you pass in something that does not adhere to the dataset semantics, that should not be a consideration for the algorithm, but perhaps for your implementation.

I am not sure I parse your previous comment correctly... Is what you propose that it is entirely up to the implementation what it does if the input data does not fully abide by the dataset semantics (it may just go ahead and possibly produce duplicate quads, it may include an extra step to remove duplicates, or it may return an error)?
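
For concreteness, those three behaviours could sit behind a policy switch at the API boundary. This is a hypothetical wrapper (TypeScript; neither the names nor the option exist in the spec or, as far as I know, in any implementation):

// Hypothetical wrapper showing the three options for non-set input:
// pass it through untouched, silently deduplicate, or reject it.
type DuplicatePolicy = 'pass-through' | 'deduplicate' | 'error';

function prepareInput(lines: string[], policy: DuplicatePolicy): string[] {
  const unique = [...new Set(lines)];
  if (unique.length === lines.length) return lines; // already a set
  switch (policy) {
    case 'pass-through':
      return lines; // caller's responsibility; result is implementation-defined
    case 'deduplicate':
      return unique; // enforce dataset (set) semantics before canonicalizing
    case 'error':
      throw new Error('input contains duplicate quads and is not a dataset');
  }
}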

I agree that the normative part of the algorithm should not be changed. I wonder whether a note could be added to the spec on the statement above (if we agree on it).

I'm with @gkellogg on this: the input of the algorithm is a dataset, which, by definition, cannot contain multiple copies of the same quad. This makes the spec unambiguous (and, I'm sorry to say, @iherman's implementation non-compliant w.r.t. the example above 😉 ...). But I believe Ivan agrees also:

I agree that the normative part of the algorithm should not be changed. I wonder whether a note could be added to the spec on the statement above (if we agree on it).

Adding a note would definitely be nice. Adding a test case too, maybe?

A note to the effect that the input dataset is expected to be well-formed w.r.t. RDF 1.1 Concepts might be appropriate, but I think this is implied.

The tests in #192 are certainly reasonable, but will require implementations to re-submit EARL reports.

A note to the effect that the input dataset is expected to be well-formed w.r.t. RDF 1.1 Concepts might be appropriate, but I think this is implied.

As the discussion shows, it is probably better to make it explicit...

The tests in #192 are certainly reasonable, but will require implementations to re-submit EARL reports.

Yep, that is correct. But I do not expect this to be a major issue (my PR for the change on my code is already done; running the test suite will happen after I have merged...)