w3c / rdf-canon

RDF Dataset Canonicalization (deliverable of the RCH working group)

Home Page:https://w3c.github.io/rdf-canon/spec/

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Random graph theorist's remarks on WD 22 May 2023

lars-hellstrom opened this issue · comments

Some points in the text that I feel a need to remark upon:

You never explain the abbreviation c14n. In the text proper it only seems to appear as part of identifiers, so technically not in need of an explanation, but it's still rather mysterious until you google it.

§1, explanation of NP-intermediate

it is believed to be an NP-Intermediate problem, that is, neither known to be solvable in polynomial time nor NP-complete

Whether something is an NP-intermediate problem does not depend on what is known about it. The 'believed' already covers for the fact that its status is not known.

As a scholar, I would also find it natural if there somewhere in the document was explained how (undirected) Graph Isomorphism reduces to RDF Graph C14n. Easiest reduction I can think of would be that each vertex and each edge has a blank node, and the triples all encode the incidence relation (i.e., all predicates are the same, having vertex node as subject and edge node as object). If however you say that as a matter of style going into such detail is not appropriate, then I certainly would believe that.

§3.1, definition of gossip path

The definition here is of a gossip path, not some uniquely determined the gossip path. (Later, the URDNA2015 algorithm selects one gossip path, but the details of that is singled out are not found here.)

At times the later text also seems to identify a gossip path (graph-theoretical object) with a string representing that gossip path. You may want to consider promoting that latter concept to something explicitly defined; it could make the exposition clearer.

§3.1 or §3.2: Dataset isomorphism

Dataset isomorphism is not mentioned as something defined (whether in this document or by reference), but it really should be, since it is central to defining the problem of canonicalisation.

§4, editor's note

I strongly support this sentiment, that the algorithm should go into an appendix.

Things that should be normatively defined include:

  • canonicalisation function: function mapping RDF datasets to isomorphic RDF datasets, with the property that two datasets are mapped to the same thing if and only if they are isomorphic.
  • canonicalisation partial function: function mapping some RDF datasets to isomorphic RDF datasets. One probably wants the situation that c14n(G) defined but c14n(H) undefined to mean that G and H are not isomorphic. (ISTR there was some issue that mentioned having the algorithm give up if presented with too difficult problems, but right now I can't find the number. When the algorithm is configured to do that, it implements a partial function.)
  • runtime guarantees for algorithm. Where DoS is a concern, one wants to know an algorithm at least terminates within a reasonable time.

§4.6 and 4.8, degree as in "Hash First Degree Quads" and "Hash N-Degree Quads"

This use of 'degree' is just plain wrong, although I fear it might be too deeply entrenched for you to consider changing. The standard meaning of 'degree' in the context of graphs is as a synonym of valency (number of incidences of edges to the vertex under consideration). The concept at play here is rather that of radius of the neighbourhood being considered. I suppose for bounded valency graphs you could arrive at the radius being the degree of the polynomial bound on the cardinality of the neighbourhood considered, but even that is a far-fetched choice of terminology; depth (as in 'recursion depth') seems more natural.

However, the risk of confusion is probably low.

§4.8.1, definition of shortest gossip path

Am I reading this right, that the shortest gossip path need not be the path of shortest length in the graph sense, if some predicates have very long IRIs and other predicates very short IRIs? Two 50 character IRIs are longer than three 20 character IRIs. (Of course, I may have missed some hashing step that would give components uniform length.)

Multiple graphs dataset example?

As far as I can tell, all examples only have the default graph in the dataset. With quads, the graph is just one more endpoint of the edge, so probably no increased complexity as far as cases in the algorithm are concerned, but it looks a bit strange that it speaks about quads when throughout there are just triples in the examples (especially when serialised for hashing).

I find myself wondering how hard the algorithm would have to determine that a Möbius ladder graph is not isomorphic to a circular ladder graph of same order. (For graph isomorphism to appear as a subproblem of c14n, consider a dataset with two graphs whose names are blank nodes. At what point can it tell those nodes get different hashes?) Or for that matter, Möbius ladder to a circular ladder of order 2 more. The algorithm does not directly use multiplicity of hashes, does it? Only whether hashes are unique or not. Hence a 10-vertex 3-regular graph is not so obviously distinct from a 12-vertex 3-regular graph.

Canonicalise versus normalise

I would suggest 'canon*' as the better term. Whereas 'normal form' certainly describes what you are going for, just plain 'normalise' admits many alternative meanings, for example on the lines of "getting rid of silly parts we don't care about". In the context of graph colouring, "normalising" could easily mean "reduce edge multiplicities to 1".

Admin question: shouldn't this issue be closed, now that the various points have been "delegated" to individual issues?

I suggest that @gkellogg change the From https://github.com/w3c/rdf-canon/issues/111: in the delegated issues to From @lars-hellstrom in https://github.com/w3c/rdf-canon/issues/111: such that @lars-hellstrom receives notifications of updates to those issues and it's clear to whom we should address, "Does this handling sufficiently address your issue/question?" on each.

I think to change this, we need feedback specifically to @lars-hellstrom so that we can change the status of this PR to another "wr" state. @lars-hellstrom is welcome to follow the delegated issues if he has more concerns; I don't think we need to do this by proxy, as many of the issues we were addressing anyway.

@lars-hellstrom As co-chair of the RCH WG, I'd like to thank you for your review and feedback about the document! The group discussed your points on a call this week, here are minutes: https://www.w3.org/2023/06/13-rch-minutes.html#t04

As you can see above, @gkellogg created multiple more granular issues, in order to address specific points you raised.

You are of course welcome to comment more on the individual issues in this repo. Perhaps you would also be interested in joining the WG as a member and contributing PRs yourself, to help address some of the topics you identified?

@lars-hellstrom, thanks again for your comments. Separate issues were raised and addressed based upon this input.