w3c / EasierRDF

Making RDF easy enough for most developers

Blank nodes

dbooth-boston opened this issue · comments

Blank nodes are an important convenience for RDF
authors, but they cause insidious downstream complications.
They have subtle, confusing semantics. (As Nathan Rixham
once aptly put it, a blank node is "a name that is not
a name".) Blank nodes are special second-class citizens
in RDF. They cannot be used as predicates, and they are not
stable identifiers. A blank node label cannot be used in
a follow-up SPARQL query to refer to the same node, which
is justifiably viewed as completely broken by RDF newbies.
Blank nodes also cause duplicate triples (non-lean) when the
same data is loaded more than once, which can easily happen
when data is merged from different sources. And they cause
difficulties with canonicalization.
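
The duplicate-triple problem can be illustrated with a minimal sketch. It assumes a toy triple store modelled as a Python set, with terms as plain strings and a hypothetical `load` helper that, like a real parser, treats blank node labels as local to each load. None of these names come from the thread or from any RDF library.

```python
import itertools

_fresh_ids = itertools.count()

def load(store, document):
    """Add a parsed document to the store, scoping blank node labels to this load."""
    local = {}  # blank node label -> fresh store-wide identity, per document

    def resolve(term):
        if not term.startswith("_:"):
            return term          # IRIs and literals keep their identity
        if term not in local:
            local[term] = f"_:g{next(_fresh_ids)}"
        return local[term]

    for triple in document:
        store.add(tuple(resolve(term) for term in triple))

with_bnode = [("_:cat", "ex:name", '"Yojo"')]
with_iri = [("ex:yojo", "ex:name", '"Yojo"')]

bnode_store, iri_store = set(), set()
for _ in range(2):               # load the same data twice
    load(bnode_store, with_bnode)
    load(iri_store, with_iri)

print(len(bnode_store))  # 2: each load minted a new blank node
print(len(iri_store))    # 1: the IRI triple is deduplicated
```

The same relabelling is why a label such as `_:cat` seen in one result cannot be reused in a follow-up query: the label is only meaningful within the document or result set that introduced it.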

"A problem we have with blank nodes that might make us banish them is
the impossibility to use them in reified statements."
https://lists.w3.org/Archives/Public/semantic-web/2018Nov/0092.html

IDEA: Allow expressions as first-class entities

"Allowing expressions to be predicate arguments eliminates most cases
where blank nodes are required. In bio-ontologies, we have large numbers
of simple EL expressions that create huge numbers of blank nodes that
complicate SPARQL queries. Similarly, for representing equations like
E=mc^2, it's blank nodes or some kind of awful (from a programmer-pov)
unnecessary IDs."
https://lists.w3.org/Archives/Public/semantic-web/2018Nov/0045.html

IDEA: Separate existential quantifier (blank node) logic from RDF

"I'm starting to believe an idea of separating the existential quantifier
(blank node) logic from RDF itself to a separate semantic extension on
top of RDF should be explored. As evidenced by this discussion it is
difficult to understand and talk about. If separate, it could be
expanded by negation to have the full power of FOL as Pat suggested. If
such separation was possible and made the basic operations (merges,
canonicalization) on RDF data sets easier to reason about and implement,
it would be of quite beneficial."
https://lists.w3.org/Archives/Public/semantic-web/2018Nov/0172.html

IDEA: Add explicit scope mechanism for blank nodes

"Bnodes introduced to encode
structures like n-ary relational assertions, or lists, or some
complicated piece of OWL syntax, should have a very narrow scope
corresponding to the exact boundaries of those structures, and
hence should be ‘invisible’ from outside (which is why it is fine
to make them vanish in a higher-level syntax using [ ] or ( ).) . . . .
imagine a variant of NTriples in which a subset of
triples can be enclosed in brackets, say [ ] (or something else
if these are already taken) to indicate that any bnode ID in a
triple inside the bracket is local to those triples".
https://lists.w3.org/Archives/Public/semantic-web/2018Nov/0218.html
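
One way to read the bracket proposal is sketched below. It assumes the bracketed groups have already been parsed into a list of triple lists; the function name `scope_bnodes` and the `_:s<n>_` renaming scheme are illustrative only and are not part of the quoted proposal.

```python
def scope_bnodes(scopes):
    """Flatten bracketed groups, renaming blank node labels so each group has its own namespace."""
    flat = []
    for index, group in enumerate(scopes):
        for triple in group:
            flat.append(tuple(
                f"_:s{index}_{term[2:]}" if term.startswith("_:") else term
                for term in triple
            ))
    return flat

# Two bracketed groups that happen to reuse the label _:x.
scopes = [
    [("ex:a", "ex:p", "_:x"), ("_:x", "ex:q", '"1"')],
    [("ex:b", "ex:p", "_:x")],
]
flat = scope_bnodes(scopes)
for triple in flat:
    print(triple)
```

Within the first group both occurrences of `_:x` still denote the same node, while the `_:x` in the second group becomes a different node, so a checker could validate each group as soon as its closing bracket is read.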

"A better system, which would allow for more elaborate structures, would be to have convention of labelled scope brackets of the form [ID ]"
https://lists.w3.org/Archives/Public/semantic-web/2018Dec/att-0018/00-part
https://lists.w3.org/Archives/Public/semantic-web/2018Dec/0018.html

IDEA: Define equality and hash functions on types

"For a common approach to addresses maybe a group like Schema org could
publish ==() and hash() functions on their
https://schema.org/PostalAddress page, possibly open sourced. In the
interim they could nominate an existing service like
https://smartystreets.com/, which is an address validation API I've just
discovered, there seem to be several. At a later stage they could publish a
fuzzy matching function there too."
https://lists.w3.org/Archives/Public/semantic-web/2018Dec/0001.html
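
As a rough sketch of what published `==()` and `hash()` functions for a type might look like, the class below compares addresses on a normalised key. The field names mirror schema.org property names, but the normalisation rule (collapse whitespace, ignore case) is purely an assumption for illustration; real address matching is far more involved.

```python
from dataclasses import dataclass

@dataclass(frozen=True, eq=False)
class PostalAddress:
    street_address: str
    postal_code: str
    address_country: str

    def _key(self):
        # Assumed normalisation: collapse runs of whitespace and ignore case.
        fields = (self.street_address, self.postal_code, self.address_country)
        return tuple(" ".join(value.split()).casefold() for value in fields)

    def __eq__(self, other):
        if not isinstance(other, PostalAddress):
            return NotImplemented
        return self._key() == other._key()

    def __hash__(self):
        # Equal addresses must hash equally, so hash the same key used by __eq__.
        return hash(self._key())

a = PostalAddress("1  Main St", "02139", "US")
b = PostalAddress("1 main st", "02139", "us")
print(a == b, len({a, b}))
```

With such functions, two blank nodes describing the same address could be recognised as duplicates when data is merged, without either of them needing an IRI.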

Personally, I like the idea of having expressions avoid the need for blank nodes. Note that there could be a semantic layer on top of RDF to support such constructs, without needing to adapt the triple-based data model - perhaps this could be tied in with one of the big ideas (?) But that kind of solution would still involve blank nodes behind-the-scenes, of course.

As more background, TriG currently allows blank node labels to span multiple graphs within the same document: "BlankNodes sharing the same label in differently labeled graph statements are considered to be the same BlankNode."
https://www.w3.org/TR/trig/#terms-blanks-nodes

One practical difficulty with bnodes is their use in structures such as RDF lists or quantity+unit values, because these structures are fragile: a list can be broken in some way, or a value can end up with two units. Checking is "whole graph", not at the level of the input stream, where feedback would be more useful. With guaranteed-correct data, systems can store and handle these structures in optimal form.
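
To make the fragility concrete, here is a minimal sketch of a list well-formedness check. It assumes the whole graph is available as an iterable of string triples using the prefixed names `rdf:first`, `rdf:rest` and `rdf:nil`; the helper `list_members` is hypothetical.

```python
RDF_FIRST, RDF_REST, RDF_NIL = "rdf:first", "rdf:rest", "rdf:nil"

def list_members(triples, head):
    """Walk an RDF list from head; raise ValueError if any cell is malformed."""
    first, rest = {}, {}
    for s, p, o in triples:          # needs the whole graph before it can answer
        if p == RDF_FIRST:
            first.setdefault(s, []).append(o)
        elif p == RDF_REST:
            rest.setdefault(s, []).append(o)

    members, seen, node = [], set(), head
    while node != RDF_NIL:
        if node in seen:
            raise ValueError(f"cycle at {node}")
        seen.add(node)
        if len(first.get(node, [])) != 1 or len(rest.get(node, [])) != 1:
            raise ValueError(f"malformed list cell {node}")
        members.append(first[node][0])
        node = rest[node][0]
    return members

good = [
    ("_:l1", RDF_FIRST, '"a"'), ("_:l1", RDF_REST, "_:l2"),
    ("_:l2", RDF_FIRST, '"b"'), ("_:l2", RDF_REST, RDF_NIL),
]
broken = good + [("_:l1", RDF_FIRST, '"c"')]  # one cell with two values

print(list_members(good, "_:l1"))
```

Because the extra `rdf:first` triple could arrive anywhere in the input, the check cannot report success until every triple has been seen, which is the "whole graph" cost described above.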

Scoping to small sections of the document, like () in Turtle is an interesting possibility as it closes the checkpoint.

Something for N-triples is needed as well, but also for Turtle because determining pretty printing is the same as checking and expensive on large data (larger than RAM). A solution which allows streaming the graph out is needed (experience from both Turtle and RDF/XML output - users like pretty but at scale have to accept a reduced form).

In support of: IDEA: Separate existential quantifier (blank node) logic from RDF

Assuming P<GI<NP, the creation and verification of a digital signature of an arbitrary RDF graph cannot be done in polynomial time.

Carroll, Jeremy J. "Signing RDF graphs." International Semantic Web Conference. Springer, Berlin, Heidelberg, 2003.

RDF needs to be processable in polynomial time; otherwise critical (business) use cases are nearly impossible if 100% compliance with the RDF spec is expected.
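
The cost can be seen in a deliberately naive sketch of graph comparison. This is not the method from the cited paper or any standard canonicalization algorithm; it is a brute-force check, assuming graphs are sets of string triples, that tries every relabelling of blank nodes and therefore does factorial work in the worst case.

```python
from itertools import permutations

def blank_nodes(graph):
    return sorted({term for triple in graph for term in triple if term.startswith("_:")})

def isomorphic(g1, g2):
    """Brute force: try every bijection between the blank nodes of g1 and g2."""
    b1, b2 = blank_nodes(g1), blank_nodes(g2)
    if len(g1) != len(g2) or len(b1) != len(b2):
        return False
    for candidate in permutations(b2):       # n! candidates for n blank nodes
        mapping = dict(zip(b1, candidate))
        renamed = {tuple(mapping.get(term, term) for term in triple) for triple in g1}
        if renamed == g2:
            return True
    return False

g1 = {("_:a", "ex:knows", "_:b"), ("_:b", "ex:name", '"Bob"')}
g2 = {("_:y", "ex:knows", "_:x"), ("_:x", "ex:name", '"Bob"')}
g3 = {("_:x", "ex:knows", "_:y"), ("_:x", "ex:name", '"Bob"')}

print(isomorphic(g1, g2), isomorphic(g1, g3))
```

For graphs without blank nodes the loop runs once and the comparison reduces to plain set equality, which is the property the proposal to separate blank node logic from core RDF is after.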

@jeremycarroll: any further insights on this matter?

I think Aidan Hogan and co-authors have done the most research on blank nodes more recently:

I do not know a github handle for Aidan, but I will email him to see if I can get his attention on this. I spoke to him by phone earlier today and he is super busy with teaching right now, but intends to follow up in the next couple of days.

There is a risk of "selection bias". People who like what there is or who are "getting on with stuff" don't write papers for journals!

From https://lists.w3.org/Archives/Public/semantic-web/2018Nov/0170.html

IMO this is a good example that bnodes actually are foremost: structure.

From https://lists.w3.org/Archives/Public/semantic-web/2018Nov/0053.html

> They have subtle, confusing semantics.
I find them very simple, thanks.

We shouldn't be banning stuff that exists in the existing RDF standard just because a specific programmer or end-user profile struggles with its essence and/or utility.

A Blank Node is how RDF (a Language) delivers the functionality of an Indefinite Pronoun.

@kidehen

So you are saying that "cannot be done in polynomial time" is a struggle of "a specific programmer or end-user"?

It's not about banning blank nodes, it's about separating them into an on-top layer, so that the core layer can be properly supported & widespread.

Blank nodes should be limited to use cases actually requiring them (e.g. OWL vocabularies) but only as an optional opt-in feature.

RDF is not a language, it is "a framework for representing information in the Web", so I'm not sure of the necessity of indefinite pronouns at its core. It would be helpful for the discussion if you could elaborate on that.

@rnavarropiris,

> So you are saying that "cannot be done in polynomial time" is a struggle of "a specific programmer or end-user"?

That wasn't the point of anything I said.

> Blank nodes should be limited to use cases actually requiring them (e.g. OWL vocabularies) but only as an optional opt-in feature.

I don't know how I or anyone else has indicated otherwise re. blank nodes, i.e., they are supposed to be used when required. Use them where a pronoun would be applied in a structured sentence.

> RDF is not a language, it is "a framework for representing information in the Web", so I'm not sure of the necessity of indefinite pronouns at its core. It would be helpful for the discussion if you could elaborate on that.

RDF is an Abstract Language.

You can create RDF sentences using a variety of notations and serialize for persistent storage to a variety of document types.

RDF has nothing to do with the Web in its most basic sense i.e., it is a framework that makes systematic use of signs, syntax, and semantics for encoding and decoding information.

The subject, predicate, and object roles of an RDF sentence are basically a rendition of "parts of speech" in natural language.

An HTTP-based Web comes into play when you apply Linked Data principles to RDF sentence construction, along the following lines:

  1. Identify things using HTTP URIs
  2. Describe things using subject, predicate, and object structured sentences where the subject, predicate, and object (optionally) are identified using an HTTP URI

Notwithstanding a lot of this comment, I want to pick up on one part:

> > Blank nodes should be limited to use cases actually requiring them (e.g. OWL vocabularies) but only as an optional opt-in feature.

> I don't know how I or anyone else has indicated otherwise re. blank nodes i.e., they are supposed to be used when required. Use them where a pronoun would be applied in a structured sentence.

I want to challenge the final sentence of this strongly.
We might use "this", "that", "it", "she" etc. as pronouns extensively.
It is deeply unhelpful if, when representing the same knowledge in RDF, we choose to use blank nodes.
For me, the big thing about RDF is that it can have globally useful IDs, so I and others can make statements about resources.
It (sic) is the same reason we have nouns in natural languages. But in RDF you can't point at things with your finger, or easily use a pronoun that is understood to refer to an earlier ID (although context might help).
You say "they are supposed to be used when required" - that is fine - I think the problem is that we (that is all of us!) may not agree on the requirements.
My requirement says they should (only) be used in a context where I can't easily or even when I can't logically specify a well-founded ID.

@HughGlaser,

> I want to challenge the final sentence of this strongly.
> We might use "this", "that", "it", "she" etc. as pronouns extensively.
> It is deeply unhelpful if, when representing the same knowledge in RDF, we choose to use blank nodes.

To be crystal clear on this matter, regarding my fundamental point:
I am simply stating that Blank Nodes should be used where the RDF content creator deems appropriate. Basically, it shouldn't be removed from the existing standard.

A Blank Node brings Indefinite Pronoun functionality (and power) to RDF sentence construction.

I've [provided examples](http://kingsley.idehen.net/public_home/kidehen/Public/Linked%20Data%20Documents/Tutorials/conceptual-graphs-to-turtle-examples/), as I always do.

  # Prefix declarations are assumed here; the original post omitted them.
  @prefix schema: <http://schema.org/> .
  @prefix owl:    <http://www.w3.org/2002/07/owl#> .
  @prefix rdfs:   <http://www.w3.org/2000/01/rdf-schema#> .
  @prefix xsd:    <http://www.w3.org/2001/XMLSchema#> .

  <>
    a schema:CreativeWork, schema:WebPage ;
    schema:description "Translating Conceptual Graph Notation: [Black]<-(Attr)<-[Cat: Yojo]->(On)->[Mat]->(Attr)->[Red] to RDF-Turtle" ;
    schema:about [
      a <#Cat> ;
      schema:name "Yojo" ;
      schema:identifier <#thisYojo> ;
      schema:image <https://thumb1.shutterstock.com/display_pic_with_logo/3291197/497469223/stock-vector-black-cat-lying-on-the-red-mat-497469223.jpg> ;
      schema:color [
        <#rgbCode> "#000000"^^xsd:hexBinary ;
        schema:name "Black"
      ] ;
      <#on> [
        a <http://www.productontology.org/id/Mat> ;
        schema:identifier <#thisMat> ;
        schema:color [
          <#rgbCode> "#FF0000"^^xsd:hexBinary ;
          schema:name "Red"
        ]
      ]
    ] .

  <#on>
    a owl:ObjectProperty ;
    schema:name "on" ;
    schema:description "two perceived items in vertical juxtaposition." ;
    rdfs:domain owl:Thing ;
    rdfs:range owl:Thing .

I hope I've clarified my position, which simply boils down to leaving RDF as is, based on its ability to deal with complex "horses for courses" matters regarding structured data representation. Prematurely deprecating existing functionality on the basis of style or the usual subjective "make it simpler" argument isn't recommended.

@kidehen

Yeah, I'm pretty sure we have a strong agreement on a lot of this.
(Sorry, I should have been clearer on that in my post.)

I see no need to actually change RDF.
What I would like to see is an evolution of the way RDF is used.

I accept that there is stuff that some people want to say that needs existential quantification.
This is what I meant when I said to use them "when I can't logically specify a well-founded ID".
In my case, I never have the need to do that - that's of course because of the sort of systems we build. Which is because I am working with facts, not knowledge.
And my observation is that the developers that @dbooth-boston is talking about are unlikely to want that, even if they understand the nuances of it.
I think that what we want is RDF as the go-to for representing facts - it is already a possibility for KR stuff, but KR developers are not the middle 33% of developers (are they even 3%?).

So I want to see the use of blank nodes discouraged, rather than where they are used as what often seems the go-to answer to not being bothered to actually say what you want in the RDF.

Mind you, for your example, I don't see which of the blank nodes you have could not have an ID.
It may be that there was some RDF that you used to construct the above RDF, but the RDF you give is exactly the situation where I would discourage the use of blank nodes.
Because:
a) I can't really add to the knowledge - how would I say that the cat is Siamese, for example, or the mat is made of cotton.
b) I can't "disagree", and say that the mat looks grey to me.
And you as publisher can't do that in the future either, so it really doesn't make for a very maintainable bit of knowledge.
Best

@draggett
Yes, we have that flexibility in RDF, and that's fine.
But it doesn't mean we should prefer to use it.
RDF is not natural language, and is not used the same way.
My ability to coin a new, unique word in RDF far exceeds my ability to coin a new, unique word in English.
I suggest that the widespread use of blank nodes is a bit like talking to someone with Anomic Aphasia; and can be very frustrating - just like trying to consume and use someone else's RDF that uses lots of blank nodes.
Have you not had that problem when consuming other people's stuff?

If I stretch the natural language parallel further, to say:
"the third door on the left", and we will agree to call that "foo"
it is so much more useful.
I can phone you up later and say "foo is painted blue".

In NL we don't usually bother to name things because the overheads are not so much work.
Using a computer in RDF, the overheads are usually much higher, and so it can be very valuable to speculatively name things.

[I view it like the overheads of doing eager evaluation in a lazy evaluation context, in fact.
In normal human interaction you take the potential future hit of not creating an ID for that door because you may not need to, and the overheads of re-identifying the door in normal conversation are low (I mean, how many doors have we been talking about?).
In a 'puter, when trying to keep track of significant numbers of doors that have no names, the overheads of reliably re-identifying a particular door can be enormous, so eager evaluation is much preferred. There are reasons why SK-combinator machines etc. run with the speed of continental drift (to quote David Turner).
And no, I am not proposing we do strictness analysis, abstract interpretation, lattices etc. on blank node naming ;-)
]

I've opened issue #48 to take a step back from the specifics of blank nodes and discuss different kinds of identifiers in terms of how they are used, as requirements for the underlying framework.

@HughGlaser,

> So I want to see the use of blank nodes discouraged, rather than where they are used as what often seems the go-to answer to not being bothered to actually say what you want in the RDF.

Remember, you can apply reasoning and inference to produce 5-Star Linked Data (on a forward-chained basis) from basic RDF that may or may not be littered with Blank Nodes.

Blank Nodes are a key part of RDF flexibility. Like most things, misuse leads to problems. This is why I look at all of this through "horses for courses" context lenses.

I think we both agree that RDF is good as is i.e., it doesn't need to be tinkered with, especially not on the basis of compatibility with a fundamentally different approach to modeling as espoused by "Property Graphs" (a totally confusing moniker to me!).

Personally, I believe we just need more tooling, educational collateral, dog-fooding, and cooperation :)

> I think we both agree that RDF is good as is

Good yes, but not good enough. To quote: "1. The goal is to make RDF -- or some RDF-based successor -- easy enough for average developers (middle 33%), who are new to RDF, to be consistently successful. 2. Solutions may involve anything in the RDF ecosystem: standards, tools, guidance, etc. All options are on the table. 3. Backward compatibility is highly desirable, but less important than ease of use."

> We shouldn't be banning stuff that exists in the existing RDF standard just because a specific programmer or end-user profile struggles with its essence and/or utility.

Agreed. But we should consider deprecating something if: (a) it raises the entry barrier to RDF adoption; and (b) reasonable alternatives are available.

Please do not conflate blank nodes (as existential variables) with the syntactic conventions of () and [] in Turtle that currently generate implicit blank nodes. I fully agree that we need those syntactic conveniences. But we do not necessarily need the underlying blank nodes that are generated. There is ample evidence showing that blank nodes as existential variables are not actually needed in the vast majority of cases: URIs could be used instead.

Obviously we would not want to force users to manually create a URI everywhere they wish to use () or []. That would be far too tedious. But just as tools auto-generate blank node labels, they could auto-generate URIs similarly. Details would have to be worked out, of course, but it is a realistic possibility.
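
A minimal sketch of such auto-generation is shown below. It follows the Skolem IRI convention described in RDF 1.1 Concepts, which uses the `/.well-known/genid/` path; the `skolemize` helper, the `https://example.org` base and the use of random UUIDs are assumptions made for illustration.

```python
import uuid

def skolemize(triples, base="https://example.org"):
    """Replace each blank node label with a freshly minted IRI, consistently within one call."""
    minted = {}

    def resolve(term):
        if not term.startswith("_:"):
            return term
        if term not in minted:
            minted[term] = f"<{base}/.well-known/genid/{uuid.uuid4().hex}>"
        return minted[term]

    return [tuple(resolve(term) for term in triple) for triple in triples]

triples = [("_:b1", "ex:p", "_:b2"), ("_:b2", "ex:q", "_:b1")]
out = skolemize(triples)
for triple in out:
    print(triple)
```

Once minted, these IRIs are stable names that can be used in follow-up queries. Random identifiers do not by themselves prevent duplicates if the original source is parsed again, so a deterministic naming scheme would be one of the details still to be worked out.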

But I think the most important point around blank nodes is that users should not have to ever think about them or know about them. If blank nodes exist at all, they should be invisible to the user.

Hi @kidehen

> I think we both agree that RDF is good as is i.e., it doesn't need to be tinkered with, especially not on the basis of compatibility with a fundamentally different approach to modeling as espoused by "Property Graphs" (a totally confusing moniker to me!).

Well, I wouldn't mind some tinkering :-), but I think that is probably not the best way to go to achieve the objectives we are discussing.

> Personally, I believe we just need more tooling, educational collateral, dog-fooding, and cooperation :)

Certainly a good suggestion.

I think the problem can then be characterised as "what is the content of the educational collateral"?

I thought I would look at
https://en.wikipedia.org/wiki/Semantic_Web
to see what it said.
It has an example of RDF (from RDFa).
And what is the RDF fragment about? Who knows? It's a blank node.
To be very detailed (in my understanding):
It represents the idea of a Paul Schuster in Dresden etc.
That is not helpful to a developer trying to do stuff with data.
I might be able to find a Person called Paul Schuster who lives in Dresden, but this graph can never tell me anything extra about any Paul Schuster, because either all the properties are the same (in which case it might be the same one the publisher was thinking of), or they are different, in which case I know the very small fact (which is probably not useful) that my Paul Schuster is not the one the publisher was thinking of (edit: if there are some appropriate OWLish rules, I am guessing, but maybe not even then.)
So for most practical purposes that graph is a waste of space - the very thing it is all about is uncertain.
What I want is that people understand that RDF like that (although legal etc.) is not the best way to do it.
If the RDF example there had a decent URI for the person, it would be much clearer to someone trying to learn and understand and then use RDF, what they might be able to achieve with it.

@dbooth-boston

> There is ample evidence showing that blank nodes as existential variables are not actually needed in the vast majority of cases: URIs could be used instead.

I totally agree with that, and that exactly is why splitting RDF into profiles is a valid idea, which would lower the entry barrier both for consumers as well as tooling producers.

Something along the lines of RDF 2.0 with an opt-in RDFbn profile, which would itself provide backwards compatibility with RDF 1.1.

@kidehen

> Personally, I believe we just need more tooling, educational collateral, dog-fooding, and cooperation :)

I agree that the lack of tooling is one of the main issues, but a high entry barrier together with big issues (non-referable "Linked Data", non-polynomial processing times for basic business use cases) produced by the current blank nodes approach hinder (imo) the development of such tools.

> > I think we both agree that RDF is good as is

> Good yes, but not good enough. To quote: "1. The goal is to make RDF -- or some RDF-based successor -- easy enough for average developers (middle 33%), who are new to RDF, to be consistently successful. 2. Solutions may involve anything in the RDF ecosystem: standards, tools, guidance, etc. All options are on the table. 3. Backward compatibility is highly desirable, but less important than ease of use."

I accept the notion that RDF is a problem.

We are inadvertently blaming RDF for the issues arising from the missing RDF Applications Web (or Knowledge Graph).

There are no tweaks to RDF that will fix the issue outlined above. Every other technology that has negated the problems afflicting RDF has done so via Application Directories and Catalogs combined with a library of educational literature.

I am not speculating here, I am speaking from experience over the last 24+ years. There is a set pattern which simply hasn't happened with RDF en masse:

  1. Introduce Technology
  2. Produce and Distribute Educational Literature
  3. Develop Applications covering developer, end-user, educator profiles.

> > We shouldn't be banning stuff that exists in the existing RDF standard just because a specific programmer or end-user profile struggles with its essence and/or utility.

> Agreed. But we should consider deprecating something if: (a) it raises the entry barrier to RDF adoption; and (b) reasonable alternatives are available.

Deprecating is basically banning in my world.

Why can't we let stuff evolve naturally? For instance, those who don't want to work with Blank Nodes simply don't use them.

> Please do not conflate blank nodes (as existential variables) with the syntactic conventions of () and [] in Turtle that currently generate implicit blank nodes.

You've lost me on that one. My blank node examples include pictorials. RDF-Turtle is just a preferred notation I use for my examples.

An RDF-processor translates RDF sentences crafted using a notation. That's what I demonstrate with our OSDS tool.

> I fully agree that we need those syntactic conveniences. But we do not necessarily need the underlying blank nodes that are generated.

See my comment above.

> There is ample evidence showing that blank nodes as existential variables are not actually needed in the vast majority of cases: URIs could be used instead.
>
> Obviously we would not want to force users to manually create a URI everywhere they wish to use () or []. That would be far too tedious. But just as tools auto-generate blank node labels, they could auto-generate URIs similarly. Details would have to be worked out, of course, but it is a realistic possibility.
>
> But I think the most important point around blank nodes is that users should not have to ever think about them or know about them. If blank nodes exist at all, they should be invisible to the user.

"Horses for courses" is a powerful feature of RDF. We shouldn't tamper with this. IMHO.

Let me further explain this:

> > Please do not conflate blank nodes (as existential variables) with the syntactic conventions of () and [] in Turtle that currently generate implicit blank nodes.

> You've lost me on that one.

The syntactic conventions of () and [] in Turtle are a very important convenience. Currently they generate implicit blank nodes at the triple level -- "implicit" because they have no visible label at the Turtle level, in contrast with explicit blank nodes such as _:b1 . Two important things to note about implicit blank nodes:

  • they could just as well produce URIs at the triple level, instead of blank nodes; and
  • implicit blank nodes are not the ones that cause the major difficulties. Explicit blank nodes (like _:b1) are the ones that lead to blank node cycles in the graph, and prevent predictably efficient canonicalization and processing.

Furthermore, URIs are just plain better than blank nodes in all but a vanishingly few cases. They are stable names that can be used reliably in follow-up SPARQL queries, and they prevent duplicate (non-lean) triples when the same data is loaded twice.

In other words, the convenience that we crave is not because blank nodes provide existential variables. It is almost entirely the convenience of the syntactic conventions of () and [], which actually have nothing to do with blank-nodes-as-existential-variables. If we tease these features apart, I believe we can have our cake and eat it too: the convenience of () and [] without the problems that unrestricted blank nodes bring.

@dbooth-boston,

> In other words, the convenience that we crave is not because blank nodes provide existential variables. It is almost entirely the convenience of the syntactic conventions of () and [], which actually have nothing to do with blank-nodes-as-existential-variables. If we tease these features apart, I believe we can have our cake and eat it too: the convenience of () and [] without the problems that unrestricted blank nodes bring.

I don't know why you are assuming that my point is about syntax. My point is about the fact that I can whimsically scribble RDF sentences on-the-fly. Likewise, I can construct transformations informed by inference rules as and when required.

This discussion is a classic example of the issues arising from the lack of awareness that afflicts existing RDF productivity tools, which creates the illusion that they don't exist.

You've made assumptions about my intentions because my intentions aren't clear to you.

Everything you've described about producing URIs from sentences crafted using RDF-Turtle notation ultimately belongs to the Applications rather than Language (and associated inscription notations and content serialization formats) bucket.

RDF isn't the problem.

The problem with RDF boils down to the difficulty of finding Applications (various genres) and Educational Literature (for various audiences) that help folks better understand and appreciate its value proposition.

BTW -- Nothing stops the creation of a new research area that creates something like a Markdown for RDF. That's where a lot of these stylistic issues belong. IMHO.

Related

  • Early comment with examples -- demonstrating how RDF enables better understanding of conversations about logic and existential quantification without introducing the immediate need for URIs (be it urn: or http: or any other scheme based)
  • OpenLink Structured Data Sniffer -- Example of an RDF productivity tool for content discovery, transformation, and exchange
  • OpenLink Structured Data Editor -- Example of an RDF productivity tool for content editing
  • YouID -- Example of an RDF productivity tool for credentials generation
  • URIBurner -- Example of an RDF-based service for generating 5-Star Linked Data from a wide variety of data sources
  • dokie.li -- Example of an RDF productivity tool for annotations, reviews, social-sharing, content editing etc.
  • And many more.. that folks simply aren't aware of because we don't have an applications and services bubble in the LOD Cloud

I think this is an important comment about "deprecation" as a term:

> > Agreed. But we should consider deprecating something if: (a) it raises the entry barrier to RDF adoption; and (b) reasonable alternatives are available.

> Deprecating is basically banning in my world.
>
> Why can't we let stuff evolve naturally? For instance, those who don't want to work with Blank Nodes simply don't use them.

We need to be clear about terms.
In my world, deprecating is not banning - it is more a way of recording what is a preferred term or way of doing things and yes, discouraging other ways and terms.
It is about letting stuff evolve naturally, which is what you say in the next sentence, @kidehen, although with some evolutionary pressure to go one particular way.
Once things have evolved naturally, there comes a point at which you need to capture that new state - ultimately that leads to Standards, but before that there are Best Practices and then Deprecation, perhaps.
You don't get "banning" until you have Standards.

Of course, I may be the odd one out - I would like to know if I am, please.

> In my world, deprecating is not banning - it is more a way of recording what is a preferred term or way of doing things and yes, discouraging other ways and terms.

Every time I've encountered deprecation in the real-world it has amounted to banning i.e., newer tools treat what's deprecated as invalid which breaks existing stuff.

Personally, speaking about "best practices" for using tech in a specific context is much safer. For example, you don't need Blank Nodes when publishing Linked Data if the use-case is dataset publication.