jacoscaz / quadstore

A LevelDB-backed graph database for JS runtimes (Node.js, Deno, browsers, ...) supporting SPARQL queries and the RDF/JS interface.

Home page: https://github.com/jacoscaz/quadstore

BACKEND: Blank nodes should not match existing data

gsvarovsky opened this issue

When inserting data containing blank nodes, the blank subject or object is stored verbatim with the same blank node identifier as the input. This breaks the requirement that blank nodes are scoped to the input document. For example (I tried adding this as a unit test in quadstore.prototype.put.js):

    it('should not re-use blank nodes', async function () {
      const { dataFactory, store } = this;
      await store.put(dataFactory.quad(
        dataFactory.blankNode('_:s'),
        dataFactory.namedNode('ex://p'),
        dataFactory.namedNode('ex://o'),
        dataFactory.namedNode('ex://g'),
      ));
      await store.put(dataFactory.quad(
        dataFactory.blankNode('_:s'),
        dataFactory.namedNode('ex://p'),
        dataFactory.namedNode('ex://o'),
        dataFactory.namedNode('ex://g'),
      ));
      const { items: foundQuads } = await store.get({});
      should(foundQuads).have.length(2);
    });

This test fails because the two invocations of put are using the same blank node label. Instead, they should result in different quads with disjoint subjects.

For more complex examples, such as lists, the accidental re-use of blank node identifiers (for example after a re-start) could badly affect data integrity.

I just asked on the RDF/JS gitter room about this. I don't remember reading anything about blank node collisions in the low-level spec but, ideally, this is something that all implementations should address in a uniform way.

As for the rest of the quadstore API, what would be the scope of our "collision avoidance" strategy? Collision avoidance in a single .multiPut() call seems obvious but what about collision avoidance across multiple individual .put() calls?

I suspect that this would be a good use-case for something like what Node.js has done for http agents and TCP connection pooling (https://nodejs.org/dist/latest-v14.x/docs/api/http.html#http_new_agent_options and https://nodejs.org/dist/latest-v14.x/docs/api/http.html#http_http_request_url_options_callback).

We could modify all write methods to receive an optional blankNodeCollisionAvoidance object, itself an instance of the BlankNodeCollisionAvoidance class, that would be used by the store to process blank node labels into non-colliding labels. This object would act like a cache of "known" labels.

In your example, omitting this object or passing two different instances would lead the store to process those labels into non-colliding ones. However, passing the same instance to both .put() calls would lead the store to process both s labels into the same final label.
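
A minimal sketch of what such an object could look like (the class name comes from the paragraph above; the nanoid dependency and the label format are my own assumptions):

    const { nanoid } = require('nanoid');

    class BlankNodeCollisionAvoidance {
      constructor() {
        // maps incoming blank node labels to store-unique labels
        this.cache = new Map();
      }
      map(label) {
        let mapped = this.cache.get(label);
        if (mapped === undefined) {
          mapped = `b-${nanoid()}`;
          this.cache.set(label, mapped);
        }
        return mapped;
      }
    }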

@namedgraph thank you for pitching in! Yes, I think we'll end up storing both some sort of internal id plus the original label to preserve the latter while avoiding collisions through the former. However, I'd like to provide a mechanism allowing for re-utilization of the same internal id across different writes when needed.

For example, it is likely that importing from a stream will require blank nodes with the same label to end up having the same internal id, even though (in our case) importing from a stream happens through separate writes of one or multiple quads. In this case, we would need to find a way for quadstore to remember both the label and internal id of previously written blank nodes so that encountering the same label in a different write would lead to the same internal id being used.

Is it necessary to remember the original label? I don't think Jena does that. Nor any other store that I can remember really.

True, we only need to remember original labels insofar as we're looking for them while performing further write operations. We don't need to store them for the purpose of returning them, as doing so could lead to the very collisions we're trying to avoid.

Personally I think the first incremental step is for scope to exactly track API calls, with no changes. So:

  • every put has its own scope (is independent), so re-using blank nodes in quad arguments to separate calls results in different internal ids.
  • multiPut defines a scope, so quads in the array argument can share blank node names to create structure.
  • putStream is not atomic, so the only sane thing to do is have every quad be its own scope. This precludes creating structure with blank nodes, but if needed you can implement your own batching with multiPut.

IMHO it's dangerous to separate atomicity from scope – you could end up in a big pickle with errors and crashes.

putStream is not atomic, so the only sane thing to do is have every quad be its own scope. This precludes creating structure with blank nodes, but if needed you can implement your own batching with multiPut.

This would break the fairly common use-case of streaming quads from a file into a store. I do share your concern WRT separating atomicity from scope but I think that putStream defines a scope just as much as multiPut does from the perspective of an outsider.

@gsvarovsky In your example, you're actually creating named nodes instead of blank nodes.

dataFactory.namedNode('_:s')

should become

dataFactory.blankNode('_:s')

But even then, you'll probably still get just a single result in the test.


This does seem like an expected outcome to me, though. I would consider multiple calls to this low-level put as happening in the same document scope, which means that blank nodes can be shared across quads. (This is the only way to connect bnodes across quads with each other.)

You could easily fix your case by calling the following to create blank nodes with unique labels:

dataFactory.blankNode()

Alternatively, a higher-level insertion mechanism like SPARQL Update could be used, which takes care of bnode scoping.
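
For illustration, a minimal sketch of both points, re-using the dataFactory and store names from the test above (the 'cell0' label and the literal are only placeholders):

    // because all put() calls share one document scope, a blank node can be
    // used to link quads written in separate calls, e.g. the head of an RDF list
    const cell = dataFactory.blankNode('cell0');
    await store.put(dataFactory.quad(
      cell,
      dataFactory.namedNode('http://www.w3.org/1999/02/22-rdf-syntax-ns#first'),
      dataFactory.literal('a'),
    ));
    await store.put(dataFactory.quad(
      cell,
      dataFactory.namedNode('http://www.w3.org/1999/02/22-rdf-syntax-ns#rest'),
      dataFactory.namedNode('http://www.w3.org/1999/02/22-rdf-syntax-ns#nil'),
    ));

    // when no linking across quads is needed, an argument-less blankNode()
    // call returns a blank node with a freshly generated, unique label
    const unrelated = dataFactory.blankNode();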

you're actually creating named nodes instead of blank nodes.

Whoops! corrected by edit. Yes, the outcome is the same.

If every call to put has the same document scope, then how do you define a new scope when you need it?

This would require a new API feature, like an inverse of @jacoscaz's blankNodeCollisionAvoidance object. My point is that such an API needs to be carefully specified so that it's clear how it relates to transaction atomicity. If I'm in the middle of creating structure using blank nodes and puts, how do I sanely recover after a process crash? – on restart I have no way to reference the blank nodes used in the partially-created structure. The case of streaming from a file has this problem.

Perhaps another approach would be to offer an explicit transaction API like Jena's. A transaction is both atomic and defines a document scope for blank nodes. Internally this would use a sustained Leveldown chainedBatch. Using any of the available operations within a transaction would contribute to the batch, which is written to the backend on commit. (A downside is that such a 'transaction' would not allow you to read your own writes during the transaction, but perhaps this is a worthwhile compromise.)
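
A rough sketch of how such an API might read; to be clear, none of these methods exist in quadstore today, this is purely hypothetical:

    // hypothetical transaction API, sketched on top of a leveldown chained batch;
    // quadA, quadB, quadC stand for RDF/JS quads created elsewhere
    const txn = store.transact();        // opens a chained batch internally
    await txn.put(quadA);                // writes accumulate in the batch...
    await txn.multiPut([quadB, quadC]);  // ...and share the transaction's bnode scope
    await txn.commit();                  // the batch is written atomically
    // caveat: reads issued while the transaction is open would not see these writes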

However this still doesn't fully solve the file streaming case, if it's a big file that doesn't fit in memory and so must be processed in multiple transactions. For this case I think skolemization is the best approach. The file reader replaces each blank node with a genid during streaming, and maintains its own map of blank node label to genid. It is able to ensure that this map survives a restart by whatever means available (including in the backend using a preWrite).
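
A minimal sketch of that skolemization step, assuming a nanoid-based genid and leaving persistence of the map to the application (the genid IRI prefix is just a placeholder):

    const { nanoid } = require('nanoid');

    // maps blank node labels found in the file to stable skolem IRIs;
    // the application persists this map so that it survives a restart
    const genidMap = new Map();

    const skolemizeTerm = (term, factory) => {
      if (term.termType !== 'BlankNode') {
        return term;
      }
      let iri = genidMap.get(term.value);
      if (iri === undefined) {
        iri = `http://example.com/.well-known/genid/${nanoid()}`;
        genidMap.set(term.value, iri);
      }
      return factory.namedNode(iri);
    };

    // applied to each quad while streaming, before it reaches the store
    const skolemizeQuad = (quad, factory) => factory.quad(
      skolemizeTerm(quad.subject, factory),
      quad.predicate,
      skolemizeTerm(quad.object, factory),
      skolemizeTerm(quad.graph, factory),
    );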

Thank you @gsvarovsky, @namedgraph and @rubensworks for pitching in!

Based on your arguments, I would be inclined to do the following:

  1. keep the store as it is, with all calls effectively happening in the same document scope;
  2. implement scoping at the RDF/JS level (rather than at the store level) in a separate library.

The final result would be something like the following:

// scoping quads before a multiPut()
const scope = scopingLibrary.createScope();
const scopedQuads = scope.process([ /* RDF/JS quads */ ]);
store.multiPut(scopedQuads);

// scoping quads flowing through a putStream()
const streamScope = scopingLibrary.createScope();
store.putStream(streamScope.createProcessingStream(rdfjsQuadStream));

What do you think?

The scoping library could even be designed in such a way as to be able to bootstrap scopes from a store and serialize scopes to RDF/JS quads to be persisted to the store atomically with the quads being processed.

Interesting. Perhaps go even further: should scope be a first-class citizen in RDF/JS? In my recent travels I have been frustrated by this concept not being well defined (of course, I could just have missed some important reference). Is it worth raising this with the wider community?

https://www.w3.org/2011/rdf-wg/wiki/User:Azimmerm/Blank-node-scope
https://www.w3.org/2011/rdf-wg/wiki/User:Azimmerm/Blank-node-scope-again

As a concrete use-case, for consideration.

As an application, I generate a JSON-LD document containing a sub-structure defined as a @list. (Note that my JSON document does not contain blank nodes.)

I process a document using a JSON-LD processor, which generates an RDF list, containing blank nodes. I use quadstore to, erm, store the quads.

Then, I restart, and do the same operations with a new JSON-LD document in a new session. The same blank node labels are generated, and the list data from the first document is corrupted.

(I am just starting to work on list support in m-ld, and I may force skolemization, so I may not need any special support from quadstore. I will keep you updated if any definite requirements arise. Thanks, as always, for your collaboration!)

@gsvarovsky I think that your JSON-LD example is a perfect representation of how the RDF ecosystem can often feel counter-intuitive for those who come to it from a non-academic background (like myself).

Reading those proposals makes me more convinced of my own proposed solution: I think the best way to counter the lack of a clear definition of blank node scoping is to force developers to explicitly define their own scopes whenever needed. Making scoping as explicit as possible would lower the cognitive barrier to entry IMHO.

EDIT: I hadn't realized that the expression "utterly bananas" could be interpreted to have racist undertones - oops!

@gsvarovsky when you have a moment, could you please have a look at the quad-scoping branch? The createScope() method documented at https://github.com/beautifulinteractions/node-quadstore/tree/quad-scoping#blank-nodes-and-quad-scoping should provide a decent solution to this issue.

Looks elegant, @jacoscaz. Some thoughts:

1. Index size

Each nanoid is 21 bytes, plus 2 for the _: prefix, × 6 indexes = 138 bytes per blank node.

2. Restart

If I'm in the middle of creating structure using blank nodes with a scope, and the process crashes, I am in a big pickle. Even if I have tracked my position in the data upload, I don't know what blank node identifiers were used.

  • I can't re-start the upload because the remaining data will not link to the existing data.
  • I can't revert the upload because I can't (necessarily) find the data.

3. Export

Internal blank node identifiers are exposed when reading from the quadstore. This makes them effectively skolemised, because they can be used in new data to link to existing data. However, if you use a scope when inserting the new data, they will lose their identity again. At this point, intuition has taken many steps in its long walk on a short pier.

4. Default Scope

The regular write methods don't use a scope, so the blank nodes go in verbatim as before. This means that if there is any chance of blank nodes in your dataset, you have to be very careful to read the scope documentation. In other words, the default behaviour is still incorrect IMO. On the other hand, since blank nodes are so ugly already maybe this is fine.

Ideas

  1. Use an incrementing integer, stored as a plain key-value and updated with every batch, to generate internal blank node identifiers. Maybe optional, to trade write performance against storage (would need to be tested). Some care would be needed in case of concurrent writes. (A rough sketch follows this list.)
  2. Apply a scope to the regular write methods if one is not provided – needs a small change in the API, I think.
  3. Apply a scope to read methods too, which generates new blank nodes that have no correspondence with internal ones – in other words, hide internal blank nodes completely.
  4. Make scopes themselves persist-able (optionally) so that worriers like me can safely recover from crashes.
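
A rough sketch of idea 1, written against a promise-based levelup instance with utf8 value encoding; the !counters!bnode key name is made up:

    // reads the current counter and returns both the allocated labels and the
    // batch operation that persists the new counter value, so that the caller
    // can include that operation in the same batch as the quads being written
    const allocateBlankNodeLabels = async (db, howMany) => {
      let current = 0;
      try {
        current = parseInt(await db.get('!counters!bnode'), 10);
      } catch (err) {
        // key not found: start counting from zero
      }
      const labels = [];
      for (let i = 0; i < howMany; i += 1) {
        labels.push(`b${current + i}`);
      }
      const counterOp = {
        type: 'put',
        key: '!counters!bnode',
        value: String(current + howMany),
      };
      return { labels, counterOp };
    };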

@afs might have some insight here.

TL;DR - don't like how bnodes work - don't use bnodes :)

Hi all!

@gsvarovsky:

  1. Use an incrementing integer, stored as a plain key-value and updated with every batch, to generate internal blank node identifiers.

A nanoid-labeled blank node is still significantly smaller than the average named node and seems to be comparable to shortened named nodes when using prefixes. I don't think slightly longer blank nodes are likely to become an issue on their own, except as part of the bigger issue of the comparatively low quads-per-MB ratio that can be achieved with quadstore's indexing strategy.

  2. Apply a scope to the regular write methods if one is not provided – needs a small change in the API, I think.

I do agree that the default behavior is not correct, but it's also simple to maintain, easily understood and easily extendable. Furthermore, I suspect that it matches expectations of how a low-level RDF/JS library should work, as per @rubensworks' comment. I think that forcing a scope when none is provided would break a lot of assumptions, both spoken and unspoken.

  3. Apply a scope to read methods too, which generates new blank nodes that have no correspondence with internal ones – in other words, hide internal blank nodes completely.

I agree in principle but I can't come up with a sane way to do this without adding unreasonable amounts of complexity.

  4. Make scopes themselves persist-able (optionally) so that worriers like me can safely recover from crashes.

At what point should a scope be persisted? For example, imagine we're importing a stream. The scope would have to be persisted to disk whenever its internal cache is updated, which would mean serializing its entire cache quite frequently... Actually, now that I think of it, the scope could be persisted to disk incrementally, with each newly-cached blank node persisted in the (K, V) form (scope-<scopeId>-<originalLabel>, <newLabel>) in the first batch operation that contains it.
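
A sketch of what that incremental persistence could look like, expressed as extra operations appended to the write batch (the key layout follows the (K, V) form above; the actual implementation may well differ):

    // given a scope id and the blank node labels mapped for the first time in
    // the current write, produce the batch operations that persist the mappings
    const scopePersistenceOps = (scopeId, newMappings) =>
      Object.entries(newMappings).map(([originalLabel, newLabel]) => ({
        type: 'put',
        key: `scope-${scopeId}-${originalLabel}`,
        value: newLabel,
      }));

    // re-hydrating the scope later means iterating over the
    // scope-<scopeId>- key range and re-filling the in-memory cache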

In any case, preWrite should make this relatively easy, although I suspect that persist-able scopes would benefit from a (much) more integrated API.

const scopeId = await store.createScope(); // inits a new scope
const scopeId = await store.loadScope('some-id'); // re-hydrates a previously-created scope
await store.putStream(stream, { scope: scopeId }); // updates the scope with each new blank node
await store.multiPut(quads, { scope: scopeId }); // updates the scope with each new blank node
await store.deleteScope(scopeId); // drops the scope

Does it even make sense to provide scoping support without persist-able scopes?

TL;DR - don't like how bnodes work - don't use bnodes :)

I think this is a valuable suggestion, @namedgraph. It could be that scoping is simply too dependent on each specific use-case to be easily implemented in a low-level library such as quadstore.

WRT a more integrated API, my example works better with explicit scope objects:

const scope = await store.createScope(); // inits a new scope
const scope = await store.loadScope('some-id'); // re-hydrates a previously-created scope
scope.id; // can be used as a reference to re-hydrate the scope through store.loadScope()
await store.putStream(stream, { scope }); // updates the scope with each new blank node
await store.multiPut(quads, { scope }); // updates the scope with each new blank node
await store.deleteScope(scope); // drops the scope, can also accept a scope id

Had a bit of time today, so I decided to have a go at the API from my previous comment, addressing what I think is the most critical point:

  4. Make scopes themselves persist-able (optionally) so that worriers like me can safely recover from crashes.

I ended up using something very similar to what preWrite does, but in a less generic way due to performance concerns; the core of which happens here: https://github.com/beautifulinteractions/node-quadstore/blob/e3362a85fa24d4e93a49b6a3e432ac092dac340e/lib/scope/index.ts#L82-L90 .

I am surprised - this has basically no effect on import performance but it still allows scopes to be reloaded at a later time without issues (https://github.com/beautifulinteractions/node-quadstore/blob/e3362a85fa24d4e93a49b6a3e432ac092dac340e/README.md#quadstoreprototypeloadscope ).

@gsvarovsky when / if you have a moment, your feedback would be most welcome.

Hi @jacoscaz, great news that the persistent scope is not a significant performance bottleneck. It looks great & the API with the scope in the opts object makes sense.

Just for your interest (I should have mentioned it before) m-ld deals with a similar situation. Nothing to do with blank nodes (we skolemise), but in a replicated dataset, operations can be incoming from other clones at any time. We therefore provide an API that holds the current state as immutable, to allow the app to make a consistent set of edits. This 'immutable state' is captured in the API as an interface. In principle this is similar to a scope – a way of bounding operations.

The way this is arranged in m-ld is to have the scope-like MeldState itself express the data operations. You obtain an immutable state by calling a method on the main clone (store-like) object, which takes a callback argument.

The significant idea is that the clone/store itself implements the data operations too, for simple use-cases. So you have the choice whether to just make individual operations on the mutable clone, or use an immutable state.

Just a thought. The current API makes sense and seems very usable.

@gsvarovsky if I understand correctly, what you're describing is similar to LevelDB's snapshotting feature, which some implementations of AbstractLevelDOWN (such as leveldown) use when iterating through the store to provide consistent reads.

Very cool to see that you've replicated such a feature at the application level and in a distributed manner! Thank you for mentioning it, this might come in useful in the future. For the time being, I'm happy to piggy-back on the AbstractLevelDOWN API.

Published in version 7.3.0!