terminusdb / terminusdb-store

a tokio-enabled data store for triple data

Support wildcard and superfluous deletes

matko opened this issue

Currently, deletes are per-triple. Deleting all triples for a particular subject therefore means querying that subject's triples and deleting each one individually. Besides being a cumbersome way of deleting, this duplicates all the data being deleted, and we lose semantic value, because we do not record the fact that we deleted everything related to that subject.

I propose we implement wildcard deletes. Wildcard deletes would assert that for a particular subject, predicate, or object, or for a particular subject-predicate or predicate-object pair, all triples have been deleted in a particular layer.
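
To make the five proposed shapes concrete, here is a minimal sketch of how a wildcard delete could be modelled and matched against a triple. `Triple`, `WildcardDelete`, and `matches` are hypothetical names chosen for illustration, not existing terminusdb-store types, and the sketch assumes triples are stored as numeric dictionary ids.

```rust
/// A triple of numeric ids (assumed representation, for illustration only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Triple {
    pub subject: u64,
    pub predicate: u64,
    pub object: u64,
}

/// Hypothetical wildcard delete patterns, one variant per case proposed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum WildcardDelete {
    Subject(u64),
    Predicate(u64),
    Object(u64),
    SubjectPredicate(u64, u64),
    PredicateObject(u64, u64),
}

impl WildcardDelete {
    /// Returns true if this wildcard covers the given triple.
    pub fn matches(&self, t: &Triple) -> bool {
        match *self {
            WildcardDelete::Subject(s) => t.subject == s,
            WildcardDelete::Predicate(p) => t.predicate == p,
            WildcardDelete::Object(o) => t.object == o,
            WildcardDelete::SubjectPredicate(s, p) => t.subject == s && t.predicate == p,
            WildcardDelete::PredicateObject(p, o) => t.predicate == p && t.object == o,
        }
    }
}
```

With this shape, a layer would store one small record per wildcard instead of one removal per matching triple.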

Additionally, it should be possible to insert such a wildcard delete even if there's no previous insert that'd match the wildcard. This would allow us to shortcut many queries searching for data by specifying at a certain layer that this data will not be found no matter how deep the query drills into the layer stack.
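
The shortcut described here can be sketched as a lookup that walks the layer stack from the top and stops at the first layer that settles the answer. This is only an illustration under stated assumptions: `Layer`, its fields, and `triple_exists` are hypothetical, only subject wildcards are modelled, and additions in a layer are assumed to take precedence over a wildcard delete in the same layer (so that "wildcard delete, then re-insert" works within one layer).

```rust
use std::collections::HashSet;

/// (subject, predicate, object) as numeric ids; an assumed representation.
pub type Triple = (u64, u64, u64);

/// Hypothetical in-memory layer: per-triple changes plus subject wildcards.
#[derive(Debug, Default)]
pub struct Layer {
    pub additions: HashSet<Triple>,
    pub removals: HashSet<Triple>,
    /// Subjects for which this layer asserts "everything below is deleted".
    pub deleted_subjects: HashSet<u64>,
}

/// Looks up a triple in a stack ordered from newest (index 0) to oldest.
/// Returns whether the triple exists and how many layers were inspected.
pub fn triple_exists(stack: &[Layer], triple: Triple) -> (bool, usize) {
    for (depth, layer) in stack.iter().enumerate() {
        let visited = depth + 1;
        // An addition in this layer wins over a wildcard in the same layer.
        if layer.additions.contains(&triple) {
            return (true, visited);
        }
        // A per-triple removal or a matching wildcard ends the search here,
        // whether or not any lower layer ever inserted a matching triple.
        if layer.removals.contains(&triple) || layer.deleted_subjects.contains(&triple.0) {
            return (false, visited);
        }
    }
    (false, stack.len())
}
```

A query for a subject covered by a wildcard in the top layer inspects one layer regardless of stack depth, while a query for an unrelated subject still has to walk the whole stack.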

I'm not sure if we can implement wildcard deletes with the present structures or if a new set of structures will be needed.

Reasoning from terminusdb, the most interesting wildcard deletes would be subject, object, and subject-predicate.

  • When deleting a particular document, we always want to insert a wildcard delete for that document id as subject and a wildcard delete for that document id as object.
  • When replacing a particular document, we want to insert a wildcard delete for that document id as subject, followed by inserts for the whole document. Any existing triples with this document in the object position remain valid, however.
  • When replacing or deleting a particular property of a document, we want to insert a wildcard delete for the subject-predicate pair.
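
The three cases above could translate into layer operations roughly as follows. This is a sketch, not an existing API: `Op` and the three helper functions are hypothetical names, and ids are assumed to be plain `u64` values.

```rust
/// Hypothetical operations recorded in a new layer.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Op {
    DeleteSubject(u64),
    DeleteObject(u64),
    DeleteSubjectPredicate(u64, u64),
    Insert(u64, u64, u64),
}

/// Deleting a document: wipe it both as subject and as object.
pub fn delete_document(doc: u64) -> Vec<Op> {
    vec![Op::DeleteSubject(doc), Op::DeleteObject(doc)]
}

/// Replacing a document: wipe it as subject only, then insert the new
/// (predicate, object) pairs. Triples pointing at the document stay valid.
pub fn replace_document(doc: u64, fields: &[(u64, u64)]) -> Vec<Op> {
    let mut ops = vec![Op::DeleteSubject(doc)];
    ops.extend(fields.iter().map(|&(p, o)| Op::Insert(doc, p, o)));
    ops
}

/// Replacing a property: wipe the subject-predicate pair, then insert the
/// new values. Passing an empty slice deletes the property outright.
pub fn replace_property(doc: u64, predicate: u64, objects: &[u64]) -> Vec<Op> {
    let mut ops = vec![Op::DeleteSubjectPredicate(doc, predicate)];
    ops.extend(objects.iter().map(|&o| Op::Insert(doc, predicate, o)));
    ops
}
```

In each case the number of operations depends only on the new data being written, not on how many triples the old document or property had.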

Less useful:

  • Wildcard predicate-object deletion has limited use, such as when deleting all objects of a particular type. This may also come in handy when changing the key strategy of a type, which requires all ids of that type to be regenerated. Any queries that need to list all objects of a particular type can then know to look no further than a particular layer.
  • Wildcard deletes on predicates alone are the least useful. We hardly ever query on predicates alone, so even if we deleted all triples of a particular property, the wildcard delete would not speed anything up. Furthermore, this is simply not a delete we ever do semantically.

I think this is a really good idea. When looking up change information, however, we'll have to look deeper down the layer stack to find the answer, but since that is likely to be a lot less common (and a lot more focused), I think this makes sense. It should speed up both searching for document structure (by allowing us to know for sure that nothing is lower in the stack) and deletion.

For the eventual use of content-addressable hashing, will we need some encoding for the stars, or do we look up the answer for the delta and use that? I'm thinking that there is no advantage to the latter, but maybe I'm missing something.