typedb / typedb

TypeDB: the polymorphic database powered by types


TypeDB 3.0 Roadmap

flyingsilverfin opened this issue · comments

Problem to Solve

Here we collect the agreed list of changes and requirements that will go into the first version of TypeDB 3.0.

Changes

API

Driver

TypeQL

Value restriction:

Require further discussion:

Relation implementation

Changes proposed and rejected:

  • Immutable relations
  • #6770

Let's make sure each of them is documented properly in an issue, @flyingsilverfin

Yes @haikalpribadi that's what the colons are for :D I have to get to that next

Internal changes

A ? indicates not yet fully discussed.

TypeQL

  • Fix backslash escaping
  • ? Allow modifiers on match inside of a delete/insert query, to allow flexible query operations such as batching (see the sketch after this list)
  • ? Rename MatchQuery to GetQuery. To discuss: how this plays with the above
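
A minimal sketch of how the batching modifier proposed above might look; this is hypothetical syntax, not current TypeQL, and log-entry / created are made-up types:

# Hypothetical: 'limit' on the match of a delete is the proposal above,
# not existing TypeQL; 'log-entry' and 'created' are made-up types.
match
$log isa log-entry, has created $t;
$t < 2023-01-01T00:00:00;
limit 1000;
delete
$log isa log-entry;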

RPC

  • Replace the session ID with a long, instead of an inefficient vector

Pattern & Resolvables

  • Implement the query representation as a set of constraints that own variables, instead of the other way around
  • Remove the idea of Resolvables (e.g. Concludables) and merge them into Patterns

Concepts

  • The schema concept layer should cache more aggressively, in a more CPU-friendly format, various shortcuts such as the owned attribute types directly, without having to traverse through the supertypes as well. Additionally, we should store all schema-level data in flat sorted arrays (likely never exceeding about 100 MB even for the largest possible schemas) to optimise access.

Traversal & Reasoner

  • Rearchitect reasoner to manage its own memory
  • Push Concept to be the bottom layer of the database that traversals and the reasoner operate over. A graph and storage layer can still exist below that, but they should not be exposed
  • Convert explain into an explain() query that takes a query and bounds. Alternatively, we could just explain the existence of an inferred concept without a query? Also, convert explanations into something more native?
  • Handle negations in the traversal natively

Thank you guys for working on this and sharing it with us.
The two key features that I'm missing from this list, and whose absence severely limits our use cases, are:

Optionals/fetch

#6322
Including a way to have optional played roles as well, not only optional attributes.
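
As an illustration of the current state, a fetch subquery is a partial workaround for optional data, since an empty subquery does not drop the parent answer; the sketch below uses 2.x fetch syntax with hypothetical type names (person, employment):

match
$p isa person;
fetch
$p: name;
employment: {
    match
    (employee: $p, employer: $org) isa employment;
    fetch
    $org: name;
};

What is still missing, and what #6322 asks for, is being able to express the same optionality for role players directly inside a match.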

Vectors and ordered lists

I see they've been discarded :S, but there is no simple workaround for this:
#6327

Vectors

We don't need them as a particular attribute type; maybe a @sortable or @indexable annotation when defining relations would work.

Storing vectors in TypeDB is really hard, and mutating them (adding items at particular positions) in a performant way is close to impossible.

Ordered lists

This also includes ordered lists with repeated values, which are really hard to store in TypeDB.
For example: [1, 2, 2, 3, 7, 2] or ['blue', 'green', 'green', 'red']
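
For reference, the usual way to approximate this today is to model positions explicitly; the sketch below uses standard 2.x TypeQL with made-up type names, and also shows why mutation is painful:

# schema (define transaction)
define
position sub attribute, value long;
color sub attribute, value string;
color-list sub entity, plays list-entry:owner;
list-entry sub relation, relates owner, owns color, owns position;

# data (write transaction): the list ['blue', 'green', 'green']
insert
$l isa color-list;
(owner: $l) isa list-entry, has color 'blue', has position 0;
(owner: $l) isa list-entry, has color 'green', has position 1;
(owner: $l) isa list-entry, has color 'green', has position 2;

Inserting an element at an arbitrary position then means rewriting the position of every later entry, which is exactly the performance problem described above.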

Hi All,

After speaking to Haikal, there are good reasons to move from the Concept API to Fetch, particularly speed. At present it takes between 2 and 5 seconds to retrieve an object from TypeDB and transpile it to valid STIX JSON. This is mainly due to all of the network round trips that have to be made, so clearly one Fetch query will be more effective.

The advantage of our current system is that it is shape-based, so I can handle all JSON objects using the same ORM; the disadvantage is speed.

The new approach does mean a lot more code, since we have to build quite long Fetch statements for each individual object (e.g. 16-44 lines for each of our 85 objects), and then build the transpile code (from the returned Fetch JSON to STIX JSON). This figure assumes a single main object (4 lines) plus 3-11 optional sub-objects with relations (4 lines each), if we use the class hierarchy. But the benefit will be far greater speed, totally agreed.
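
For context, the shape of one of these Fetch statements might look roughly like the sketch below (2.x fetch syntax; indicator, stix-id, created-by and the other names are hypothetical stand-ins for the real schema):

match
$ind isa indicator, has stix-id 'indicator--0001';
fetch
$ind: stix-id, name, created, pattern;
created-by: {
    match
    (object: $ind, creator: $identity) isa created-by;
    fetch
    $identity: stix-id, name;
};

Multiplied across 85 object types, each with several optional sub-objects, this is where the 16-44 lines per object estimate comes from.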

We probably won't be able to make this move for some months, due to resourcing, but we agree it will be worth it. At the same time we can update our 2,500 lines of schema code to v3. This will place us in a good position to add on another 50-80 cybersecurity objects (e.g. SBOMs, Vulnerabilities, Risk, etc.)

Onwards and upwards for TypeDB and our cybersecurity application!!

I hope we will get the same tree structure for mutations. Batch mutations and optional mutations are currently a nightmare, while queries with fetch are so smooth.

A point of enhancement could be being able to use multiple match-fetch blocks in the same query, and the same for mutations, instead of having a single entry point.

This is possible in nested branches: we can open multiple of them and assign them to different keys, but it is not possible to have multiple keys at the root level.
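
To make that concrete, here is a sketch of what already works: several keyed subqueries under a single root match-fetch (2.x syntax, hypothetical type names); the request is to allow several such root-level blocks in one query:

match
$p isa person;
fetch
$p: name;
friends: {
    match
    (friend: $p, friend: $f) isa friendship;
    fetch
    $f: name;
};
employers: {
    match
    (employee: $p, employer: $c) isa employment;
    fetch
    $c: name;
};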

Another key conceptual blocking point in mutations for us is how cardinality MANY is handled. Whenever the match clause starts producing permutations, the insert / delete is run as if in a FOR loop.

This issue has an example of an insertion that is counterintuitively run N times:
#6902

In 3.0 I would love to see $vars being aware of their cardinality. The way that Dgraph executes this type of mutation is really intuitive: each variable holds an array of iids, so if a match does something like this

match
$jobPosition isa jobPosition, has id 'frontendDeveloper';
$candidate isa Person, has name 'Junior Peter';
$allInterviewers isa Person, has departMentName 'IT';

insert
$selectionProcess (candidate: $candidate, job: $jobPosition, interviewers: $allInterviewers) isa selectionProcess, has id 'selectionProcess1';

This would be run a single time and create a single selectionProcess, as expected.

Alternatives:
a) In order to keep FOR loops as they happen now, new loop syntax could be introduced, since those are the rarer cases.

b) Another alternative would be to indicate the cardinality of roles when defining the schema, so we know which things are treated as arrays (storing multiple iids in the $var) and which things follow the current behaviour.

c) Yet another alternative could be to explicitly declare array variables, for instance writing []allInterviewers isa .... instead of $allInterviewers isa .... (see the sketch below)
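
A sketch of how alternative (c) could read in full; the [] variable syntax is purely hypothetical, and the type names are taken from the example above:

match
$jobPosition isa jobPosition, has id 'frontendDeveloper';
$candidate isa Person, has name 'Junior Peter';
[]allInterviewers isa Person, has departMentName 'IT';

insert
# []allInterviewers would bind the whole set of matching iids, so exactly one
# selectionProcess is created instead of one per permutation.
$selectionProcess (candidate: $candidate, job: $jobPosition, interviewers: []allInterviewers) isa selectionProcess, has id 'selectionProcess1';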

v3.0 is looking awesome, but can you also detail TIME and GPS, please?

V3.0 is a pretty massive rewrite, and in fact we will probably re-engineer our schema, since originally Tomas adopted the Vaticle style guide and made all of the property names different from the TypeDB ones. The consequence of this is that Fetch statements must be super long to include every property, every sub-object and all of its properties. If the variable names are the same, then Fetch will be more powerful and concise.

Still, the powerful new capabilities of v3.0 make it worth this re-engineering, as long as TIME and GPS are sorted. Please provide architectural best practice for these two, thanks
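
In the meantime, one workaround sketch (my own assumption, not official best practice) is to use the existing datetime value type for TIME and a pair of double attributes for GPS; sighting and the attribute names below are hypothetical:

define
start-time sub attribute, value datetime;
latitude sub attribute, value double;
longitude sub attribute, value double;
sighting sub entity, owns start-time, owns latitude, owns longitude;

insert
$s isa sighting,
    has start-time 2024-01-15T09:30:00,
    has latitude 51.5072,
    has longitude -0.1276;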

So after a lot of thought I'm changing my wishlist priorities. My key needed feature is being able to share $vars between different streams. This would fix almost every issue we are facing with mutations, and it is something supported in most databases.

As an example:

startTx

    insert
        $book isa Book, has id 1;
    ---
    match
        $allAuthors isa Author;
    ---
    insert
        $authorship ($book, $allAuthors) isa Authorship;

endTx

Can LLM Vectors be stored and indexed?

This would be very useful: to store LLM vectors along with entities or relations. Can it be done using structs or lists somehow? LLMs are going to keep getting bigger, so this will have to be addressed at some stage. We need to connect TypeDB to natural-language meaning, which, in the case of LLMs, is a vector.
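
For what it's worth, one workaround with the current value types (an assumption on my part, not a roadmap feature) is to serialise the vector into a string attribute, which gives storage but no similarity search; document and embedding below are hypothetical names:

define
embedding sub attribute, value string;
document sub entity, owns embedding;

insert
# the serialised vector is opaque to TypeDB: no indexing, so any
# nearest-neighbour search has to happen client-side after fetching it
$d isa document, has embedding '[0.12, -0.33, 0.98]';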

It's not just vector storage, but the Approximate Nearest Neighbour search that is also required.