typedb / typedb

TypeDB: the polymorphic database powered by types


TypeDB 3.0 Roadmap

flyingsilverfin opened this issue · comments

Problem to Solve

Here we collect the agreed list of changes and requirements that will go into the first version of TypeDB 3.0.

Changes

API

Driver

TypeQL

Value restriction:

Require further discussion:

Relation implementation

Changes proposed and rejected:

  • Immutable relations
  • #6770

Let's make sure each of them is documented properly in an issue, @flyingsilverfin

Yes @haikalpribadi that's what the colons are for :D I have to get to that next

Internal changes

A ? indicates not yet fully discussed.

TypeQL

  • Fix backslash escaping
  • ? Allow modifiers on match inside of a delete/insert query, to allow flexible query operations such as batching (see the sketch after this list)
  • ? Rename MatchQuery to GetQuery. To discuss: how this plays with the above
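
A minimal sketch of how the batching modifier proposed above might look; this is hypothetical syntax, not current TypeQL, and log-entry / created are made-up types:

# Hypothetical: 'limit' on the match of a delete is the proposal above,
# not existing TypeQL; 'log-entry' and 'created' are made-up types.
match
$log isa log-entry, has created $t;
$t < 2023-01-01T00:00:00;
limit 1000;
delete
$log isa log-entry;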

RPC

  • Replace the session ID with a long, instead of an inefficient vector

Pattern & Resolvables

  • Implement the query representation as a set of constraints that own variables, instead of the other way around
  • Remove the idea of Resolvables (e.g. Concludables) and merge them into Patterns

Concepts

  • The schema concept layer should cache more aggressively, in a more CPU-friendly format, various shortcuts such as the owned attribute types directly, without having to traverse through the supertypes as well. Additionally, we should store all schema-level data in flat sorted arrays (likely never exceeding about 100 MB even for the largest possible schemas) to optimise access.

Traversal & Reasoner

  • Rearchitect reasoner to manage its own memory
  • Push Concept to be the bottom layer of the database that traversals and the reasoner operate over. A graph and storage layer can still exist below that, but they should not be exposed
  • Convert explain into an explain() query that takes a query and bounds. Alternatively, we could just explain the existence of an inferred concept without a query? Also, convert explanations into something more native?
  • Handle negations in the traversal natively

Thank you guys for working on this and sharing it with us.
The two key features that I'm missing from this list, and whose absence severely limits our use cases, are:

Optionals/fetch

#6322
Including a way to have optional played roles as well, not only optional attributes.
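
As an illustration of the current state, a fetch subquery is a partial workaround for optional data, since an empty subquery does not drop the parent answer; the sketch below uses 2.x fetch syntax with hypothetical type names (person, employment):

match
$p isa person;
fetch
$p: name;
employment: {
    match
    (employee: $p, employer: $org) isa employment;
    fetch
    $org: name;
};

What is still missing, and what #6322 asks for, is being able to express the same optionality for role players directly inside a match.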

Vectors and ordered lists

I see they've been discarded :S, but there is no simple workaround for this:
#6327

Vectors

We don't need them as a particular attribute type; maybe a @sortable or @indexable annotation when defining relations would work.

Storing vectors in TypeDB is really hard, and mutating them (adding items at particular positions) in a performant way is close to impossible.

Ordered lists

This also includes ordered lists with repeated values, which are really hard to store in TypeDB.
For example: [1, 2, 2, 3, 7, 2] or ['blue', 'green', 'green', 'red']
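
For reference, the usual way to approximate this today is to model positions explicitly; the sketch below uses standard 2.x TypeQL with made-up type names, and also shows why mutation is painful:

# schema (define transaction)
define
position sub attribute, value long;
color sub attribute, value string;
color-list sub entity, plays list-entry:owner;
list-entry sub relation, relates owner, owns color, owns position;

# data (write transaction): the list ['blue', 'green', 'green']
insert
$l isa color-list;
(owner: $l) isa list-entry, has color 'blue', has position 0;
(owner: $l) isa list-entry, has color 'green', has position 1;
(owner: $l) isa list-entry, has color 'green', has position 2;

Inserting an element at an arbitrary position then means rewriting the position of every later entry, which is exactly the performance problem described above.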

Hi All,

After speaking to Haikal, there are good reasons to move from the Concept API to Fetch, particularly speed. At present it takes between 2 and 5 seconds to retrieve an object from TypeDB and transpile it to valid STIX JSON. This is mainly due to all of the network round trips that have to be made, so clearly one Fetch query will be more effective.

The advantage of our current system is that it is shape-based, so I can handle all JSON objects using the same ORM; the disadvantage is speed.

The new approach does mean a lot more code, since we have to build quite long Fetch statements for each individual object (e.g. 16-44 lines for each of our 85 objects), and then build the transpile code (from the returned Fetch JSON to STIX JSON). This figure assumes a single main object (4 lines) plus 3-11 optional sub-objects with relations (4 lines each), if we use the class hierarchy. But the benefit will be far greater speed, totally agreed.
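
For context, the shape of one of these Fetch statements might look roughly like the sketch below (2.x fetch syntax; indicator, stix-id, created-by and the other names are hypothetical stand-ins for the real schema):

match
$ind isa indicator, has stix-id 'indicator--0001';
fetch
$ind: stix-id, name, created, pattern;
created-by: {
    match
    (object: $ind, creator: $identity) isa created-by;
    fetch
    $identity: stix-id, name;
};

Multiplied across 85 object types, each with several optional sub-objects, this is where the 16-44 lines per object estimate comes from.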

We probably won't be able to make this move for some months, due to resourcing, but we agree it will be worth it. At the same time we can update our 2,500 lines of schema code to v3. This will place us in a good position to add on another 50-80 cybersecurity objects (e.g. SBOMs, Vulnerabilities, Risk, etc.)

Onwards and upwards for TypeDB and our cybersecurity application!!

I hope we will get the same tree structure for mutations. Batch mutations and optional mutations are currently a nightmare, while queries with fetch are so smooth.

A point of enhancement could be being able to use multiple match-fetch blocks in the same query, and the same for mutations, instead of having a single entry point.

This is possible in nested branches: we can open multiple of them and assign them to different keys, but it is not possible to have multiple keys at the root level.
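
To make that concrete, here is a sketch of what already works: several keyed subqueries under a single root match-fetch (2.x syntax, hypothetical type names); the request is to allow several such root-level blocks in one query:

match
$p isa person;
fetch
$p: name;
friends: {
    match
    (friend: $p, friend: $f) isa friendship;
    fetch
    $f: name;
};
employers: {
    match
    (employee: $p, employer: $c) isa employment;
    fetch
    $c: name;
};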

Another key conceptual blocking point in mutations for us is how cardinality MANY is handled. Whenever the match clause starts producing permutations, the insert / delete is run as if in a FOR loop.

This issue has an example of an insertion that is counterintuitively run N times:
#6902

In 3.0 I would love to see $vars being aware of their cardinality. The way that Dgraph executes this type of mutation is really intuitive: each variable holds an array of iids, so if a match does something like this

match
$jobPosition isa jobPosition, has id 'frontendDeveloper';
$candidate isa Person, has name 'Junior Peter';
$allInterviewers isa Person, has departMentName 'IT';

insert
$selectionProcess (candidate: $candidate, job: $jobPosition, interviewers: $allInterviewers) isa selectionProcess, has id 'selectionProcess1';

This would be run a single time and create a single selectionProcess, as expected.

Alternatives:
a) In order to keep FOR loops as they happen now, new loop syntax could be introduced, since those are the rarer cases.

b) Another alternative would be to indicate the cardinality of roles when defining the schema, so we know which things are treated as arrays (storing multiple iids in the $var) and which things follow the current behaviour.

c) Yet another alternative could be to explicitly declare array variables, for instance writing []allInterviewers isa .... instead of $allInterviewers isa .... (see the sketch below)
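
A sketch of how alternative (c) could read in full; the [] variable syntax is purely hypothetical, and the type names are taken from the example above:

match
$jobPosition isa jobPosition, has id 'frontendDeveloper';
$candidate isa Person, has name 'Junior Peter';
[]allInterviewers isa Person, has departMentName 'IT';

insert
# []allInterviewers would bind the whole set of matching iids, so exactly one
# selectionProcess is created instead of one per permutation.
$selectionProcess (candidate: $candidate, job: $jobPosition, interviewers: []allInterviewers) isa selectionProcess, has id 'selectionProcess1';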

v3.0 is looking awesome, but can you also detail TIME and GPS, please?

V3.0 is a pretty massive rewrite, and in fact we will probably re-engineer our schema, since originally Tomas adopted the Vaticle style guide and made all of the property names different from the TypeDB ones. The consequence of this is that Fetch statements must be super long to include every property, every sub-object and all of its properties. If the variable names are the same, then Fetch will be more powerful and concise.

Still, the powerful new capabilities of v3.0 make it worth this re-engineering, as long as TIME and GPS are sorted. Please provide architectural best practice for these two, thanks
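
In the meantime, one workaround sketch (my own assumption, not official best practice) is to use the existing datetime value type for TIME and a pair of double attributes for GPS; sighting and the attribute names below are hypothetical:

define
start-time sub attribute, value datetime;
latitude sub attribute, value double;
longitude sub attribute, value double;
sighting sub entity, owns start-time, owns latitude, owns longitude;

insert
$s isa sighting,
    has start-time 2024-01-15T09:30:00,
    has latitude 51.5072,
    has longitude -0.1276;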

So after a lot of thought I'm changing my wishlist priorities. My key needed feature is being able to share $vars between different streams. This would fix almost every issue we are facing with mutations, and it is something supported in most databases.

As an example:

startTx

    insert
        $book isa Book, has id 1;
    ---
    match
        $allAuthors isa Author;
    ---
    insert
        $authorship ($book, $allAuthors) isa Authorship;

endTx

Can LLM Vectors be stored and indexed?

This would be very useful: to store LLM vectors along with entities or relations. Can it be done using structs or lists somehow? LLMs are going to keep getting bigger, so this will have to be addressed at some stage. We need to connect TypeDB to natural-language meaning, which, in the case of LLMs, is a vector.
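
For what it's worth, one workaround with the current value types (an assumption on my part, not a roadmap feature) is to serialise the vector into a string attribute, which gives storage but no similarity search; document and embedding below are hypothetical names:

define
embedding sub attribute, value string;
document sub entity, owns embedding;

insert
# the serialised vector is opaque to TypeDB: no indexing, so any
# nearest-neighbour search has to happen client-side after fetching it
$d isa document, has embedding '[0.12, -0.33, 0.98]';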

It's not just vector storage, but the Approximate Nearest Neighbour search that is also required.