Dimension Node Join Link + Dimension Alias/Reference Link
shangyian opened this issue · comments
When linking dimensions, there are two types of links: join links and alias/reference links.
Join Link
You can configure a join link between any table/view-like node (source, transform, dimension) and a dimension node. Configuring this join link will make it so that all dimension attributes on the dimension node are available for the original node.
An example of a dimension join link:
erDiagram
"default.fact_transform" {
long user_id
int country_id
long event_secs
long event_ts
}
"default.country_dim" {
int id PK
str name
long population
}
"default.fact_transform" ||--o{ "Dimension Join Link" : "linked via"
"default.country_dim" ||--o{ "Dimension Join Link" : "linked via"
"Dimension Join Link" {
str join_on "default.fact_transform.country_id = default.country_dim.id"
enum join_type "LEFT"
str role "event_country"
}
- In most cases, the join
ON
clause will just be equality comparisons between the primary key and foreign key columns of the original node and the dimension node. More complex join clauses can be configured if desired. - While we offer the ability for the user to specify
RIGHT
,LEFT
orINNER
joins, in practice when generating SQL, allRIGHT
joins will be recast asLEFT
joins (for performance reasons).
Alias/Reference Link
You can configure a dimension alias/reference between a particular column on a table/view-like node (source, transform, dimension) and a column on a dimension node. An example:
erDiagram
"default.fact_transform" {
long user_id
string country_name
long event_secs
long event_ts
}
"default.country_dim" {
int id PK
str name
long population
}
"default.fact_transform" ||--o{ "Dimension Alias/Ref": ""
"default.country_dim" ||--o{ "Dimension Alias/Ref" : ""
"Dimension Alias/Ref" {
str column "default.fact_transform.country_name"
str dimension_column "default.country_dim.name"
str role "event_country"
}
In this case, configuring a reference between default.fact_transform.country_name
and default.country.name
will indicate that the semantic meaning behind default.fact_transform.country_name
refers to the default.country
dimension's name
field. No join is required here, unless we explicitly create a dimension join between the two nodes.
When we have monitoring capabilities, we'll be able to enforce dimension conformity.
Proposed Changes
Add a new backend endpoint for dimension aliasing/referencing
This endpoint can be done at the column-level, as in, for a given node column, the user can tell DJ if the column is meant to reference a particular dimension attribute.
POST /nodes/{node_name}/columns/{column_name}/alias
{
"dimension_node": "<dimension node>",
"dimension_column": "<column on dim node>"
}
In the database, we can store this using the dimension_id
and dimension_column
fields on the current column's table. This was previously used for join links, which we now represent via the dimensionlink
table in the database.
Expose the join link and the alias link functionalities in the UI
The join link vs alias link functionalities should be exposed separately in the UI, so that it is clearer to the user which type of link they are creating.
Alias links should probably continue to live under the Linked Dimensions column:
Join links can have a separate "create" button from elsewhere in the UI.
This makes sense. I just have some reservations about the following statement, but let's talk about it before I spell out my concerns:
While we offer the ability for the user to specify RIGHT, LEFT or INNER joins, in practice when generating SQL, all RIGHT joins will be recast as LEFT joins (for performance reasons).
To clarify, when the user specifies RIGHT JOIN
, we'll be swapping the tables from the left hand side to the right hand side and doing a LEFT JOIN
, but in practice this is the same as a right join. This is mainly for performance optimization purposes, where some engines aren't optimized for right joins.