DataJunction / dj

A metrics platform.

Home Page: http://datajunction.io

Dimension Node Join Link + Dimension Alias/Reference Link

shangyian opened this issue

When linking dimensions, there are two types of links: join links and alias/reference links.

Join Link

You can configure a join link between any table/view-like node (source, transform, dimension) and a dimension node. Configuring a join link makes all dimension attributes on the dimension node available on the original node.

An example of a dimension join link:

erDiagram
    "default.fact_transform" {
        long user_id 
        int country_id
        long event_secs
        long event_ts
    }
   "default.country_dim" {
        int id PK
        str name
        long population
    }
   "default.fact_transform" ||--o{ "Dimension Join Link" : "linked via"
   "default.country_dim" ||--o{ "Dimension Join Link" : "linked via"

   "Dimension Join Link" {
        str join_on "default.fact_transform.country_id = default.country_dim.id"
        enum join_type "LEFT"
        str role "event_country"
    }
  • In most cases, the join ON clause will just be an equality comparison between the foreign key column(s) on the original node and the primary key column(s) on the dimension node. More complex join clauses can be configured if desired.
  • While we offer the ability for the user to specify RIGHT, LEFT or INNER joins, in practice when generating SQL, all RIGHT joins will be recast as LEFT joins (for performance reasons).
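
As a rough illustration of the join link described above, here is a minimal Python sketch. The `DimensionJoinLink` class, the `JoinType` enum, and the `equality_join_on` helper are hypothetical names made up for this example, not DJ's actual models; only the field names (`join_on`, `join_type`, `role`) and the example values come from the diagram.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class JoinType(str, Enum):
    LEFT = "LEFT"
    RIGHT = "RIGHT"
    INNER = "INNER"


@dataclass(frozen=True)
class DimensionJoinLink:
    """Hypothetical in-memory model of a join link (not DJ's real schema)."""

    node: str                      # the table/view-like node being linked
    dimension: str                 # the dimension node it links to
    join_on: str                   # SQL ON clause
    join_type: JoinType = JoinType.LEFT
    role: Optional[str] = None


def equality_join_on(node: str, fk: str, dimension: str, pk: str) -> str:
    """Build the common-case ON clause: foreign key equals primary key."""
    return f"{node}.{fk} = {dimension}.{pk}"


link = DimensionJoinLink(
    node="default.fact_transform",
    dimension="default.country_dim",
    join_on=equality_join_on(
        "default.fact_transform", "country_id", "default.country_dim", "id"
    ),
    join_type=JoinType.LEFT,
    role="event_country",
)
```

A more complex join clause would simply be passed as a different `join_on` string; the helper only covers the simple equality case.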

Alias/Reference Link

You can configure a dimension alias/reference between a particular column on a table/view-like node (source, transform, dimension) and a column on a dimension node. An example:

erDiagram
    "default.fact_transform" {
        long user_id 
        string country_name
        long event_secs
        long event_ts
    }
   "default.country_dim" {
        int id PK
        str name
        long population
    }
   "default.fact_transform" ||--o{ "Dimension Alias/Ref": ""
   "default.country_dim" ||--o{ "Dimension Alias/Ref" : ""

   "Dimension Alias/Ref" {
        str column "default.fact_transform.country_name"
        str dimension_column "default.country_dim.name"
        str role "event_country"
    }

In this case, configuring a reference between default.fact_transform.country_name and default.country_dim.name indicates that default.fact_transform.country_name semantically refers to the name field of the default.country_dim dimension. No join is required here, unless we explicitly create a dimension join link between the two nodes.
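
To make the "no join required" point concrete, here is a small Python sketch of how an alias link could be resolved. The `ALIAS_LINKS` registry and `resolve_dimension` function are hypothetical and exist only for illustration; they are not part of DJ.

```python
from typing import Dict, Optional

# Hypothetical registry of alias links, keyed by node name.
# Each entry maps a dimension attribute to the node column that references it.
ALIAS_LINKS: Dict[str, Dict[str, str]] = {
    "default.fact_transform": {
        "default.country_dim.name": "default.fact_transform.country_name",
    },
}


def resolve_dimension(node: str, dimension_attribute: str) -> Optional[str]:
    """Return the node's own column that stands in for the dimension
    attribute, or None if no alias link exists (a join link would then
    be needed to reach the attribute)."""
    return ALIAS_LINKS.get(node, {}).get(dimension_attribute)


# The aliased attribute is served straight from the fact node, with no join.
local_column = resolve_dimension("default.fact_transform", "default.country_dim.name")

# population has no alias, so it is only reachable through a join link.
missing = resolve_dimension("default.fact_transform", "default.country_dim.population")
```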

When we have monitoring capabilities, we'll be able to enforce dimension conformity.

Proposed Changes

Add a new backend endpoint for dimension aliasing/referencing

This endpoint operates at the column level: for a given node column, the user tells DJ which dimension attribute the column is meant to reference.

POST /nodes/{node_name}/columns/{column_name}/alias
{
   "dimension_node": "<dimension node>",
   "dimension_column": "<column on dim node>"
}

In the database, we can store this using the existing dimension_id and dimension_column fields on the column table. These fields were previously used for join links, which are now represented by the dimensionlink table.
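
For reference, a client call to the proposed endpoint might be assembled as in the Python sketch below. The path and body fields follow the proposal above; the `build_alias_request` helper is a hypothetical name, and since the endpoint is only proposed here, the sketch builds the request without sending it.

```python
import json
from typing import Tuple
from urllib.parse import quote


def build_alias_request(
    node_name: str,
    column_name: str,
    dimension_node: str,
    dimension_column: str,
) -> Tuple[str, str, str]:
    """Return (method, path, JSON body) for the proposed alias endpoint."""
    # Percent-encode path segments so unusual names cannot break the URL.
    path = (
        f"/nodes/{quote(node_name, safe='')}"
        f"/columns/{quote(column_name, safe='')}/alias"
    )
    body = json.dumps(
        {
            "dimension_node": dimension_node,
            "dimension_column": dimension_column,
        }
    )
    return "POST", path, body


method, path, body = build_alias_request(
    node_name="default.fact_transform",
    column_name="country_name",
    dimension_node="default.country_dim",
    dimension_column="name",
)
```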

Expose the join link and the alias link functionalities in the UI

The join link vs alias link functionalities should be exposed separately in the UI, so that it is clearer to the user which type of link they are creating.

Alias links should probably continue to live under the Linked Dimensions column.

Join links can have a separate "create" button elsewhere in the UI.

This makes sense. I just have some reservations about the following statement, but let's talk about it before I spell out my concerns:

While we offer the ability for the user to specify RIGHT, LEFT or INNER joins, in practice when generating SQL, all RIGHT joins will be recast as LEFT joins (for performance reasons).

To clarify, when the user specifies a RIGHT JOIN, we'll swap the two tables (the left-hand side becomes the right-hand side and vice versa) and emit a LEFT JOIN, which returns the same rows as the original right join. This is mainly a performance optimization, since some engines aren't optimized for right joins.
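
A minimal Python sketch of that rewrite, assuming a hypothetical `render_join` helper (DJ's actual SQL generation is not shown in this thread):

```python
def render_join(left: str, right: str, join_type: str, join_on: str) -> str:
    """Render a FROM/JOIN fragment, rewriting RIGHT joins as LEFT joins.

    `A RIGHT JOIN B ON cond` keeps every row of B, which is exactly what
    `B LEFT JOIN A ON cond` does, so swapping the sides preserves the result
    rows (only the column order of a bare SELECT * would differ).
    """
    join_type = join_type.upper()
    if join_type == "RIGHT":
        left, right = right, left
        join_type = "LEFT"
    return f"FROM {left} {join_type} JOIN {right} ON {join_on}"


on_clause = "default.fact_transform.country_id = default.country_dim.id"

# The user asked for a RIGHT join; the rendered SQL uses a LEFT join
# with the two tables swapped.
rewritten = render_join(
    "default.fact_transform", "default.country_dim", "RIGHT", on_clause
)

# LEFT and INNER joins pass through unchanged.
unchanged = render_join(
    "default.fact_transform", "default.country_dim", "INNER", on_clause
)
```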