dbt-labs / dbt-redshift

dbt-redshift contains all of the code enabling dbt to work with Amazon Redshift

Home Page: https://getdbt.com

[Feature] Datasharing: cross-database support

dataders opened this issue

Describe the feature

What

I think this is a direct successor to #217. There are many other related issues listed below.

Today, the only supported cross-database scenario in dbt-redshift is reading source tables that live in another database. There are considerable caveats to this narrow use case (see the sketch after this list):

  1. the cluster must be of node_type ra3 or be serverless
  2. the profiles.yml profile must specify ra3_node: true
  3. the source table is defined in a database different from the one given in the profile
  4. every model referencing the "foreign db" source table must be materialized as a table (views are not supported)
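
For example, a model over such a cross-database source might look like the following minimal sketch (the source name bar_crm and its orders table are hypothetical; the source itself would be declared with database: BAR in its YAML definition):

```sql
-- models/stg_orders.sql (hypothetical model)
-- must be a table: views over cross-database sources are not supported
{{ config(materialized='table') }}

select *
from {{ source('bar_crm', 'orders') }}
```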

There is a newer feature, still in preview, Datasharing, that allows writes to external databases. For example, if I am logged into the FOO database on cluster FIZZ, a configuration exists such that I can create tables within database BAR on cluster BUZZ (and vice versa).
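
Roughly, the setup looks like the sketch below. The share and schema names are placeholders, and because the write capability is still in preview, the exact statements and grants may differ (e.g. the consumer may need CREATE DATABASE ... WITH PERMISSIONS):

```sql
-- on producer cluster BUZZ: expose a schema of database BAR
CREATE DATASHARE bar_share;
ALTER DATASHARE bar_share ADD SCHEMA analytics;
GRANT USAGE ON DATASHARE bar_share TO NAMESPACE '<consumer-namespace-guid>';

-- on consumer cluster FIZZ: mount the share as a local database
CREATE DATABASE bar FROM DATASHARE bar_share OF NAMESPACE '<producer-namespace-guid>';

-- then, while connected to FOO on FIZZ, write via a three-part name
CREATE TABLE bar.analytics.my_table AS SELECT 1 AS id;
```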

Ideal end-state

A dbt-redshift project could take any model in the DAG and have it materialized in another database as simply as providing {{ config( database = 'BAR' ) }} at the top of the model, provided that the databases involved have been properly configured for Datasharing.
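
Concretely, a minimal sketch of such a model (the file and ref names are hypothetical):

```sql
-- models/mart_in_bar.sql (hypothetical)
{{ config(materialized='table', database='BAR') }}

select * from {{ ref('stg_orders') }}
```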

How

Challenges

The challenges in supporting this new feature are varied:

| problem | example |
| --- | --- |
| dbt-redshift inherits from dbt-postgres | redshift__create_schema() invokes postgres__create_schema(), which uses only two-part names (schema.relation), excluding the required database |
| dbt-redshift relies on Postgres metadata views that don't support Datashares | redshift__get_columns_in_relation() queries information_schema."columns" but should perhaps use SVV_ALL_COLUMNS instead |
| discrepancy across Redshift SKUs in Datashare support | ra3 and serverless support Datasharing/cross-db writes; dc2 clusters do not |
| discrepancy in performance between pg_* metadata tables and SVV_* datashare-supporting Redshift system tables | information_schema."columns" takes a few dozen milliseconds to return all column metadata in the current database; SVV_ALL_COLUMNS can be more than 50X slower |
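
To make the metadata gap concrete, here is a simplified sketch of the two lookups (the schema and table names are placeholders; the real macro selects more columns):

```sql
-- roughly what the adapter runs today (current database only):
select column_name, data_type
from information_schema."columns"
where table_schema = 'analytics'
  and table_name   = 'orders';

-- a datashare-aware alternative; note the extra database_name predicate:
select column_name, data_type
from svv_all_columns
where database_name = 'bar'
  and schema_name   = 'analytics'
  and table_name    = 'orders';
```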

Work required

Before the actual coding begins, it is imperative that we first learn the following:

  • what datasharing scenarios could feasibly be supported
  • the current & future best practice for fetching metadata
  • the performance impact of using SVV_* tables for all SKUs (ra3, serverless, dc2)

Possible work required

related

related issues

(eleven linked dbt-redshift issues, mostly labeled bug, enhancement, or ra3_node; titles and numbers not preserved here)

relevant AWS Redshift docs

Who will this benefit?

dbt-redshift users who want to use Datasharing to federate a dbt project across databases and clusters

Q: What data sharing scenarios could feasibly be supported?
There are a few different scenarios that could be supported. Here are some of the key ones:

  • Split ELT workloads to improve performance or reduce costs. Until now, only a single Redshift cluster or serverless workgroup (warehouse) could write to a database. As a result, customers often end up with very large warehouses sized for their peak capacity. Sometimes there's resource contention that makes it hard to meet deadlines; sometimes customers pay for more than they need. This feature lets customers split workloads across different warehouses, each tuned for its workload.
  • Split workloads for cost attribution. Sometimes different ETL workloads are primarily owned by different teams but all write to the same databases, e.g. an operations team ELTing order data from DynamoDB and a sales team ELTing data from their sales platform. By using different warehouses, they can split the costs easily.
  • Live collaboration. Sometimes teams collaborate on live data, e.g. regional sales teams all updating records in the same global Customer 360 table. This allows different regional sales teams with different warehouses in different accounts and regions to write to the same Customer 360 tables even though they otherwise have nothing to do with each other.

Q: What is the current & future best practice for fetching metadata?
Ideally, you leverage the redshift-connector metadata APIs and don't have to think about Redshift system tables. We plan to offer Metadata API support in line with our JDBC driver when we announce the General Availability of Multi-Warehouse Writes through Data Sharing. If this isn't possible because our APIs don't support your use cases, let's chat so we can figure out how to fix this. If we need to provide more info in our APIs, we need to provide more info in our APIs.

Q: What is the performance impact of using SVV_* tables for all SKUs (ra3, serverless, dc2)?
Again, ideally you leverage the redshift-connector metadata APIs and don't think about Redshift system tables. However, if you need to use them, a few recommendations:

  • Use svv_columns instead of svv_all_columns when querying a specific database's columns. If you can further narrow it to a Redshift table or an external table, query svv_redshift_columns or svv_external_columns respectively to improve performance further.
  • Make sure to include effective predicates on database_name/schema_name/table_name so you don't over-search and incur the unnecessary performance overhead of a broad lookup.
    But again, if you're having to do this, there may be a problem, so let's see what we need to do to fix this.
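
For instance, a sketch of the narrower lookup described above (all names are placeholders):

```sql
-- svv_redshift_columns covers Redshift (non-external) tables only
-- and is cheaper than svv_all_columns; keep the predicates tight:
select column_name, data_type
from svv_redshift_columns
where database_name = 'bar'
  and schema_name   = 'analytics'
  and table_name    = 'orders';
```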

re: metadata over the API, I think we're totally on the same page. We can discuss that more down the line.

If this isn't possible because our APIs don't support your use cases, let's chat so we can figure out how to fix this. If we need to provide more info in our APIs, we need to provide more info in our APIs.

What concerns me most is not the APIs lacking info. Rather, my issue is maintaining our users' performance expectations. @ryanwaldorf check out #741 (and my response).

with respect to workloads, I totally buy the business case for cross-cluster reads & writes. My concern is the perspective of a dbt end-user.

How a dbt user thinks about their dbt project

  • a dbt project is just a single directed, acyclic graph (DAG) where the nodes of the graph are models (tables/views)
  • the database of a model is just a configuration
  • any model or group of models can easily be configured to be materialized in another database

Mapping dbt perspective to Redshift Datasharing

If the Redshift Datasharing experience is to map directly to the expectations communicated by the dbt framework, then the boundary between models in different clusters has to feel as invisible as the boundary between models that share a database.

Based on how much of the below is met, we have the following options:

| Support level | Action |
| --- | --- |
| Full | No action; dbt assumptions hold for cross-database |
| Nearly complete (some caveats) | Heavily communicate caveats |
| Partial | Recommend dbt Mesh project separation |

One example caveat is that we already know that a view over a table in another db is not supported. This should be clearly documented and communicated (at least until it is supported).

If the full set of Datasharing cross-database capabilities results in an intra-dbt-project experience that is not "invisible", perhaps we instead take the stance that "best practice" is to mirror the database boundary as a dbt project boundary. This aligns well with the concept of dbt Mesh.

Expectations of full-support cross-database queries in dbt

For example, creating the following dependencies between objects within the same database is easy, but what if they're in distinct databases? Assuming that the administration to connect the clusters has already been done, which of the following are possible today with Datasharing?

  1. create a table that depends on a table in another cluster's database
  2. create a table that depends on a view in another cluster's database
  3. create a view that depends on a table in another cluster's database
  4. create a view that depends on a view in another cluster's database

Adding on to that:

  • what about temp tables?
  • what about altering the consumer objects after they've been created, e.g. to rename them, or to rename or recast columns?

Saw the issue with the driver performance over time. I can't speak to what happened with those previous releases, but we should be able to address this together going forward through better communication between our teams. All the key folks for this will be included in our Friday meeting.

If the Redshift Datasharing experience is to map directly to the expectations communicated by the dbt framework, then the boundary between models in different clusters has to feel as invisible as the boundary between models that share a database.
This is our intent. With the metadata changes we're working on, we should get there.

One example caveat is that we already know that a view over a table in another db is not supported. This should be clearly documented and communicated (at least until it is supported).
Not 100% true, since it's available in SVV_DATASHARE_OBJECTS. However, it requires additional work, hence why we're making the metadata changes. You shouldn't have to do this.
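
For example, a sketch of inspecting what a share exposes, including views (the share name is a placeholder):

```sql
-- list the objects a datashare exposes
select share_name, object_type, object_name
from svv_datashare_objects
where share_name = 'bar_share';
```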

For what's possible

  1. create a table that depends on a table in another cluster's database--possible, no restrictions
  2. create a table that depends on a view in another cluster's database--possible, no restrictions
  3. create a view that depends on a table in another cluster's database--late-binding views yes, standard views no
  4. create a view that depends on a view in another cluster's database--late-binding views yes, standard views no
  5. temp tables--shouldn't be any restrictions, but I may need a bit more clarity
  6. we don't have support for all ALTERs yet. Add/remove column and rename table/view/column will be possible at GA. Some others, like altering dist style, will take a bit longer. If you could tell me which ALTERs are most important so that we can make sure we're prioritizing right, that would be appreciated

3. create a view that depends on a table in another cluster's database--possible, no restrictions

I'm surprised to hear that this is supported because I literally got an error when I tried to do this. Will dig up the specific error message again.

wait, sorry, I messed up my answer. You did get an error; it needs to be a late-binding view. I updated my answer above and bolded the changes.
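
For reference, a minimal sketch of the consumer-side pattern that works today (all names are placeholders):

```sql
-- a standard view over a table in another cluster's database errors;
-- a late-binding view works:
create view analytics.v_orders as
select id, order_date
from bar.analytics.orders   -- table shared from the other cluster
with no schema binding;
```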