dbt-labs / dbt-redshift

dbt-redshift contains all of the code enabling dbt to work with Amazon Redshift

Home Page: https://getdbt.com


[ADAP-1040] Refreshing incremental materialized views takes longer than expected

cmcnicoll opened this issue

Is this a new bug in dbt-redshift?

  • I believe this is a new bug in dbt-redshift
  • I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

Refreshing materialized views manually is significantly faster than refreshing via dbt.

Expected Behavior

Refreshing via dbt is at least as fast as refreshing manually, especially when neither the underlying data nor the model configs have changed.

Steps To Reproduce

  1. Set project config.
# test_mv\dbt_project.yml
models:
  test_mv:
    +auto_refresh: false
    +backup: true
    +materialized: materialized_view
    +on_configuration_change: fail
  2. Create 8 models.
-- test_mv\models\example\model1.sql
select * from {{ source("qa", "source1") }}
  3. First dbt run --threads 8 to create incremental materialized views.
Finished running 8 materialized_view models in 0 hours 0 minutes and 44.23 seconds (44.23s).
  4. Second dbt run --threads 8 to do the initial refresh.
Finished running 8 materialized_view models in 0 hours 0 minutes and 32.66 seconds (32.66s).
  5. Third dbt run --threads 8 to get timing for refreshing via dbt.
Finished running 8 materialized_view models in 0 hours 0 minutes and 32.45 seconds (32.45s).
  6. In DBeaver, click SQL Editor, then click Execute queries in separate tabs.
refresh materialized view "skynet"."dbt_mv"."model1";
refresh materialized view "skynet"."dbt_mv"."model2";
refresh materialized view "skynet"."dbt_mv"."model3";
refresh materialized view "skynet"."dbt_mv"."model4";
refresh materialized view "skynet"."dbt_mv"."model5";
refresh materialized view "skynet"."dbt_mv"."model6";
refresh materialized view "skynet"."dbt_mv"."model7";
refresh materialized view "skynet"."dbt_mv"."model8";
Materialized view model1 is already up to date.
Materialized view model2 is already up to date.
Materialized view model3 is already up to date.
Materialized view model4 is already up to date.
Materialized view model5 is already up to date.
Materialized view model6 is already up to date.
Materialized view model7 is already up to date.
Materialized view model8 is already up to date.
Total time (ms): 4371

Relevant log output

No response

Environment

- OS: Windows 11 Pro
- Python: 3.11.5
- dbt-core: 1.7.2
- dbt-redshift: 1.7.0
- Cluster: test environment with no other users
- Node type: ra3.xlplus
- Number of nodes: 2

Additional Context

I'm evaluating whether dbt can be used for near real-time reporting using incremental materialized views in staging and core models.

Thanks for the submission @cmcnicoll! I think the two things you're comparing might not be exactly the same.

When a dbt run is executed with materialized views, a few things happen before a state-changing command is issued to the server. At a high level, the flow looks something like this (the terminal SQL for each branch is sketched just after the list):

  • Query the database to get a list of all objects in the target schema along with their types (table, view, materialized view)
  • Query the database to build out the dependency tree for ordering nodes in the model graph
  • For each model (materialized view in this case):
    • Does the object exist?
      • No > Create the materialized view
      • Yes > Is the existing object a materialized view?
        • No > Drop the existing object then create the materialized view
        • Yes > Both objects are materialized views; do they have the same configuration?
          • No > Apply updates via alter or apply a full refresh (depending on the type of configuration change)
          • Yes > Refresh the materialized view
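
To make those branches concrete, here is roughly the terminal statement each path ends in, using model1 from the example above (a sketch assuming the source resolves to "qa"."source1", not the exact SQL that dbt-redshift templates):

-- object does not exist yet: create it
create materialized view "skynet"."dbt_mv"."model1" as
select * from "qa"."source1";

-- existing object is not a materialized view: drop it, then create the materialized view
drop table "skynet"."dbt_mv"."model1";
create materialized view "skynet"."dbt_mv"."model1" as
select * from "qa"."source1";

-- configuration changed in a way that can be altered in place (e.g. auto refresh)
alter materialized view "skynet"."dbt_mv"."model1" auto refresh yes;

-- nothing changed: refresh only
refresh materialized view "skynet"."dbt_mv"."model1";

Only the last of these is what you ran manually in DBeaver; dbt has to do the investigative queries before it knows which branch applies.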

In order for dbt to know that all that's needed is a refresh statement, it has to do this investigative work first. When you ran the refresh statements in DBeaver, you had already implicitly verified that your project structure, model definitions, and configurations had not changed since the last run, so executing a refresh statement was the only required step. In other words, for dbt run the stopwatch starts at the first line of the logic above, whereas for manual execution in DBeaver the stopwatch doesn't start until the last line.

I agree that the 7x-8x jump is larger than I would have expected. I'll talk through my thought process here and see where we wind up. When working with test sets, oftentimes the bottleneck is not the size of the data, the local application, nor the server; it's the network time spent getting your command to the server. I'm going to assume that the network lag is somewhere around the 4s you're seeing for the refresh commands to return, since all the server did was determine that nothing needed to happen. That means a query costs a minimum of about 4s. Walking through the logic flow above, we need to run at least the following queries:

  • Caching query
  • Object dependency query
  • Materialized view configuration query
  • Refresh query

Let's assume I'm not forgetting anything (not a safe assumption). And I'll point out that these four queries run in series. That puts us at 17.5s (4.371s x 4) just in network lag time, leaving 15s (32.5s - 17.5s) for query execution of the first three queries (assume refresh == 0s). That feels a little higher than expected, but not alarming, especially considering that we have not accounted for any dbt work yet.
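
As a back-of-envelope check of that estimate (treating the observed 4.371s as the per-round-trip floor and the 32.45s third run as the total), the arithmetic can be run as a plain query:

-- rough estimate only: four serial round trips at ~4.371s each,
-- subtracted from the ~32.45s observed for the third dbt run
select
    4.371 * 4         as est_network_lag_s,  -- ~17.5s
    32.45 - 4.371 * 4 as est_remaining_s;    -- ~15s left for the first three queries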

At the moment, I'm not inclined to regard this as a bug. Hopefully this context helps. I'll leave this open in case you disagree though.

Thanks @mikealfare for the context! I appreciate you taking the time to write that out.

I wasn't sure if this should be a bug, feature or discussion. The main drivers for creating this issue are:

  1. Refreshing 200+ incremental materialized views takes approximately the same amount of time as creating the materialized views.
  2. Creating/refreshing materialized views does not run concurrently, unlike models using materialized="table".
  3. Using materialized views for near real-time reporting might not be possible yet, considering that materialized_view is a new feature.

Based on the context you provided, should this issue be changed to a feature request? For example, speeding up the investigative queries and/or being able to bypass them when trying to refresh every 5 minutes.

Or should this be moved to a discussion? I can expand on my use case, if that helps.

1. Refreshing 200+ incremental materialized views takes approximately the same amount of time as creating the materialized views.

In your post you have auto_refresh: false. Have you considered setting this to true? This would allow Redshift to refresh on a schedule, which should behave similarly to manually running the refresh statement (minus the network lag).
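
For reference, the +auto_refresh model config maps to Redshift's AUTO REFRESH clause. A minimal sketch of what that looks like on the server side (again assuming the source resolves to "qa"."source1"):

-- create a materialized view that Redshift refreshes on its own schedule
create materialized view "skynet"."dbt_mv"."model1"
auto refresh yes
as
select * from "qa"."source1";

-- or flip the setting on an existing materialized view
alter materialized view "skynet"."dbt_mv"."model1" auto refresh yes;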

2. Creating/refreshing materialized views does not run concurrently, unlike models using `materialized="table"`.

This is interesting. I'd like to look into why that's the case as it should run in parallel if there are no dependencies between the materialized views.

3. Using materialized views for near real-time reporting might not be possible yet, considering that `materialized_view` is a new feature.

Could you elaborate? Is this due to the time it takes to refresh, or something else?

Based on the context you provided, should this issue be changed to a feature request? For example, speeding up the investigative queries and/or being able to bypass them when trying to refresh every 5 minutes.

If 2 is true, I would argue that's a bug. The intention is to run as many models in parallel as we can.

I could see speeding up the investigative queries as being a performance enhancement down the road. Ideally that happens at the same time as building the cache. However, since only materialized views support on_configuration_change at the moment, we would be impacting all users for the benefit of only a subset. Once we implement on_configuration_change for views and tables, I think it makes sense to move that work earlier in the flow, which would improve performance.

If your intention is to bypass the investigative queries, I strongly suggest looking into turning on auto_refresh. Materialized views work more like views than tables. The only reason you would re-run a view in dbt is because the logic or configuration has changed. Hence dbt run is effectively a deploy step for views. By comparison, you would run a table whenever you want to deploy OR update the data in the table. The same is true of incremental or snapshot models. A materialized view works more like a view, but the data needs to be updated in it like an incremental model. The difference is that incremental models require a dbt run. The benefit of a materialized view is that you can tell the server to do the refresh step for you using auto_refresh. In that context, I would look at dbt run for materialized views as a deployment only step. It's not an efficient way to execute a refresh statement if that's all that's needed (as you've identified here).
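
One way to act on that distinction (a sketch of one possible pattern, not something discussed above; it assumes you keep artifacts from the previous production run for state comparison) is to let Redshift handle data refreshes via auto_refresh and reserve scheduled dbt runs for actual deploys:

# dbt_project.yml
models:
  test_mv:
    +materialized: materialized_view
    +auto_refresh: true

# scheduled job: only rebuild models whose SQL or config changed since the last deploy
# (the artifacts path is hypothetical; point --state at the previous run's target directory)
dbt run --select state:modified --state path/to/prod_artifacts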

Thanks again @mikealfare for the detailed response! Here is a quick overview of my use case.

  • Package config:
+auto_refresh: false
+backup: true
+materialized: materialized_view
  • Union identically-structured data sources:
+auto_refresh: true
+tags:
  - mv_auto_refresh
Auto-refresh is not supported for materialized views defined on other materialized views (see the sketch after this list).
  • Staging/Intermediate/Core:
    • Use an incremental model when query limitations prevent use of an incremental materialized view.
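
The auto-refresh limitation called out above shows up directly in the DDL; a sketch with hypothetical source and model names:

-- first layer: union identically-structured sources; auto refresh is allowed here
create materialized view "skynet"."dbt_mv"."union_sources"
auto refresh yes
as
select * from "qa"."source1"
union all
select * from "qa"."source2";

-- second layer: a materialized view defined on another materialized view;
-- per the limitation above, AUTO REFRESH YES is not supported for this one
create materialized view "skynet"."dbt_mv"."stg_model"
as
select * from "skynet"."dbt_mv"."union_sources";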

And here are some timings from elementary-data/elementary:

  • Create materialized views: median_execution_time = 22.7
  • Refresh materialized views (--exclude tag:mv_auto_refresh): median_execution_time = 21.2
  • Drop dev schema and try +materialized: table instead: median_execution_time = 8.4
  2. Creating/refreshing materialized views does not run concurrently, unlike models using materialized="table".

Sorry, this is not true. When I check dbt.log, the investigative queries are indeed running in parallel. Each materialized view takes ~20s to complete, with a ~4s delay between them, which made it seem like they were running in series.

  3. Using materialized views for near real-time reporting might not be possible yet, considering that materialized_view is a new feature.

I was hoping that chaining incremental materialized views together would handle slowly changing dimensions and would be faster than incremental models. It seems like the best use case for incremental materialized views is for source models with auto_refresh on. Speeding up the investigative queries would allow incremental materialized views to compete with the speed of incremental models, which would help my use case.

Ah ok, the context helps a lot. Agreed that there are some significant limitations on materialized views, and they vary by platform. Based on your context and use cases, would it be fair to reclassify this into two enhancements? Namely:

  • Speed up investigative queries (perhaps by tying into the initial cache build)
  • Add configuration that would allow dbt to attempt a refresh first (defaulted to "no") and, upon failure, fall back to the existing workflow (a rough sketch follows below)
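
A rough sketch of what that second item could look like from a project's point of view; the refresh_first flag is purely hypothetical and does not exist in dbt-redshift today:

# dbt_project.yml (hypothetical config, for illustration only)
models:
  test_mv:
    +materialized: materialized_view
    +refresh_first: true  # try `refresh materialized view` immediately; on failure,
                          # fall back to the existing investigate-then-apply workflow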

Cc: @Fleid

Yes, that sounds good to me! Thanks!