astronomer / astronomer-cosmos

Run your dbt Core projects as Apache Airflow DAGs and Task Groups with a few lines of code

Home Page: https://astronomer.github.io/astronomer-cosmos/

[Bug] Caching filling up Airflow nodes disk

tatiana opened this issue

Astronomer Cosmos Version

1.4.0 - 1.4.2

If "Other Astronomer Cosmos version" selected, which one?

1.4.2

dbt-core version

Any

Versions of dbt adapters

Any

LoadMode

DBT_LS

ExecutionMode

LOCAL

InvocationMode

None

Airflow version

Any

Operating System

Any

If you think it's a UI issue, what browsers are you seeing the problem on?

No response

Deployment

Official Apache Airflow Helm Chart

Deployment details

Any

What happened?

When #904 was implemented, we started storing partial_parse.msgpack on the local disk per DbtDag or DbtTaskGroup. At the time, we considered caching per dbt project instead, but there were concerns about concurrent tasks/DAGs accessing and changing the same cache, which would result in the cache being purged often and the overall performance improvement not being good enough.

During the implementation, we noticed that partial_parse.msgpack relies on the manifest file - it is not self-sufficient on its own. For this reason, we also started caching the manifest.json file alongside the partial_parse.msgpack file.
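
As an illustration of why each group contributes its own pair of files, here is a minimal sketch in Python of per-group artifact caching. This is not Cosmos's actual implementation; cache_partial_parse, CACHE_ROOT, and the directory layout are hypothetical names chosen for the example.

# Hypothetical sketch: one cache directory per DbtDag / DbtTaskGroup, each
# holding partial_parse.msgpack plus the manifest.json it depends on.
import shutil
import tempfile
from pathlib import Path

CACHE_ROOT = Path(tempfile.gettempdir()) / "cosmos_cache"  # assumed location

def cache_partial_parse(group_id: str, dbt_target_dir: Path) -> Path:
    """Copy the two dbt artifacts into a cache dir keyed by the group id."""
    cache_dir = CACHE_ROOT / group_id
    cache_dir.mkdir(parents=True, exist_ok=True)
    for artifact in ("partial_parse.msgpack", "manifest.json"):
        source = dbt_target_dir / artifact
        if source.exists():
            shutil.copy2(source, cache_dir / artifact)
    return cache_dir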

What we didn't expect is that many Airflow deployments, such as MWAA, do not handle the temporary directory well, as reported by @maslick in #1025 (comment):

our issue with cache enabled was that it quickly fills up the 30 GB storage volume on MWAA workers. We are running a large number of DAGs (>500) with lots of dbt models/transformations... We can't increase the storage volume on MWAA either. That's why we decided to disable the cache entirely. Of course, this would increase the parsing times...
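
For reference, a minimal sketch of the workaround described above, assuming the enable_cache option in the [cosmos] Airflow configuration section (exposed as the AIRFLOW__COSMOS__ENABLE_CACHE environment variable); whether this knob is available depends on the Cosmos version in use.

# Assumed knob: AIRFLOW__COSMOS__ENABLE_CACHE. Setting it to "0" before DAG
# parsing disables Cosmos's local caching, trading disk usage for parse time.
import os

os.environ.setdefault("AIRFLOW__COSMOS__ENABLE_CACHE", "0")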

Examples of file sizes

Jaffle shop: https://github.com/dbt-labs/jaffle-shop

-rw-r--r--   1 tati  staff   572K  5 Jun 09:34 manifest.json
-rw-r--r--   1 tati  staff   574K 29 May 09:48 partial_parse.msgpack

Gitlab Analytics: https://gitlab.com/gitlab-data/analytics/-/tree/master/transform/snowflake-dbt

-rw-r--r--   1 tati  staff    37M  5 Jun 10:40 manifest.json
-rw-r--r--   1 tati  staff    34M  5 Jun 10:40 partial_parse.msgpack

Assuming the files are around 37 MB each, and that two of them are cached per DbtTaskGroup or DbtDag, a deployment with approximately 400 DbtTaskGroups or DbtDags would accumulate roughly 30 GB of storage in the temporary directory. If the Airflow deployment does not handle temporary directories well, this may fill up the disk.
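
As a sanity check of that figure, here is the arithmetic, using the Gitlab Analytics artifact sizes from above:

# Rough disk-usage estimate for the scenario described above.
artifact_size_mb = 37  # manifest.json / partial_parse.msgpack for a large project
files_per_group = 2    # manifest.json + partial_parse.msgpack
groups = 400           # DbtTaskGroups / DbtDags in the deployment

total_gb = artifact_size_mb * files_per_group * groups / 1024
print(f"~{total_gb:.0f} GB of cached artifacts")  # ~29 GB, close to the 30 GB MWAA volume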

As mentioned in the thread, once we store this cache remotely, as described in #927, we'll avoid this type of issue.

As suggested by @josefbits in the thread (#1025 (comment)), we can change the partial parse caching strategy from being specific to each DbtTaskGroup and DbtDag to being per dbt project. We must evaluate how this plays with concurrent tasks / DAGs / TaskGroups - and if it works, we should change it.
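
To make that proposal concrete, here is a hypothetical sketch of a per-dbt-project cache key combined with a file lock to address the concurrency concern; project_cache_dir and update_partial_parse_cache are illustrative names, not existing Cosmos code.

# Hypothetical per-project caching: one cache directory per dbt project instead
# of one per DbtDag / DbtTaskGroup, with an exclusive lock so concurrent
# tasks/DAGs sharing the same project do not corrupt the cached artifact.
import fcntl  # POSIX-only; Airflow workers typically run on Linux
import hashlib
import shutil
from pathlib import Path

def project_cache_dir(cache_root: Path, dbt_project_dir: Path) -> Path:
    """Key the cache by the dbt project path rather than by the Airflow group."""
    digest = hashlib.md5(str(dbt_project_dir.resolve()).encode()).hexdigest()[:12]
    return cache_root / f"{dbt_project_dir.name}_{digest}"

def update_partial_parse_cache(cache_dir: Path, new_partial_parse: Path) -> None:
    """Replace the cached partial_parse.msgpack under an exclusive file lock."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    with open(cache_dir / ".lock", "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)
        try:
            tmp = cache_dir / "partial_parse.msgpack.tmp"
            shutil.copy2(new_partial_parse, tmp)
            tmp.replace(cache_dir / "partial_parse.msgpack")  # atomic rename
        finally:
            fcntl.flock(lock, fcntl.LOCK_UN)

With this keying, the number of cached pairs scales with the number of dbt projects rather than with the number of DbtDags/DbtTaskGroups, which is what would keep the disk footprint bounded.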

Relevant log output

No response

How to reproduce

N/A

Anything else :)?

No response

Are you willing to submit a PR?

  • Yes, I am willing to submit a PR!

Contact Details

No response