[Bug] Caching filling up Airflow nodes' disk
tatiana opened this issue · comments
Astronomer Cosmos Version
1.4.0 - 1.4.2
If "Other Astronomer Cosmos version" selected, which one?
1.4.2
dbt-core version
Any
Versions of dbt adapters
Any
LoadMode
DBT_LS
ExecutionMode
LOCAL
InvocationMode
None
airflow version
Any
Operating System
Any
If you think it's a UI issue, which browsers are you seeing the problem on?
No response
Deployment
Official Apache Airflow Helm Chart
Deployment details
Any
What happened?
When #904 was implemented, we started storing the partial_parse.msgpack file per DbtDag or DbtTaskGroup on local disk. At the time, we considered caching per dbt project instead, but there were concerns about concurrent tasks/DAGs accessing and changing this file, which would cause the cache to be purged often and make the overall performance improvement not good enough.
During the implementation, we noticed that partial_parse.msgpack relied on the Manifest file - it was not self-sufficient. For this reason, we also started caching the manifest.json file alongside the partial_parse.msgpack file.
What we didn't expect is that many Airflow deployments, such as MWAA, do not handle the temporary directory well, as reported in #1025 (comment) by @maslick:

> Our issue with cache enabled was that it quickly fills up the 30 GB storage volume on MWAA workers. We are running a large number of DAGs (>500) with lots of dbt models/transformations... We can't increase the storage volume on MWAA either. That's why we decided to disable the cache entirely. Of course, this would increase the parsing times...
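To confirm whether this is what is eating the disk, a worker-side check like the following can help. This is only a diagnostic sketch: the cache directory location and the `cosmos` name prefix are assumptions, so adjust them to match your deployment.

```python
# Diagnostic sketch: estimate the disk footprint of Cosmos cache directories.
# Assumption: cached files live in subdirectories of the system temp dir whose
# names start with "cosmos" -- verify this against your own worker filesystem.
import tempfile
from pathlib import Path


def cache_size_bytes(cache_root: Path) -> int:
    """Sum the size of every regular file under cache_root, recursively."""
    return sum(p.stat().st_size for p in cache_root.rglob("*") if p.is_file())


if __name__ == "__main__":
    root = Path(tempfile.gettempdir())
    total = sum(
        cache_size_bytes(d)
        for d in root.iterdir()
        if d.is_dir() and d.name.startswith("cosmos")  # assumed naming convention
    )
    print(f"Cosmos cache footprint: {total / 1024**3:.2f} GiB")
```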
Examples of file sizes

Jaffle shop: https://github.com/dbt-labs/jaffle-shop

```
-rw-r--r-- 1 tati staff 572K 5 Jun 09:34 manifest.json
-rw-r--r-- 1 tati staff 574K 29 May 09:48 partial_parse.msgpack
```

Gitlab Analytics: https://gitlab.com/gitlab-data/analytics/-/tree/master/transform/snowflake-dbt

```
-rw-r--r-- 1 tati staff 37M 5 Jun 10:40 manifest.json
-rw-r--r-- 1 tati staff 34M 5 Jun 10:40 partial_parse.msgpack
```
Assuming the files are around 37M each, and we cache two of them per DbtTaskGroup or DbtDag, approximately 400 DbtTaskGroups or DbtDags would consume roughly 30GB of storage in the temporary directory. If the Airflow deployment does not clean up temporary directories, this may fill up the disk.
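The estimate above can be checked back-of-the-envelope, using the GitLab Analytics file sizes as the input:

```python
# Worked check of the storage estimate: two cached files per DbtDag/DbtTaskGroup,
# sized like the GitLab Analytics example above, across ~400 DAGs/TaskGroups.
file_size_mb = 37     # approximate size of manifest.json / partial_parse.msgpack
files_per_cache = 2   # manifest.json + partial_parse.msgpack
num_groups = 400      # DbtDag / DbtTaskGroup instances

total_gb = file_size_mb * files_per_cache * num_groups / 1024
print(f"~{total_gb:.1f} GB")  # roughly 29 GB, enough to fill a 30 GB volume
```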
As mentioned in the thread, once we store this cache remotely, as described in #927, we'll avoid this type of issue.
As suggested by @josefbits in the thread (#1025 (comment)), we can change the partial parse caching strategy from being specific to each DbtTaskGroup and DbtDag to being per dbt project. We must evaluate how this plays with concurrent tasks / DAGs / TaskGroups - and if it works, we should change it.
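One possible shape for the project-based strategy is sketched below. This is not Cosmos code: the directory layout, function names, and use of an advisory file lock (POSIX-only `fcntl.flock`) to serialize concurrent writers are all assumptions about how the concurrency concern could be addressed.

```python
# Sketch of a project-scoped cache: the cache key is derived from the dbt
# project path, so every DbtDag/DbtTaskGroup using the same project shares
# one cache directory instead of each keeping its own copy.
import fcntl  # POSIX-only advisory locking; an assumption, not the Cosmos approach
import hashlib
import tempfile
from pathlib import Path


def project_cache_dir(dbt_project_path: str) -> Path:
    """One cache directory per dbt project (hypothetical layout)."""
    key = hashlib.sha256(dbt_project_path.encode()).hexdigest()[:16]
    cache = Path(tempfile.gettempdir()) / "cosmos_cache" / key
    cache.mkdir(parents=True, exist_ok=True)
    return cache


def update_partial_parse(dbt_project_path: str, payload: bytes) -> None:
    """Write partial_parse.msgpack under an exclusive lock so concurrent
    tasks do not corrupt or needlessly purge each other's cache."""
    cache = project_cache_dir(dbt_project_path)
    with open(cache / ".lock", "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)  # blocks until other writers finish
        (cache / "partial_parse.msgpack").write_bytes(payload)
```

With this layout, 400 DAGs pointing at one project would hold a single ~70M cache entry rather than 400 copies; the open question from the thread remains whether lock contention and cross-DAG invalidation erode the partial-parsing speedup.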
Relevant log output
No response
How to reproduce
N/A
Anything else :)?
No response
Are you willing to submit PR?
- Yes I am willing to submit a PR!
Contact Details
No response