[Bug] /tmp/ File Not Found Error Causing Task Failure for dbt Cosmos Tasks
oliverrmaa opened this issue · comments
Astronomer Cosmos Version
Other Astronomer Cosmos version (please specify below)
If "Other Astronomer Cosmos version" selected, which one?
1.4.3
dbt-core version
1.7.17
Versions of dbt adapters
dbt-bigquery==1.7.4
dbt-core==1.7.17
dbt-extractor==0.5.1
dbt-semantic-interfaces==0.4.4
LoadMode
DBT_LS
ExecutionMode
LOCAL
InvocationMode
SUBPROCESS
airflow version
apache-airflow==2.9.2+astro.1
Operating System
Debian GNU/Linux 11 (bullseye)
If you think it's a UI issue, which browsers are you seeing the problem on?
No response
Deployment
Astronomer
Deployment details
We have a main production deployment in Astro Cloud which we consider production. We also do local development via `astro dev start`. We have continuous deployment set up through CircleCI, which deploys PRs merged to our master branch to our production deployment via `astro deploy --dags`. For authentication to our data warehouse (Google BigQuery) in production, we use `GoogleCloudServiceAccountDictProfileMapping`; for local development we use `ProfileConfig`, where our dbt `profiles.yml` has a hardcoded path to a service account JSON file located at the same path for each developer.
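For context, a local-development setup like the one described might look roughly like this (a sketch with assumed names and paths, not the reporter's actual code; `ProfileConfig` is Cosmos's profile configuration class):

```python
# Illustrative sketch only: profile/target names and the profiles.yml path
# are assumptions, not taken from the reporter's project.
from cosmos import ProfileConfig

profile_config = ProfileConfig(
    profile_name="my_dbt_project",  # assumed profile name
    target_name="dev",              # assumed target
    # profiles.yml hardcodes the service account JSON path,
    # which is identical on every developer's machine
    profiles_yml_filepath="/usr/local/airflow/dbt/profiles.yml",
)
```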
What happened?
We are still intermittently seeing `FileNotFoundError: [Errno 2] No such file or directory` for `/tmp` files every few hours across multiple DAGs, ever since the inception of our Astronomer/Cosmos setup. The error appears on Cosmos-created dbt model run tasks. This affects our on-call personnel, who have to manually clear and re-run these tasks for the model to run successfully (the re-run usually succeeds). Some model runs must be manually re-run for the task to succeed, while others recover on their own.
Relevant log output
Here are four examples of errors in log output for different missing /tmp/ files:
(1)
[2024-06-24, 18:32:45 UTC] {subprocess.py:94} INFO - FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmp4yx6m8en/package-lock.yml'
(2) This is a typical example for one of our models:
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmp2pcqnmxp/models/frontroom/business_models/provider_pay/datamarts/.provider_performance.sql.JjagaL'
(3) This is a typical example for one of our models:
2024-06-21, 10:17:35 UTC] {log.py:232} WARNING - [2024-06-21T10:17:35.702+0000] {subprocess.py:94} INFO - (astronomer-cosmos) - FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmph56xe15q/models/frontroom/business_models/datamarts/honest_heatlh/.honest_health_monthly_subscription_snapshots.sql.KDCDEl'
[2024-06-21, 10:17:35 UTC] {subprocess.py:94} INFO - FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmph56xe15q/models/frontroom/business_models/datamarts/honest_heatlh/.honest_health_monthly_subscription_snapshots.sql.KDCDEl'
(4) This example is for external models we use from the dbt qualtrics package:
[Errno 2] No such file or directory: '/tmp/tmpf6h80niz/models/intermediate/shipment/.int_easy_post_tracking.sql.GgjJgL'
06:02:42 Traceback (most recent call last):
File "/usr/local/airflow/dbt_venv/lib/python3.11/site-packages/dbt/cli/requires.py", line 91, in wrapper
result, success = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/airflow/dbt_venv/lib/python3.11/site-packages/dbt/cli/requires.py", line 76, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/airflow/dbt_venv/lib/python3.11/site-packages/dbt/cli/requires.py", line 169, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/airflow/dbt_venv/lib/python3.11/site-packages/dbt/cli/requires.py", line 198, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/airflow/dbt_venv/lib/python3.11/site-packages/dbt/cli/requires.py", line 245, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/airflow/dbt_venv/lib/python3.11/site-packages/dbt/cli/requires.py", line 271, in wrapper
ctx.obj["manifest"] = parse_manifest(
^^^^^^^^^^^^^^^
File "/usr/local/airflow/dbt_venv/lib/python3.11/site-packages/dbt/parser/manifest.py", line 1790, in parse_manifest
manifest = ManifestLoader.get_full_manifest(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/airflow/dbt_venv/lib/python3.11/site-packages/dbt/parser/manifest.py", line 318, in get_full_manifest
manifest = loader.load()
^^^^^^^^^^^^^
File "/usr/local/airflow/dbt_venv/lib/python3.11/site-packages/dbt/parser/manifest.py", line 365, in load
file_reader.read_files()
File "/usr/local/airflow/dbt_venv/lib/python3.11/site-packages/dbt/parser/read_files.py", line 203, in read_files
self.read_files_for_project(project, file_types)
File "/usr/local/airflow/dbt_venv/lib/python3.11/site-packages/dbt/parser/read_files.py", line 210, in read_files_for_project
project_files[file_type_info["parser"]] = read_files_for_parser(
^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/airflow/dbt_venv/lib/python3.11/site-packages/dbt/parser/read_files.py", line 158, in read_files_for_parser
source_files = get_source_files(
^^^^^^^^^^^^^^^^^
File "/usr/local/airflow/dbt_venv/lib/python3.11/site-packages/dbt/parser/read_files.py", line 134, in get_source_files
fp_list = filesystem_search(project, paths, extension, ignore_spec)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/airflow/dbt_venv/lib/python3.11/site-packages/dbt/parser/search.py", line 74, in filesystem_search
for result in find_matching(root, relative_dirs, ext, ignore_spec):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/airflow/dbt_venv/lib/python3.11/site-packages/dbt/clients/system.py", line 79, in find_matching
modification_time = os.path.getmtime(absolute_path)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "<frozen genericpath>", line 55, in getmtime
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmpf6h80niz/models/intermediate/shipment/.int_easy_post_tracking.sql.GgjJgL'
How to reproduce
I believe replicating our setup should reproduce it, as the issue happens quite often (once every few hours). We also have 1000+ dbt models, so an example project of that size may be needed for reproducibility. If underlying deployment infrastructure information is required, here it is:
Anything else :)?
No response
Are you willing to submit PR?
- Yes I am willing to submit a PR!
Contact Details
To resolve the intermittent `FileNotFoundError: [Errno 2] No such file or directory` for `/tmp` files that is causing task failures for dbt Cosmos tasks in your Astronomer deployment, you should ensure that the temporary directories and files are properly managed and exist when needed.
Here are some steps to address this issue:
- **Manage temporary directories properly:**
  Ensure that the temporary directories are not cleaned up prematurely. Here is a revised version of the function to ensure the temporary directory is properly managed:

  ```python
  import os
  import shutil
  import tempfile
  from pathlib import Path

  def failing_test_dbt_project(tmp_path):
      # Keep the temporary directory alive for the duration of the test
      with tempfile.TemporaryDirectory() as tmp_dir:
          tmp_dir_path = Path(tmp_dir) / "mini"
          shutil.copytree(MINI_DBT_PROJ_DIR, tmp_dir_path)
          target_schema = tmp_dir_path / "models/schema.yml"
          if target_schema.exists():
              os.remove(target_schema)
          shutil.copy(MINI_DBT_PROJ_DIR_FAILING_SCHEMA, target_schema)
          yield tmp_dir_path
  ```
- **Set the `TMPDIR` environment variable:**
  Ensure that the `TMPDIR` environment variable is set correctly, especially on macOS, to avoid the value of the temp directory changing unexpectedly. You can configure this in your `cosmos/settings.py`:

  ```python
  import os
  import tempfile
  from pathlib import Path

  import airflow
  from airflow.configuration import conf

  from cosmos.constants import DEFAULT_COSMOS_CACHE_DIR_NAME, DEFAULT_OPENLINEAGE_NAMESPACE

  # On MacOS users may want to set the envvar `TMPDIR` if they do not want
  # the value of the temp directory to change
  DEFAULT_CACHE_DIR = Path(tempfile.gettempdir(), DEFAULT_COSMOS_CACHE_DIR_NAME)
  cache_dir = Path(conf.get("cosmos", "cache_dir", fallback=DEFAULT_CACHE_DIR) or DEFAULT_CACHE_DIR)
  enable_cache = conf.getboolean("cosmos", "enable_cache", fallback=True)
  enable_cache_partial_parse = conf.getboolean("cosmos", "enable_cache_partial_parse", fallback=True)
  enable_cache_dbt_ls = conf.getboolean("cosmos", "enable_cache_dbt_ls", fallback=True)
  propagate_logs = conf.getboolean("cosmos", "propagate_logs", fallback=True)
  dbt_docs_dir = conf.get("cosmos", "dbt_docs_dir", fallback=None)
  dbt_docs_conn_id = conf.get("cosmos", "dbt_docs_conn_id", fallback=None)
  dbt_docs_index_file_name = conf.get("cosmos", "dbt_docs_index_file_name", fallback="index.html")
  enable_cache_profile = conf.getboolean("cosmos", "enable_cache_profile", fallback=True)
  dbt_profile_cache_dir_name = conf.get("cosmos", "profile_cache_dir_name", fallback="profile")

  try:
      LINEAGE_NAMESPACE = conf.get("openlineage", "namespace")
  except airflow.exceptions.AirflowConfigException:
      LINEAGE_NAMESPACE = os.getenv("OPENLINEAGE_NAMESPACE", DEFAULT_OPENLINEAGE_NAMESPACE)
  ```
- **Handle the temporary directory in the `run_command` function:**
  Ensure that the `run_command` function in `cosmos/dbt/graph.py` is correctly handling the temporary directory and environment variables:

  ```python
  def run_command(command: list[str], tmp_dir: Path, env_vars: dict[str, str]) -> str:
      """Run a command in a subprocess, returning the stdout."""
      logger.info("Running command: `%s`", " ".join(command))
      logger.debug("Environment variable keys: %s", env_vars.keys())
      process = Popen(
          command,
          stdout=PIPE,
          stderr=PIPE,
          cwd=tmp_dir,
          universal_newlines=True,
          env=env_vars,
      )
      stdout, stderr = process.communicate()
      returncode = process.returncode

      if 'Run "dbt deps" to install package dependencies' in stdout and command[1] == "ls":
          raise CosmosLoadDbtException(
              "Unable to run dbt ls command due to missing dbt_packages. Set RenderConfig.dbt_deps=True."
          )

      if returncode or "Error" in stdout.replace("WarnErrorOptions", ""):
          details = stderr or stdout
          raise CosmosLoadDbtException(f"Unable to run {command} due to the error:\n{details}")

      return stdout
  ```
By ensuring the temporary directory is properly managed, configuring the `TMPDIR` environment variable, and correctly handling the temporary directory in the `run_command` function, you can mitigate the `FileNotFoundError` issue [1][2][3].
Hi, @oliverrmaa. I'm still trying to reproduce this issue, but I have a few hypotheses, and we can try a few things out. We'll soon have an alpha release to validate one of these hypotheses.
The errors you are seeing are raised by `dbt-core` itself. Cosmos 1.4.3 does not have any specific knowledge of any of the files mentioned, including `package-lock.yml`.
We create a temporary directory for every dbt command that will be run. This was to overcome two main issues:
- Read-only filesystems: many users could not write to the original dbt project folder. Older versions of Cosmos would create a full copy of the original directory; this was changed in recent Cosmos versions (available from 1.1 until now, 1.5) to a symbolic link, where applicable.
- dbt concurrency: many dbt-core commands were not designed to be run concurrently, as described in the official dbt docs, so running them from independent directories helps avoid issues.
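As a stdlib-only illustration of that idea (not Cosmos's actual implementation; the function name is hypothetical), creating an independent working directory per dbt command with symbolic links could look like:

```python
import os
import tempfile
from pathlib import Path

def make_isolated_project_dir(project_dir: str) -> str:
    """Create a temp dir mirroring the dbt project via symlinks,
    so each dbt command runs from an independent working directory."""
    tmp_dir = tempfile.mkdtemp(prefix="dbt_run_")
    for entry in Path(project_dir).iterdir():
        # Symlink each top-level entry instead of copying the whole project
        os.symlink(entry, Path(tmp_dir) / entry.name)
    return tmp_dir
```

This keeps the original project read-only from the subprocess's perspective while avoiding a full copy, which is the trade-off described above.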
Some follow-up questions:
(i) Have you observed if any of these errors are affected by deployments to Astro?
(ii) How is your disk consumption at the moment?
(iii) Do you have task retries enabled for these DAGs? What is the current amount of retries?
(iv) Would you consider running more, smaller Astro instances with lower concurrency and checking if that helps mitigate the issue?
(v) Which data warehouse are you using?
(vi) Which version of dbt-core are you using?
(vii) Are you currently running `dbt deps` as part of the task runs? Are these errors happening when `dbt deps` or the main dbt command (e.g. `dbt run`) is executed?
(viii) The files reported missing (e.g. `.provider_performance.sql.JjagaL`, `.honest_health_monthly_subscription_snapshots.sql.KDCDEl`): are they part of your original project, are they dependencies, or are they being created by `dbt-core` dynamically (e.g. compiled SQL files that may use macros)?
There are two low-hanging fruits I can see:
a) Assuming the issue may be with the creation of symbolic links, we can create an alpha version of Cosmos that avoids creating those, runs the commands from the original folder, and uses environment variables for dbt-core JSON artifacts and dbt-core logs.
b) We could check if this error message ("[Errno 2] No such file or directory") happened within the dbt task run, and attempt to run it as part of the same task run in Cosmos itself. However, it does feel that configuring the task_retry in Airflow may be the most suitable place.
If we implement (a) and (b), it would also help if you could confirm whether reducing concurrency on smaller Airflow/Astro instances still leads to the same error. Another path to improve the concurrent execution of dbt tasks is not to use dbt to run the SQL, but to have Airflow operators run the compiled SQL. This is a strategy that companies who had to scale dbt in Airflow (e.g., Monzo) used in their proprietary solutions, quite successfully. One downside is that some macros may not work as expected, but this can be a good compromise; another is that it would require some effort per data warehouse, but it may pay off.
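Option (b) could be sketched as a small wrapper (hypothetical code, not part of Cosmos; `run_dbt_with_retry` and the marker string are assumptions for illustration):

```python
import subprocess

# The transient error signature observed in the logs above
TRANSIENT_MARKER = "[Errno 2] No such file or directory"

def run_dbt_with_retry(command: list[str], cwd: str, max_attempts: int = 2) -> str:
    """Run a command, retrying in-process only when the known transient
    FileNotFoundError marker appears in its output."""
    output = ""
    for attempt in range(1, max_attempts + 1):
        proc = subprocess.run(command, cwd=cwd, capture_output=True, text=True)
        output = proc.stdout + proc.stderr
        if proc.returncode == 0:
            return output
        if TRANSIENT_MARKER not in output:
            # Non-transient failure: surface it immediately
            raise RuntimeError(f"Command failed:\n{output}")
    raise RuntimeError(f"Command still failing after {max_attempts} attempts:\n{output}")
```

As noted above, Airflow's own `retries` mechanism may still be the more suitable place for this behavior.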
This issue seems to be affecting several users from the OpenSource community and Astro customers.
Feedback from @oliverrmaa :
First of all, our dbt-bigquery version is 1.7.4 and our dbt-core version is 1.7.17. In `dbt_project.yml`, no, we do not override any paths.
The rest of the answers:
i) They appear unaffected; the errors can occur even when there are no deployments (i.e. over the weekend).
ii) How do we check that?
iii) We rolled it out for 1 DAG (set retries to 2) and it was very promising (I think the retries prevented it from being an issue for over a week). That doesn't tell us the root cause, but it addresses the issue. We may roll it out to all DAGs.
iv) Potentially open to this.
v) Google BigQuery.
vi) dbt-core version is 1.7.17.
vii) Yes we do; we use `operator_args={"install_deps": True}` in our `DbtTaskGroup()`.
viii) Definitely not part of the original project; they are dependencies created dynamically. They do use macros like `config()` and `source()`, but so do other dbt models. We also have a custom macro that is on every single dbt model, but it just adds `LIMIT 0` if the target is our CI environment.
On (ii), I'd recommend the following steps:
- Open the Deployment in the Astro UI.
- Click on Overview.
- Click Get Analytics.
- You can view the following metrics related to ephemeral storage:
Ephemeral storage usage (Kubernetes Executor/KubernetesPodOperator): This metric shows how your Kubernetes tasks use available ephemeral storage as a percentage of the total ephemeral storage configured. You can adjust the graph's y-axis to better fit your data or zoom in to view details.
Ephemeral storage usage (Celery workers): This metric shows how your Celery worker uses available ephemeral storage as a percentage of the total ephemeral storage configured. You can also adjust the graph's y-axis to better fit your data or zoom in to view details [1].
After a few hours of analyzing and troubleshooting this issue, I have a few conclusions:
Cause
Either `dbt-core` or another related library creates temporary files in the source models folder. I read part of the `dbt-core` source code, and it was unclear where this happens. This was a surprise, since I expected artifacts to be created in the target and logs folders, not in the source folders. The behavior may be associated with Jinja2 caching, a macro, or a dbt adapter. What is clear is that this is not caused by Cosmos (earlier this week I spoke to a user of another library for running dbt in Airflow who was facing the same issue). However, it is very likely to happen in environments like Airflow, where many concurrent dbt commands are run.
Mitigations
There are a few workarounds for the problem:
a) Use Airflow `task_retry` in the `DbtDag` / `DbtTaskGroup` where this is happening. As confirmed with the customer, after setting retries to 2 they no longer experienced the issue, since it is unlikely that this concurrency problem will happen many times in a row for the same task. It is not guaranteed, but it minimizes the problem and works with any version of Cosmos.
b) Reduce task concurrency at the Airflow worker nodes to 1 in the node pools running Cosmos tasks, and use very small worker nodes with a larger autoscaling upper limit.
c) Use `KubernetesExecutor` (no state would be shared between task executions).
d) Use Cosmos `ExecutionMode.KUBERNETES`.
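Workaround (a) could be wired up roughly like this (a hedged configuration sketch; the group id, retry values, and the project/profile configs are illustrative, and `default_args` is the standard Airflow mechanism for passing `retries` to the generated tasks):

```python
from datetime import timedelta

from cosmos import DbtTaskGroup

dbt_tg = DbtTaskGroup(
    group_id="dbt_models",                 # illustrative group id
    project_config=project_config,         # assumed, defined elsewhere
    profile_config=profile_config,         # assumed, defined elsewhere
    operator_args={"install_deps": True},  # as the reporter already uses
    default_args={
        "retries": 2,                      # re-run tasks hit by the transient error
        "retry_delay": timedelta(minutes=5),
    },
)
```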
Follow ups
We were not able to reproduce this problem. But it would be great, @oliverrmaa , if we could narrow down "who" is creating these files (e.g. "/usr/local/airflow/dags/dbt_bfp/models/paid_media/audiences/om_20240401_fl_scotus_abortion_showbuy_person/._om_20240401_fl_scotus_abortion_showbuy_person.yml.DGBDnb").
We could make the folder `/usr/local/airflow/dags/dbt/` read-only to the process that is running Airflow (Astro). By making the original folder read-only, there are two possible outcomes:
- Whichever process is trying to create these files will fail, and hopefully the traceback will be clear enough to determine which process it is. This would allow us to confirm whether it is `dbt-core` or another library and help us understand the possibilities.
- The process creating these files in the original folder will be smart enough to handle the read-only source folder, and it will work. In this case, your problem would be solved without adding retries or changing the concurrency.
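The read-only experiment can be performed with a few stdlib calls; a minimal sketch (the helper name is hypothetical, and the target would be the dbt project folder mentioned above):

```python
import stat
from pathlib import Path

def make_tree_read_only(root: str) -> None:
    """Strip write permission from every file and directory under `root`,
    so any process attempting to create the hidden temp files fails loudly."""
    for path in [Path(root), *Path(root).rglob("*")]:
        mode = path.stat().st_mode
        # Clear the user/group/other write bits, leaving read/execute intact
        path.chmod(mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))
```

Running the DAGs against such a folder should surface a permission-denied traceback from whichever process is writing the files.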
From an implementation perspective in Cosmos itself, we could adopt one of the following strategies to mitigate the problem:
- Stop creating a symbolic link to the `/models` folder; instead, create a new temp folder for `/models` and symlink only the `.sql` and `.yml` files inside it
- Have Cosmos try to identify this error message in the command output and, if it happens, retry within Cosmos itself