astronomer / astronomer-cosmos

Run your dbt Core projects as Apache Airflow DAGs and Task Groups with a few lines of code

Home Page: https://astronomer.github.io/astronomer-cosmos/


[Bug] /tmp/ File Not Found Error Causing Task Failure for dbt Cosmos Tasks

oliverrmaa opened this issue · comments


Astronomer Cosmos Version

Other Astronomer Cosmos version (please specify below)

If "Other Astronomer Cosmos version" selected, which one?

1.4.3

dbt-core version

1.7.17

Versions of dbt adapters

dbt-bigquery==1.7.4
dbt-core==1.7.17
dbt-extractor==0.5.1
dbt-semantic-interfaces==0.4.4

LoadMode

DBT_LS

ExecutionMode

LOCAL

InvocationMode

SUBPROCESS

airflow version

apache-airflow==2.9.2+astro.1

Operating System

Debian GNU/Linux 11 (bullseye)

If you think it's a UI issue, what browsers are you seeing the problem on?

No response

Deployment

Astronomer

Deployment details

We have a main deployment in Astro Cloud which we consider production. We also do local development via astro dev start. We have continuous deployment set up through CircleCI, which deploys PRs merged to our master branch to our production deployment via astro deploy --dags. For authentication to our data warehouse (Google BigQuery), we use GoogleCloudServiceAccountDictProfileMapping in production; locally we use ProfileConfig, where our dbt profiles.yml has a hardcoded path to a service account JSON file that lives at the same path for each developer.
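
For reference, a dual setup along these lines typically looks roughly like the sketch below; the connection id, profile/target names, project, dataset, and paths are placeholders, not the actual configuration described above:

    from cosmos import ProfileConfig
    from cosmos.profiles import GoogleCloudServiceAccountDictProfileMapping

    # Production: map an Airflow GCP connection into a BigQuery dbt profile.
    production_profile = ProfileConfig(
        profile_name="my_dbt_project",                 # placeholder
        target_name="prod",
        profile_mapping=GoogleCloudServiceAccountDictProfileMapping(
            conn_id="google_cloud_default",            # placeholder Airflow connection id
            profile_args={"project": "my-gcp-project", "dataset": "analytics"},  # placeholders
        ),
    )

    # Local development: point at a profiles.yml that hard-codes the path
    # to a service account JSON file shared by all developers.
    local_profile = ProfileConfig(
        profile_name="my_dbt_project",                 # placeholder
        target_name="dev",
        profiles_yml_filepath="/usr/local/airflow/dbt/profiles.yml",  # placeholder path
    )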

What happened?

We are still intermittently seeing FileNotFoundError: [Errno 2] No such file or directory for /tmp files every few hours or so, across multiple DAGs, ever since the inception of our Astronomer/Cosmos setup. The error appears on Cosmos-created dbt model run tasks. This issue affects our on-call personnel because they have to manually clear and re-run these tasks for the model to run successfully (the re-run usually succeeds). Some model runs must be manually re-run for the task to succeed, while others recover on their own.

Relevant log output

Here are four examples of errors in log output for different missing /tmp/ files: 

(1)
[2024-06-24, 18:32:45 UTC] {subprocess.py:94} INFO - FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmp4yx6m8en/package-lock.yml'

(2) This is a typical example for one of our models: 
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmp2pcqnmxp/models/frontroom/business_models/provider_pay/datamarts/.provider_performance.sql.JjagaL'

(3) This is a typical example for one of our models:
[2024-06-21, 10:17:35 UTC] {log.py:232} WARNING - [2024-06-21T10:17:35.702+0000] {subprocess.py:94} INFO - (astronomer-cosmos) - FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmph56xe15q/models/frontroom/business_models/datamarts/honest_heatlh/.honest_health_monthly_subscription_snapshots.sql.KDCDEl'
[2024-06-21, 10:17:35 UTC] {subprocess.py:94} INFO - FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmph56xe15q/models/frontroom/business_models/datamarts/honest_heatlh/.honest_health_monthly_subscription_snapshots.sql.KDCDEl'

(4) This example is for external models we use from the dbt qualtrics package: 
[Errno 2] No such file or directory: '/tmp/tmpf6h80niz/models/intermediate/shipment/.int_easy_post_tracking.sql.GgjJgL'
06:02:42  Traceback (most recent call last):
  File "/usr/local/airflow/dbt_venv/lib/python3.11/site-packages/dbt/cli/requires.py", line 91, in wrapper
    result, success = func(*args, **kwargs)
                      ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/airflow/dbt_venv/lib/python3.11/site-packages/dbt/cli/requires.py", line 76, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/airflow/dbt_venv/lib/python3.11/site-packages/dbt/cli/requires.py", line 169, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/airflow/dbt_venv/lib/python3.11/site-packages/dbt/cli/requires.py", line 198, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/airflow/dbt_venv/lib/python3.11/site-packages/dbt/cli/requires.py", line 245, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/airflow/dbt_venv/lib/python3.11/site-packages/dbt/cli/requires.py", line 271, in wrapper
    ctx.obj["manifest"] = parse_manifest(
                          ^^^^^^^^^^^^^^^
  File "/usr/local/airflow/dbt_venv/lib/python3.11/site-packages/dbt/parser/manifest.py", line 1790, in parse_manifest
    manifest = ManifestLoader.get_full_manifest(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/airflow/dbt_venv/lib/python3.11/site-packages/dbt/parser/manifest.py", line 318, in get_full_manifest
    manifest = loader.load()
               ^^^^^^^^^^^^^
  File "/usr/local/airflow/dbt_venv/lib/python3.11/site-packages/dbt/parser/manifest.py", line 365, in load
    file_reader.read_files()
  File "/usr/local/airflow/dbt_venv/lib/python3.11/site-packages/dbt/parser/read_files.py", line 203, in read_files
    self.read_files_for_project(project, file_types)
  File "/usr/local/airflow/dbt_venv/lib/python3.11/site-packages/dbt/parser/read_files.py", line 210, in read_files_for_project
    project_files[file_type_info["parser"]] = read_files_for_parser(
                                              ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/airflow/dbt_venv/lib/python3.11/site-packages/dbt/parser/read_files.py", line 158, in read_files_for_parser
    source_files = get_source_files(
                   ^^^^^^^^^^^^^^^^^
  File "/usr/local/airflow/dbt_venv/lib/python3.11/site-packages/dbt/parser/read_files.py", line 134, in get_source_files
    fp_list = filesystem_search(project, paths, extension, ignore_spec)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/airflow/dbt_venv/lib/python3.11/site-packages/dbt/parser/search.py", line 74, in filesystem_search
    for result in find_matching(root, relative_dirs, ext, ignore_spec):
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/airflow/dbt_venv/lib/python3.11/site-packages/dbt/clients/system.py", line 79, in find_matching
    modification_time = os.path.getmtime(absolute_path)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen genericpath>", line 55, in getmtime
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmpf6h80niz/models/intermediate/shipment/.int_easy_post_tracking.sql.GgjJgL'

How to reproduce

I believe replicating our setup should reproduce it, as the issue happens quite often (once every few hours). We also have 1000+ dbt models, so an example project of that size may be needed for reproducibility. If underlying deployment infrastructure information is required, here it is:

[deployment infrastructure screenshots attached in the original issue]

Anything else :)?

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Contact Details

oma@forhims.com

To resolve the intermittent FileNotFoundError: [Errno 2] No such file or directory for /tmp files that is causing task failures for dbt Cosmos tasks in your Astronomer deployment, you should ensure that the temporary directories and files are properly managed and exist when needed.

Here are some steps to address this issue:

  1. Manage Temporary Directories Properly:
    Ensure that the temporary directories are not cleaned up prematurely. Here is a revised version of the function to ensure the temporary directory is properly managed:

    import os
    import shutil
    import tempfile
    from pathlib import Path

    def failing_test_dbt_project(tmp_path):
        # MINI_DBT_PROJ_DIR and MINI_DBT_PROJ_DIR_FAILING_SCHEMA are assumed to be
        # defined elsewhere as paths to the sample dbt project and failing schema.
        with tempfile.TemporaryDirectory() as tmp_dir:
            tmp_dir_path = Path(tmp_dir) / "mini"
            shutil.copytree(MINI_DBT_PROJ_DIR, tmp_dir_path)
            target_schema = tmp_dir_path / "models/schema.yml"
            if target_schema.exists():
                os.remove(target_schema)
            shutil.copy(MINI_DBT_PROJ_DIR_FAILING_SCHEMA, target_schema)
            yield tmp_dir_path
  2. Set the TMPDIR Environment Variable:
    Ensure that the TMPDIR environment variable is set correctly, especially on macOS, to avoid the temp directory value changing unexpectedly. The relevant defaults are read in cosmos/settings.py:

    import os
    import tempfile
    from pathlib import Path
    
    import airflow
    from airflow.configuration import conf
    
    from cosmos.constants import DEFAULT_COSMOS_CACHE_DIR_NAME, DEFAULT_OPENLINEAGE_NAMESPACE
    
    # In MacOS users may want to set the envvar `TMPDIR` if they do not want the value of the temp directory to change
    DEFAULT_CACHE_DIR = Path(tempfile.gettempdir(), DEFAULT_COSMOS_CACHE_DIR_NAME)
    cache_dir = Path(conf.get("cosmos", "cache_dir", fallback=DEFAULT_CACHE_DIR) or DEFAULT_CACHE_DIR)
    enable_cache = conf.getboolean("cosmos", "enable_cache", fallback=True)
    enable_cache_partial_parse = conf.getboolean("cosmos", "enable_cache_partial_parse", fallback=True)
    enable_cache_dbt_ls = conf.getboolean("cosmos", "enable_cache_dbt_ls", fallback=True)
    propagate_logs = conf.getboolean("cosmos", "propagate_logs", fallback=True)
    dbt_docs_dir = conf.get("cosmos", "dbt_docs_dir", fallback=None)
    dbt_docs_conn_id = conf.get("cosmos", "dbt_docs_conn_id", fallback=None)
    dbt_docs_index_file_name = conf.get("cosmos", "dbt_docs_index_file_name", fallback="index.html")
    enable_cache_profile = conf.getboolean("cosmos", "enable_cache_profile", fallback=True)
    dbt_profile_cache_dir_name = conf.get("cosmos", "profile_cache_dir_name", fallback="profile")
    
    try:
        LINEAGE_NAMESPACE = conf.get("openlineage", "namespace")
    except airflow.exceptions.AirflowConfigException:
        LINEAGE_NAMESPACE = os.getenv("OPENLINEAGE_NAMESPACE", DEFAULT_OPENLINEAGE_NAMESPACE)
  3. Handle Temporary Directory in run_command Function:
    Ensure that the run_command function in cosmos/dbt/graph.py is correctly handling the temporary directory and environment variables:

    def run_command(command: list[str], tmp_dir: Path, env_vars: dict[str, str]) -> str:
        """Run a command in a subprocess, returning the stdout."""
        logger.info("Running command: `%s`", " ".join(command))
        logger.debug("Environment variable keys: %s", env_vars.keys())
        process = Popen(
            command,
            stdout=PIPE,
            stderr=PIPE,
            cwd=tmp_dir,
            universal_newlines=True,
            env=env_vars,
        )
        stdout, stderr = process.communicate()
        returncode = process.returncode
    
        if 'Run "dbt deps" to install package dependencies' in stdout and command[1] == "ls":
            raise CosmosLoadDbtException(
                "Unable to run dbt ls command due to missing dbt_packages. Set RenderConfig.dbt_deps=True."
            )
    
        if returncode or "Error" in stdout.replace("WarnErrorOptions", ""):
            details = stderr or stdout
            raise CosmosLoadDbtException(f"Unable to run {command} due to the error:\n{details}")
    
        return stdout

By ensuring the temporary directory is properly managed, configuring the TMPDIR environment variable, and correctly handling the temporary directory in the run_command function, you can mitigate the FileNotFoundError issue [1][2][3].

To continue talking to Dosu, mention @dosu.

Hi, @oliverrmaa. I'm still trying to reproduce this issue, but I have a few hypotheses, and we can try a few things out. We'll soon have an alpha release to validate one of these hypotheses.

The errors you are seeing are raised by dbt-core itself. Cosmos 1.4.3 does not have any specific knowledge of any of the files mentioned, including package-lock.yml.

We create a temporary directory for every dbt command that will be run. This was to overcome two main issues:

  1. Read-only filesystems: many users could not write to the original dbt project folder. Older versions of Cosmos would create a full copy of the original directory; in more recent versions (from 1.1 up to the current 1.5) this was changed to a symbolic link, where applicable.
  2. dbt concurrency: many dbt-core commands were not designed to be run concurrently, as described in the official dbt docs, so running them from independent directories helps avoid issues (see the illustrative sketch below).
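
Roughly, that mechanism looks like the sketch below. This is illustrative only, not Cosmos's actual implementation; `project_dir` is a placeholder for the original dbt project path:

    import os
    import tempfile
    from pathlib import Path

    def make_isolated_project_dir(project_dir: Path) -> Path:
        """Illustrative only: build a temporary working copy of a dbt project
        whose top-level entries are symbolic links back to the original."""
        tmp_dir = Path(tempfile.mkdtemp())
        for entry in project_dir.iterdir():
            # Each dbt command then runs with cwd=tmp_dir, so concurrent tasks
            # do not share a working directory.
            os.symlink(entry, tmp_dir / entry.name)
        return tmp_dir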

Some follow-up questions:
(i) Have you observed if any of these errors are affected by deployments to Astro?
(ii) How is your disk consumption at the moment?
(iii) Do you have task retries enabled for these DAGs? What is the current amount of retries?
(iv) Would you consider running more, smaller Astro instances with lower concurrency, and checking whether that helps mitigate the issue?
(v) Which data warehouse are you using?
(vi) Which version of dbt-core are you using?
(vii) Are you currently running dbt deps as part of the task runs? Are these errors happening when dbt deps or the central dbt command (e.g. dbt run) is executed?
(viii) The files that are mentioned to be missing (e.g. .provider_performance.sql.JjagaL, .honest_health_monthly_subscription_snapshots.sql.KDCDEl), are they part of your original project, are they dependencies or are they being created by dbt-core dynamically (e.g. compiled SQL files, that may use macros)?

There are two pieces of low-hanging fruit I can see:

a) Assuming the issue may be with the creation of symbolic links, we can create an alpha version of Cosmos that avoids creating those, runs the commands from the original folder, and uses environment variables for dbt-core JSON artifacts and dbt-core logs.

b) We could check whether this error message ("[Errno 2] No such file or directory") happened within the dbt task run and attempt to re-run it as part of the same task run in Cosmos itself. However, it does feel like configuring task retries in Airflow may be the more suitable place for this.
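
For context, setting task retries needs only standard Airflow arguments; a minimal sketch assuming a DbtDag, where the DAG id, project path, profile details, and retry count are all placeholders:

    from datetime import datetime

    from cosmos import DbtDag, ProfileConfig, ProjectConfig

    dbt_dag = DbtDag(
        dag_id="my_dbt_dag",                                                      # placeholder
        project_config=ProjectConfig("/usr/local/airflow/dags/dbt/my_project"),   # placeholder path
        profile_config=ProfileConfig(
            profile_name="my_dbt_project",                                        # placeholder
            target_name="prod",
            profiles_yml_filepath="/usr/local/airflow/dbt/profiles.yml",          # placeholder
        ),
        start_date=datetime(2024, 1, 1),
        schedule="@daily",
        # Retry tasks that hit the transient /tmp FileNotFoundError.
        default_args={"retries": 2},
    )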

We can implement (a) and (b) if you confirm that reducing the concurrency with smaller Airflow/Astro instances still leads to the same error. Another path we can follow to improve the concurrent execution of dbt tasks is not to use dbt to run the SQL, but to have Airflow operators run the compiled SQL. This is a strategy that companies who had to scale dbt in Airflow (e.g., Monzo) used quite successfully in their proprietary solutions. The downsides are that some macros may not work as expected and that it requires some effort per data warehouse, but it can be a good compromise and may pay off.

This issue seems to be affecting several users from the OpenSource community and Astro customers.

Feedback from @oliverrmaa :

First of all, our dbt-bigquery version is 1.7.4 and our dbt-core version is 1.7.17. And no, we do not override any paths in dbt_project.yml.
The rest of the answers:
i) they appear unaffected. They can occur even when there are no deployments (i.e. on weekends)
ii) how do we check that?
iii) we rolled it out for 1 DAG (set it to 2) and it was very promising (I think the retries prevented it from being an issue for over a week). It doesn't tell us the root cause, but it addresses the issue. We may roll it out to all DAGs.
iv) potentially open to this
v) Google BigQuery
vi) dbt core version is 1.7.17
vii) yup we do, we use operator_args={"install_deps": True} in our DbtTaskGroup()
viii) definitely not part of the original project, they are dependencies created dynamically. They do use macros like config() and source(), but so do other dbt models. We also have a custom macro that is on every single dbt model, but it's just a macro that adds LIMIT 0 if the target is our CI environment

On (ii), I'd recommend the following steps:

  1. Open the Deployment in the Astro UI.
  2. Click on Overview.
  3. Click Get Analytics.
  4. You can view the following metrics related to ephemeral storage:
    • Ephemeral storage usage (Kubernetes Executor/KubernetesPodOperator): this metric shows how your Kubernetes tasks use available ephemeral storage as a percentage of the total ephemeral storage configured. You can adjust the graph's y-axis to better fit your data or zoom in to view details.
    • Ephemeral storage usage (Celery workers): this metric shows how your Celery workers use available ephemeral storage as a percentage of the total ephemeral storage configured. You can also adjust the graph's y-axis to better fit your data or zoom in to view details [1].

After a few hours of analyzing and troubleshooting this issue, I have a few conclusions:

Cause

Either dbt-core or another related library creates temporary files in the source models folder. I read part of the dbt-core source code, and it was unclear where this happens. This was a surprise, since I expected artifacts to be created in the target and logs folders, not in the source folders. This behavior may be associated with Jinja2 caching or with some macro or dbt adapter. What is clear is that this is not caused by Cosmos (earlier this week I spoke to a user of another library for running dbt in Airflow who was facing the same issue). However, it is very likely to happen in environments like Airflow, where many concurrent dbt commands are run.

Mitigations

There are a few workarounds to the problem:

a) Use Airflow task retries in the DbtDag / DbtTaskGroup where this is happening. As confirmed with the customer, after setting retries to 2 they no longer experienced this issue, since it is unlikely that this concurrency issue will happen many times in a row for the same task. It is not guaranteed, but it minimizes the problem and works with any version of Cosmos.

b) Reduce task concurrency at the Airflow worker nodes to 1 in the node pools running Cosmos tasks, and have very small worker nodes, with a larger autoscaling upper limit.

c) Use KubernetesExecutor (no state would be shared between task executions)

d) Use Cosmos ExecutionMode.KUBERNETES
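
For (d), a rough sketch of what switching to ExecutionMode.KUBERNETES could look like; the DAG id, project path, profile details, image, and namespace below are placeholders, and the full set of Kubernetes options is covered in the Cosmos docs:

    from datetime import datetime

    from airflow import DAG
    from cosmos import DbtTaskGroup, ExecutionConfig, ProfileConfig, ProjectConfig
    from cosmos.constants import ExecutionMode

    with DAG(dag_id="dbt_k8s_example", start_date=datetime(2024, 1, 1), schedule=None):
        DbtTaskGroup(
            group_id="dbt_models",                                                    # placeholder
            project_config=ProjectConfig("/usr/local/airflow/dags/dbt/my_project"),   # placeholder path
            profile_config=ProfileConfig(
                profile_name="my_dbt_project",                                        # placeholder
                target_name="prod",
                profiles_yml_filepath="/usr/local/airflow/dbt/profiles.yml",          # placeholder
            ),
            # Each dbt node runs in its own pod, so no /tmp state is shared between tasks.
            execution_config=ExecutionConfig(execution_mode=ExecutionMode.KUBERNETES),
            operator_args={
                "image": "my-registry/dbt-image:latest",   # placeholder image with dbt installed
                "namespace": "airflow",                    # placeholder namespace
                "get_logs": True,
            },
        )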

Follow ups

We were not able to reproduce this problem, but it would be great, @oliverrmaa, if we could narrow down "who" is creating these files (e.g. "/usr/local/airflow/dags/dbt_bfp/models/paid_media/audiences/om_20240401_fl_scotus_abortion_showbuy_person/._om_20240401_fl_scotus_abortion_showbuy_person.yml.DGBDnb").

We could make this folder: /usr/local/airflow/dags/dbt/ read-only to the process that is running Airflow (Astro).

By making the original folder read-only, there are two possible outcomes (a minimal way to try this is sketched after the list below):

  1. Whichever process is trying to create these files will fail, and hopefully the traceback will be clear enough to determine which process it is. This would allow us to confirm whether it is dbt-core or another library and help us understand the possibilities.
  2. The process creating these files in the original folder will be smart enough to handle the read-only source folder, and it will keep working. In this case, your problem would be solved without adding retries or changing the concurrency.
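
One way to experiment with the read-only folder, assuming the project lives at the path mentioned above (adjust the path to your deployment):

    import os
    from pathlib import Path

    # Strip write permission from the dbt project folder so that any process
    # trying to create files inside it fails with a clear traceback.
    DBT_PROJECT_DIR = Path("/usr/local/airflow/dags/dbt")  # adjust to your deployment

    for path in [DBT_PROJECT_DIR, *DBT_PROJECT_DIR.rglob("*")]:
        os.chmod(path, path.stat().st_mode & ~0o222)  # clear all write bits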

From an implementation perspective in Cosmos itself, we could adopt one of the following strategies to mitigate the problem:

  1. Stop creating a symbolic link to the whole /models folder; instead, create a new temp folder for /models and symbolically link only the sql and yml files inside it (a rough sketch follows below)
  2. Have Cosmos try to identify this error message in the command output and, if it happens, retry within Cosmos itself
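
A rough sketch of strategy 1, with an illustrative helper; the name and the exact set of linked extensions are assumptions, not Cosmos's actual implementation:

    import os
    import tempfile
    from pathlib import Path

    def build_models_tmp_dir(models_dir: Path) -> Path:
        """Create a real temporary /models folder and symlink only .sql and
        .yml/.yaml files into it, so stray hidden temp files never end up in
        the source tree."""
        tmp_models = Path(tempfile.mkdtemp()) / "models"
        for src in models_dir.rglob("*"):
            if src.suffix not in {".sql", ".yml", ".yaml"}:
                continue
            dst = tmp_models / src.relative_to(models_dir)
            dst.parent.mkdir(parents=True, exist_ok=True)
            os.symlink(src, dst)
        return tmp_models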