astronomer / astronomer-cosmos

Run your dbt Core projects as Apache Airflow DAGs and Task Groups with a few lines of code

Home Page: https://astronomer.github.io/astronomer-cosmos/

Cosmos tasks randomly getting marked as zombie tasks

1cadumagalhaes opened this issue

Not sure if this should be here, but it happens mostly with Cosmos DAGs.

My Airflow instance runs in a Kubernetes cluster, so this might be a problem with the scheduler rather than with Cosmos itself. I'm using the CeleryExecutor right now; I might switch these DAGs to the KubernetesExecutor to see if that works.

But anyway, if anyone knows how to help, it would be great.

[2024-05-27, 22:00:37 UTC] {task_context_logger.py:104} ERROR - Detected zombie job: {'full_filepath': '/opt/airflow/dags/repo/dags/dags/dbt/rtz.py', 'processor_subdir': None, 'msg': "{'DAG Id': 'transform___dbt-instagram__dev', 'Task Id': 'instagram_models.all_media_insights_run', 'Run Id': 'scheduled__2024-05-27T21:20:00+00:00', 'Hostname': 'airflow-worker-1.airflow-worker.orchestration.svc.cluster.local', 'External Executor Id': '72425eed-2b53-4d22-81f0-fee59bbb3c99'}", 'simple_task_instance': <airflow.models.taskinstance.SimpleTaskInstance object at 0x7f856ebd3010>, 'is_failure_callback': True} (See https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/tasks.html#zombie-undead-tasks)

Sometimes it is marked as a zombie task even if it did execute correctly.

Switching to KubernetesExecutor might help resolve the issue of tasks being marked as zombie tasks, especially since your Airflow instance is running in a Kubernetes cluster.

Here's an example of how to define a DAG using DbtRunKubernetesOperator:

from datetime import datetime

from airflow import DAG
from cosmos.operators.kubernetes import DbtRunKubernetesOperator

default_args = {
    'owner': 'airflow',
    'start_date': datetime(2024, 1, 1),  # fixed start date (days_ago() is deprecated)
}

with DAG(
    dag_id='example_dbt_kubernetes_dag',
    default_args=default_args,
    schedule_interval='@daily',
    catchup=False,
) as dag:

    run_dbt_task = DbtRunKubernetesOperator(
        task_id='run_dbt_task',
        profile_config=None,  # Add your profile config if needed
        namespace='default',  # Kubernetes namespace
        image='your-dbt-image',  # Docker image for dbt
        cmds=["dbt", "run"],  # dbt command
        name='dbt-run-pod',
        in_cluster=True,
        is_delete_operator_pod=True,
        get_logs=True,
    )

    run_dbt_task

Update your airflow.cfg to use KubernetesExecutor:

[core]
executor = KubernetesExecutor

Restart your Airflow services after making these changes. This should help mitigate the issue of tasks being marked as zombie tasks [1].

About Dosu
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Which version of Cosmos are you using?

@1cadumagalhaes could you also confirm the Cosmos LoadMethod you're using?

Sorry!
I'm using Cosmos 1.3.2, @pankajastro.

And I haven't set the LoadMethod, so it's automatic. I think it is rendering with dbt_ls, based on the logs, @tatiana:

Unable to do partial parsing because saved manifest not found. Starting full parse.
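
For reference, the load method can be pinned explicitly via RenderConfig rather than relying on AUTOMATIC; a minimal sketch, with placeholder paths:

from cosmos import ProjectConfig, RenderConfig
from cosmos.constants import LoadMode

# Pin the load method explicitly so the rendering behaviour is unambiguous.
render_config = RenderConfig(load_method=LoadMode.DBT_LS)

# Alternatively, point Cosmos at a pre-compiled manifest so no dbt invocation
# happens while the DAG file is parsed (placeholder paths):
project_config = ProjectConfig(
    dbt_project_path="/opt/airflow/dags/dbt/instagram",
    manifest_path="/opt/airflow/dags/dbt/instagram/target/manifest.json",
)
render_config_manifest = RenderConfig(load_method=LoadMode.DBT_MANIFEST)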

Sometimes it is marked as a zombie task even if it did execute correctly.

Since this is not persistent, I feel the DAG is being converted correctly, so there might be an issue with the scheduler, as you have pointed out. But do you see any unusual task instance configuration for the zombie tasks? Also, we have released 1.4.3, which introduces some performance improvements. I'm not sure whether it will address this issue, but would you like to try it?

Yeah I believe this is a memory issue.

"Zombie tasks" is the error message you get on Astronomer if your celery worker sigkills.

This happens because Cosmos spins up subprocesses on the Celery workers, and each of those subprocesses will exceed 100 MiB. On top of the Celery worker's own overhead, this adds up quickly, and SIGKILLs occur very frequently.

IMO, my two cents: we really need a more opinionated guide on this. Especially now that there is an invocation mode that runs dbt directly in the Python process instead of spinning up a new subprocess, and now that dbt-core's dependency pins are no longer incompatible with Airflow 2.9's, Astronomer users can easily avoid this. But there is no guide that tells people how to do that, never mind anything that opinionatedly steers users toward it. Cosmos is confusing, and users are likely to do what they are told, so we should tell them to do the things that don't cause these sorts of issues.
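
For context, the in-process invocation mode is (as I understand it) built on dbt's programmatic entrypoint, dbtRunner from dbt.cli.main, available in dbt-core >= 1.5. A minimal standalone sketch of that API outside of Cosmos, with a placeholder project path:

from dbt.cli.main import dbtRunner, dbtRunnerResult

# Invoke dbt inside the current Python process instead of shelling out.
runner = dbtRunner()
result: dbtRunnerResult = runner.invoke(
    ["run", "--project-dir", "/opt/airflow/dags/dbt/instagram"]  # placeholder path
)
if not result.success:
    raise RuntimeError(result.exception)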

w0ut0 commented

@dwreeves, any (opinionated) suggestions in terms of required resources per worker / tasks per worker / ...?

@w0ut0 My most opinionated suggestion is to use the DBT_RUNNER invocation mode. I'm sure the Astronomer people would like it if I suggested using bigger workers, since that's more money for them 😂😅 (although, real talk, if any Astronomer people are reading this, "high memory" workers would be a great feature!), but you can avoid a ton of overhead by not spawning multiple subprocesses.
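
A minimal sketch of what enabling that looks like in Cosmos >= 1.4; the DAG id, paths, and profile details are placeholders, and dbt-core plus your adapter must be installed in the same environment as Airflow for this mode:

from datetime import datetime

from cosmos import DbtDag, ExecutionConfig, ProfileConfig, ProjectConfig
from cosmos.constants import InvocationMode

dbt_runner_dag = DbtDag(
    dag_id="example_dbt_runner_dag",  # placeholder
    schedule_interval="@daily",
    start_date=datetime(2024, 1, 1),
    catchup=False,
    project_config=ProjectConfig(
        dbt_project_path="/opt/airflow/dags/dbt/instagram",  # placeholder
    ),
    profile_config=ProfileConfig(
        profile_name="instagram",  # placeholder
        target_name="dev",         # placeholder
        profiles_yml_filepath="/opt/airflow/dags/dbt/instagram/profiles.yml",  # placeholder
    ),
    execution_config=ExecutionConfig(
        # Run dbt via its programmatic runner inside the worker process,
        # instead of spawning a separate dbt subprocess per task.
        invocation_mode=InvocationMode.DBT_RUNNER,
    ),
)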

w0ut0 commented

Thanks, will test if it makes a significant improvement in our tasks without increasing the worker resources!