Netflix / metaflow

:rocket: Build and manage real-life ML, AI, and data science projects with ease!

Home Page: https://metaflow.org

Metaflow crashes on AWS Batch if folder called `metaflow` is present in the working directory

shchur opened this issue · comments

I created the stack using the Metaflow CloudFormation template.

I can run flows locally, but flows fail if I add the @batch decorator.

For example, this flow

Code
from metaflow import FlowSpec, step, batch


class HelloFlow(FlowSpec):
    @step
    def start(self):
        print("HelloFlow is starting.")
        self.next(self.hello)

    @batch
    @step
    def hello(self):
        print("Metaflow says: Hi!")
        self.next(self.end)

    @step
    def end(self):
        print("HelloFlow is all done.")


if __name__ == "__main__":
    HelloFlow()

fails with the following error message

Metaflow 2.10.3 executing HelloFlow for user:shchuro
Validating your flow...
    The graph looks good!
Running pylint...
    Pylint not found, so extra checks are disabled.
2024-01-09 22:26:54.668 Workflow starting (run-id 1704839212206922):
2024-01-09 22:26:54.832 [1704839212206922/start/1 (pid 31890)] Task is starting.
2024-01-09 22:26:57.629 [1704839212206922/start/1 (pid 31890)] HelloFlow is starting.
2024-01-09 22:26:59.947 [1704839212206922/start/1 (pid 31890)] Task finished successfully.
2024-01-09 22:27:01.262 [1704839212206922/hello/2 (pid 32009)] Task is starting.
2024-01-09 22:27:06.207 [1704839212206922/hello/2 (pid 32009)] [bf7f4021-757d-4d11-bf01-553f2637aff1] Task is starting (status SUBMITTED)...
2024-01-09 22:27:08.452 [1704839212206922/hello/2 (pid 32009)] [bf7f4021-757d-4d11-bf01-553f2637aff1] Task is starting (status RUNNABLE)...
2024-01-09 22:27:10.692 [1704839212206922/hello/2 (pid 32009)] [bf7f4021-757d-4d11-bf01-553f2637aff1] Task is starting (status STARTING)...
2024-01-09 22:27:25.274 [1704839212206922/hello/2 (pid 32009)] [bf7f4021-757d-4d11-bf01-553f2637aff1] Task is starting (status FAILED)...
2024-01-09 22:27:27.758 [1704839212206922/hello/2 (pid 32009)] Data store error:
2024-01-09 22:27:27.759 [1704839212206922/hello/2 (pid 32009)] No completed attempts of the task was found for task 'HelloFlow/1704839212206922/hello/2'
2024-01-09 22:27:28.156 [1704839212206922/hello/2 (pid 32009)]
2024-01-09 22:27:28.554 [1704839212206922/hello/2 (pid 32009)] Task failed.
2024-01-09 22:27:28.888 Workflow failed.
2024-01-09 22:27:28.888 Terminating 0 active tasks...
2024-01-09 22:27:28.888 Flushing logs...
    Step failure:
    Step hello (task-id 2) failed.

The S3 folder for the failing step (HelloFlow/1704839212206922/hello/2) contains the following files:

  • 0.attempt.json
  • 0.runtime
  • 0.runtime_stderr.log
  • 0.runtime_stdout.log

I would really appreciate any pointers on how to debug this issue.

Hey @shchur, I wasn't able to reproduce this. Attaching a screenshot of a successful run with @batch:

[Screenshot: successful run with @batch]

Could you perhaps:

  • post the contents of 0.runtime_stderr.log and 0.runtime_stdout.log
  • try a newer Metaflow version, e.g. 2.10.8, since you are using 2.10.3

Thank you for the quick response! I tried using 2.10.8 and encountered the same error. Here are the contents of s3://****-metaflows3bucket-vqcend9aqn9z/HelloFlow/1704868945391404/hello/2/:

  • 0.runtime_stderr.log:
[MFLOG|0|2024-01-10T06:46:03.311728Z|runtime|91c794bc-1325-4cc5-98d7-3939daef119f]    Data store error:
[MFLOG|0|2024-01-10T06:46:03.311985Z|runtime|735f3f38-9a60-4978-aefb-6f69aff0d75a]    No completed attempts of the task was found for task 'HelloFlow/1704868945391404/hello/2'
[MFLOG|0|2024-01-10T06:46:03.691040Z|runtime|7f210482-078e-4217-abc9-92b1901c9e3c]
[MFLOG|0|2024-01-10T06:46:04.086122Z|runtime|e0e3f263-4ef9-404f-8196-134901d480d3]Task failed.
  • 0.runtime_stdout.log:
[MFLOG|0|2024-01-10T06:42:38.449017Z|runtime|dfcf4833-1cab-43fd-996e-4edefcc22977][9814be19-a252-4491-b6a7-3fe6b13bfa69] Task is starting (status RUNNABLE)...
[MFLOG|0|2024-01-10T06:43:08.649344Z|runtime|e2a6331b-df2c-42ec-b401-3c0463190c3a][9814be19-a252-4491-b6a7-3fe6b13bfa69] Task is starting (status RUNNABLE)...
[MFLOG|0|2024-01-10T06:43:38.852722Z|runtime|9ebac194-f505-491e-8a17-e45a25339148][9814be19-a252-4491-b6a7-3fe6b13bfa69] Task is starting (status RUNNABLE)...
[MFLOG|0|2024-01-10T06:44:09.018890Z|runtime|085c5b85-e151-4248-92b5-49cff2390b1f][9814be19-a252-4491-b6a7-3fe6b13bfa69] Task is starting (status RUNNABLE)...
[MFLOG|0|2024-01-10T06:44:39.160499Z|runtime|4d066689-d3e0-45a1-a9ee-6aaad1fe506d][9814be19-a252-4491-b6a7-3fe6b13bfa69] Task is starting (status RUNNABLE)...
[MFLOG|0|2024-01-10T06:45:09.384003Z|runtime|de486574-58a1-4c38-b52f-326f3237b4bc][9814be19-a252-4491-b6a7-3fe6b13bfa69] Task is starting (status RUNNABLE)...
[MFLOG|0|2024-01-10T06:45:39.528610Z|runtime|ee0bd5f6-3739-4242-bab8-4eedccdb964e][9814be19-a252-4491-b6a7-3fe6b13bfa69] Task is starting (status RUNNABLE)...
[MFLOG|0|2024-01-10T06:45:45.104157Z|runtime|f36f4539-ceaa-490d-a4a0-afb0ec2ec232][9814be19-a252-4491-b6a7-3fe6b13bfa69] Task is starting (status STARTING)...
[MFLOG|0|2024-01-10T06:46:00.746162Z|runtime|a324752b-a551-46e4-9db0-f347123d424c][9814be19-a252-4491-b6a7-3fe6b13bfa69] Task is starting (status FAILED)...
  • 0.runtime
{"return_code": 1, "killed": false, "success": false}
  • 0.attempt.json
{"time": 1704868953.193214}

I suspect that the problem lies in the AWS configuration, but I'm not sure how to get to its root cause. I used the CloudFormation template without any modifications, and stack creation finished with CREATE_COMPLETE status.

Are there any additional log statements that I could add to the Metaflow code to get a more informative error message / understand which exact operation is failing?

I found the source of the problem: my working directory included a folder called `metaflow`, which crashed the `metaflow` command executed during environment setup.

mkdir: cannot create directory ‘metaflow’: File exists	

It might be helpful to check for the presence of this folder before submitting the job and raise an informative error message to the user.
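Such a guard could look roughly like this (a hypothetical sketch, not code from the Metaflow codebase; `check_no_metaflow_dir` is an invented name):

```python
import os


def check_no_metaflow_dir(workdir: str = ".") -> None:
    """Fail fast if a local 'metaflow' directory would clash with the
    code package that Metaflow unpacks on the AWS Batch container.

    Hypothetical pre-submission check; the real Metaflow codebase may
    implement such a guard differently.
    """
    candidate = os.path.join(workdir, "metaflow")
    if os.path.isdir(candidate):
        raise RuntimeError(
            f"A directory named 'metaflow' exists at {os.path.abspath(candidate)}. "
            "It clashes with the code package unpacked on AWS Batch "
            "(mkdir: cannot create directory 'metaflow': File exists). "
            "Rename or remove it before submitting the job."
        )
```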

As for debugging jobs on AWS Batch, I was able to find the detailed log with the error message in the Amazon Elastic Container Service console under Clusters.
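For reference, the same container logs can usually be pulled from the command line as well (a sketch, environment-dependent: the job ID below is the one printed in the Metaflow output, and `/aws/batch/job` is the default Batch log group; requires a configured AWS CLI):

```shell
# Look up the CloudWatch log stream for the failed Batch job, then dump it.
JOB_ID="bf7f4021-757d-4d11-bf01-553f2637aff1"  # job ID from the Metaflow output
STREAM=$(aws batch describe-jobs --jobs "$JOB_ID" \
  --query 'jobs[0].container.logStreamName' --output text)
aws logs get-log-events \
  --log-group-name /aws/batch/job \
  --log-stream-name "$STREAM" \
  --query 'events[].message' --output text
```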

My problem is solved, but I'm keeping this issue open in case you want to add an informative error message for the edge case.

Thank you for building and maintaining such an amazing framework!

@shchur I wasn't able to reproduce this, can you please share your directory structure?

Mine looks the following:

[Screenshot: directory structure]

and I submit the flow with `python hello.py run --with batch`

@madhur-ob The problem was that I had built the Docker image used by Metaflow from that same directory. In other words, if a `metaflow` folder is present in the working directory of the Docker image, the Batch job crashes because it cannot unpack the code archive.
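One way to keep the stray folder out of the image is a `.dockerignore` in the build context (a sketch; the entries beyond `metaflow/` are just common examples, not from this thread):

```
# .dockerignore — these paths are excluded from the Docker build context,
# so the image's working directory never contains a local 'metaflow' folder
metaflow/
.git/
__pycache__/
```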