Netflix / metaflow

:rocket: Build and manage real-life ML, AI, and data science projects with ease!

Home Page:https://metaflow.org

Data store error

jareenramuk opened this issue

I have been working with Metaflow on AWS Batch, and for the past few days I have been getting
"Data store error: No completed attempts of the task was found for task" at random places in the flow.

Sometimes it pops up in the start step and sometimes in the end step. I have also observed it in foreach steps and with parallel_map. We are using Metaflow version 2.10.7.

I'm running multiple jobs on AWS Batch using Metaflow, and recently I too have been noticing sporadic job failures with exit code 137.

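I understand that exit code 137 usually indicates a memory issue (the container was killed with SIGKILL, which is the signal the kernel's out-of-memory killer sends), but we see this error only occasionally.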
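For anyone decoding these exit codes: a code above 128 conventionally means the process was terminated by signal number (code - 128), so 137 corresponds to signal 9. Below is a minimal sketch of that arithmetic in Python. The helper name `describe_exit_code` is made up for illustration and is not part of Metaflow or AWS Batch. Note that SIGKILL says only that something killed the process. It does not say whether that was the OOM killer or something else, such as the Docker daemon stopping the container.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Explain a container exit code using the 128 + signal-number convention."""
    if code > 128:
        signum = code - 128
        try:
            # Map the signal number to its symbolic name on this platform.
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform; report it numerically.
            name = f"signal {signum}"
        return f"killed by {name}"
    if code == 0:
        return "exited normally"
    return f"exited with error status {code}"


if __name__ == "__main__":
    print(describe_exit_code(137))
```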

We tested by giving 16 GB of memory to one job and 128 GB to another, passing the same payload to both. The 16 GB job passed once while the 128 GB job failed, so we are not sure it is actually a memory issue.

Is there any chance that this is a Metaflow-related issue? The error we are seeing is:

Data store error: 
No completed attempts of the task was found for task

I checked the task_datastore.py file in this repository and noticed that this error is thrown if the 'DONE.lock' file has not been created for any attempt of the task.
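To illustrate why a container that is killed mid-run surfaces as this error, here is a simplified sketch of a "find the latest completed attempt" check. It is not Metaflow's actual code. The `<attempt>.DONE.lock` naming scheme and the function name `latest_completed_attempt` are assumptions made for the example. The idea is only that a task killed before it writes its done marker leaves no completed attempt for a reader to find.

```python
import re
from typing import Iterable, Optional

# Assumed naming scheme for the per-attempt done marker (illustrative only).
DONE_MARKER = re.compile(r"^(\d+)\.DONE\.lock$")


def latest_completed_attempt(filenames: Iterable[str]) -> Optional[int]:
    """Return the highest attempt number that has a done marker, or None."""
    done = [
        int(match.group(1))
        for name in filenames
        if (match := DONE_MARKER.match(name))
    ]
    return max(done) if done else None


if __name__ == "__main__":
    # Hypothetical listing for a task whose only attempt was killed mid-run.
    files = ["0.attempt.json", "0.data.json"]
    if latest_completed_attempt(files) is None:
        print("No completed attempts of the task was found")
```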

We tested this with various versions of Metaflow, including 2.9.11 and 2.10.7, and we are seeing the same error on all of them.

@bjupreti are you able to replicate this error consistently?

No, I'm not able to replicate it consistently. The same Batch job with the same compute environment, resources, and payload sometimes passes and sometimes fails.

In my case, I checked the logs of the EC2 instances. SSM scripts were running as part of EC2 startup, and they restarted the Docker daemon, which stopped the running job container. When the ECS service came back up, it saw that the container had been stopped and reported to AWS that the job had failed. Once the SSM scripts were no longer run on the instances, the ECS agent service did not restart, and I'm not seeing the above error anymore.
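For others trying to tell an out-of-memory kill apart from a host-level interruption like this one, it may help to look at what AWS Batch recorded for each attempt. The sketch below uses boto3's `describe_jobs` call and prints the `exitCode`, container `reason`, and `statusReason` of every attempt. The helper names `summarize_attempts` and `fetch_job` are made up for the example, and the job ID is a placeholder. The wording of the reason strings varies, so treat them as hints, not as a definitive diagnosis.

```python
from typing import Any, Dict, List


def summarize_attempts(job: Dict[str, Any]) -> List[str]:
    """Build one summary line per attempt from a describe_jobs job entry."""
    lines = []
    for index, attempt in enumerate(job.get("attempts", [])):
        container = attempt.get("container", {})
        lines.append(
            f"attempt {index}: exitCode={container.get('exitCode')} "
            f"reason={container.get('reason')!r} "
            f"statusReason={attempt.get('statusReason')!r}"
        )
    return lines


def fetch_job(job_id: str) -> Dict[str, Any]:
    """Fetch a single job description from AWS Batch (needs AWS credentials)."""
    import boto3  # imported lazily so summarize_attempts works without boto3

    response = boto3.client("batch").describe_jobs(jobs=[job_id])
    return response["jobs"][0]


if __name__ == "__main__":
    # Placeholder job ID; replace with the ID of a failed Batch job.
    for line in summarize_attempts(fetch_job("YOUR-JOB-ID")):
        print(line)
```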