microsoft / farmvibes-ai

FarmVibes.AI: Multi-Modal GeoSpatial ML Models for Agriculture and Sustainability

Home Page: https://microsoft.github.io/farmvibes-ai/

Issue with SpaceEye Pipeline Stalling on Rerun

click2cloud-sagarB opened this issue · comments

In which step did you encounter the bug?

Workflow execution

Are you using a local or a remote (AKS) FarmVibes.AI cluster?

Local cluster

Bug description

Dear FarmVibes.AI Team,
In the most recent release of FarmVibes.AI, we've noticed that the SpaceEye pipeline stalls on rerun at the tasks "spaceeye.preprocess.cloud.cloud" and "spaceeye.preprocess.cloud.shadow", and all subsequent jobs in the SpaceEye pipeline are stuck as well.
Please let me know if you need anything more from my end.
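
For context, a typical submission of the SpaceEye interpolation workflow through the FarmVibes.AI Python client looks roughly like the sketch below; the region, time range, and run name are placeholders for illustration, not values from this report.

from datetime import datetime

from shapely import geometry as shpg
from vibe_core.client import get_default_vibe_client

# Connect to the local FarmVibes.AI cluster
client = get_default_vibe_client()

# Placeholder region of interest and time range
roi = shpg.box(-88.10, 41.50, -88.00, 41.60)
time_range = (datetime(2022, 6, 1), datetime(2022, 7, 1))

run = client.run(
    "data_ingestion/spaceeye/spaceeye_interpolation",
    "spaceeye-rerun-test",
    geometry=roi,
    time_range=time_range,
)

# Blocks and prints a live table with the status of each task in the run
run.monitor()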

Steps to reproduce the problem

No response

Hi, @click2cloud-sagarB. Thank you for raising the issue. A few questions and requests:

  • When you say "stalls", what is the status of the tasks? Are both the cloud and shadow tasks marked as queued? Could you provide a screenshot of the client.monitor() table?

  • Does this also happen if you run the workflow on a new region/time range that is not cached?

  • Are you passing the Planetary Computer key as a parameter to the workflow? We recently changed this, and the PC key is now a required parameter (see the sketch after this list).

  • Could you provide the logs for the orchestrator, cache, and worker pods so we can investigate what might be happening?
    They are located in the logs folder of your storage (e.g., ~/.cache/farmvibes-ai/logs).
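
As a rough sketch of the monitoring and PC-key points above, assuming the Planetary Computer key is exposed through a pc_key workflow parameter (the parameter name is an assumption here; client.document_workflow() lists the actual parameters), a rerun with the key passed explicitly would look like this:

from datetime import datetime

from shapely import geometry as shpg
from vibe_core.client import get_default_vibe_client

client = get_default_vibe_client()

# Print the workflow documentation to confirm the exact parameter name
client.document_workflow("data_ingestion/spaceeye/spaceeye_interpolation")

run = client.run(
    "data_ingestion/spaceeye/spaceeye_interpolation",
    "spaceeye-rerun-with-pc-key",
    geometry=shpg.box(-88.10, 41.50, -88.00, 41.60),  # placeholder region
    time_range=(datetime(2022, 6, 1), datetime(2022, 7, 1)),  # placeholder dates
    parameters={"pc_key": "<YOUR-PLANETARY-COMPUTER-KEY>"},  # assumed parameter name
)

# Table with the status of every task across runs (the table requested above)
client.monitor()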

Hi @rafaspadilha, below are the answers to your queries.

  • The cloud and shadow tasks are in the running state, while all subsequent jobs are in the pending state. I have shared the log file and a screenshot of client.monitor().

  • It does not occur when we run the workflow on a new region/time range that is not cached. When we execute the workflow for the first time in a new region or time period, it completes successfully.

  • No, we do not pass the Planetary Computer key as a parameter to the data_ingestion/spaceeye/spaceeye_interpolation workflow. But I'm not sure it is mandatory, because if it were, I doubt the workflow would have completed smoothly the first time around.

  • I have attached the log files for your reference:
    logs.zip

Hey, @click2cloud-sagarB. Looking through the logs, it seems like there was an error during communication between the orchestrator and cache pods.

Could you please try deleting the cache pod and re-running the workflows?

You can do that with:

$ ~/.config/farmvibes-ai/kubectl delete pods -l app=terravibes-cache

Let me know if this solves the issue.

Hi @rafaspadilha, deleting the cache pod temporarily resolves the issue for one rerun, but it reoccurs on the next rerun.