microsoft / farmvibes-ai

FarmVibes.AI: Multi-Modal GeoSpatial ML Models for Agriculture and Sustainability

Home Page: https://microsoft.github.io/farmvibes-ai/

Issue with SpaceEye Pipeline Stalling on Rerun

click2cloud-sagarB opened this issue · comments

In which step did you encounter the bug?

Workflow execution

Are you using a local or a remote (AKS) FarmVibes.AI cluster?

Local cluster

Bug description

Dear FarmVibes.AI Team,
In the most recent release of FarmVibes.AI, we've noticed that the SpaceEye pipeline stalls on rerun at the tasks "spaceeye.preprocess.cloud.cloud" and "spaceeye.preprocess.cloud.shadow", and all subsequent jobs in the SpaceEye pipeline are stuck as well.
Please let me know if you need anything more from my end.
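
For context, a typical submission of the SpaceEye interpolation workflow through the FarmVibes.AI Python client looks roughly like the sketch below; the region, time range, and run name are placeholders for illustration, not values from this report.

from datetime import datetime

from shapely import geometry as shpg
from vibe_core.client import get_default_vibe_client

# Connect to the local FarmVibes.AI cluster
client = get_default_vibe_client()

# Placeholder region of interest and time range
roi = shpg.box(-88.10, 41.50, -88.00, 41.60)
time_range = (datetime(2022, 6, 1), datetime(2022, 7, 1))

run = client.run(
    "data_ingestion/spaceeye/spaceeye_interpolation",
    "spaceeye-rerun-test",
    geometry=roi,
    time_range=time_range,
)

# Blocks and prints a live table with the status of each task in the run
run.monitor()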

Steps to reproduce the problem

No response

Hi, @click2cloud-sagarB. Thank you for raising the issue. A few questions and requests:

  • When you say "stalls", what is the status of the tasks? Are both the cloud and shadow tasks marked as queued? Could you provide a screenshot of the client.monitor() table?

  • Does this also happen if you run the workflow on a new region/time range that is not cached?

  • Are you passing the Planetary Computer key as a parameter to the workflow? We recently changed this, and the PC key is now a required parameter (see the sketch after this list).

  • Could you provide the logs for the orchestrator, cache, and worker pods so we can investigate what might be happening?
    They are located in the logs folder of your storage (e.g., ~/.cache/farmvibes-ai/logs).
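
As a rough sketch of the monitoring and PC-key points above, assuming the Planetary Computer key is exposed through a pc_key workflow parameter (the parameter name is an assumption here; client.document_workflow() lists the actual parameters), a rerun with the key passed explicitly would look like this:

from datetime import datetime

from shapely import geometry as shpg
from vibe_core.client import get_default_vibe_client

client = get_default_vibe_client()

# Print the workflow documentation to confirm the exact parameter name
client.document_workflow("data_ingestion/spaceeye/spaceeye_interpolation")

run = client.run(
    "data_ingestion/spaceeye/spaceeye_interpolation",
    "spaceeye-rerun-with-pc-key",
    geometry=shpg.box(-88.10, 41.50, -88.00, 41.60),  # placeholder region
    time_range=(datetime(2022, 6, 1), datetime(2022, 7, 1)),  # placeholder dates
    parameters={"pc_key": "<YOUR-PLANETARY-COMPUTER-KEY>"},  # assumed parameter name
)

# Table with the status of every task across runs (the table requested above)
client.monitor()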

Hi @rafaspadilha, below are the answers to your queries.

  • The cloud and shadow tasks are in the running state, while all subsequent jobs are in the pending state. I have shared the log file and a screenshot of client.monitor().

  • It does not occur when we run the workflow on a new region/time range that is not cached. When we execute the workflow for the first time in a new region or time period, it completes successfully.

  • No, we do not pass the Planetary Computer key as a parameter to the data_ingestion/spaceeye/spaceeye_interpolation workflow. But I'm not sure it is mandatory, because if it were, I doubt the workflow would have completed smoothly the first time around.

  • I have attached the log files for your reference:
    logs.zip

Hey, @click2cloud-sagarB. Looking through the logs, it seems like there was an error during communication between the orchestrator and cache pods.

Could you please try deleting the cache pod and re-running the workflows?

You can do that with:

$ ~/.config/farmvibes-ai/kubectl delete pods -l app=terravibes-cache

Let me know if this solves the issue.

Hi @rafaspadilha, deleting the cache pod temporarily resolves the issue for one rerun, but it reoccurs on the next rerun.