microsoft / farmvibes-ai

FarmVibes.AI: Multi-Modal GeoSpatial ML Models for Agriculture and Sustainability

Home Page: https://microsoft.github.io/farmvibes-ai/

spaceeye.spaceeye.spaceeye in crop segmentation notebook crashes the cluster

Amr-MKamal opened this issue · comments

After setting up the Crop Segmentation environment & running the Crop Segmentation notebook multiple times, it now causes the farmvibes-ai (kube) cluster to fail: it becomes unresponsive on port 30000, yet farmvibes-ai.sh status shows it running, and restart doesn't work (the terminal hangs). Running farmvibes-ai.sh stop and then start solves the issue, yet after a while the same issue reproduces itself.
This is being tried on an Azure VM (Standard D16s v3: 16 cores, 64 GB RAM, 2 TB SSD).
Steps to reproduce:
1. $ conda env create -f ./crop_env.yaml
2. $ conda activate crop-seg
3. $ jupyter notebook --allow-root -ip 0.0.0.0
4. Then, from the Jupyter notebook, run 01_dataset_generation.
This has been repeated multiple times using $ farmvibes-ai.sh stop and then $ farmvibes-ai.sh start; the current free space at this point is 133 GB.
[screenshot attached: cropsegerror]
It would be a great help to have an idea of how long this task (and similar data-generation tasks) takes to complete on average on a Standard D16s v3.
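When the cluster gets into the "farmvibes-ai.sh status says running but port 30000 does not answer" state, a quick probe of the REST API can confirm whether the service is actually responding. Below is a minimal sketch; the /v0/runs route is an assumption based on the Python client's default endpoints, so adjust it if your deployment exposes a different path.

# Probe the local FarmVibes.AI REST API (default port 30000).
# The /v0/runs route is assumed; the timeout distinguishes "hung" from "slow".
import requests

try:
    resp = requests.get("http://localhost:30000/v0/runs", timeout=10)
    print("API responded:", resp.status_code)
except requests.exceptions.RequestException as err:
    print("Cluster not responding on port 30000:", err)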

Update 2: after scaling the VM SSD again to ~4 TB & restarting the cluster, it has kept running successfully since then. However, the cluster seems to have taken an extra ~700 GB so far, so the suggested 500 GB is clearly not enough for the given example.

Update 3: the spaceeye.spaceeye.spaceeye task has finally finished, yet upon completion it threw this error & the workflow failed:
"reason": "KeyError: 'b806ef61-0164-4ef9-9cec-ccda9682ac9c'\n File \"/opt/conda/lib/python3.8/site-packages/vibe_common/messaging.py\", line 491, in accept_or_fail_event\n return success_callback(message)\n\n File \"/opt/conda/lib/python3.8/site-packages/vibe_server/orchestrator.py\", line 422, in success_callback\n self.inqueues[str(message.run_id)].put(message)\n",
After rerunning the same workflow, the task spaceeye.spaceeye.spaceeye was done & spaceeye.spaceeye.split was queued, pending nothing, so I restarted the cluster. Yet after restarting, it now responds with HTTPError: 500 Server Error for both the API & the Python client; stop & start also get stuck here.
[screenshot attached]

The KeyError you saw was due to the restart of the cluster, in which the orchestrator lost the context about the existing run.

Can you please share the log files stored in ~/.cache/farmvibes-ai/logs? If possible, please also share the output of docker logs k3d-farmvibes-ai-server-0 and df -h.

Thanks!
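For convenience, the requested diagnostics can be gathered into one folder before attaching them. This is only a helper sketch, not part of FarmVibes.AI; the paths and container name are the ones mentioned above.

# Collect FarmVibes.AI diagnostics (service logs, k3d server logs, disk usage).
import shutil
import subprocess
from pathlib import Path

out = Path("farmvibes-diagnostics")
out.mkdir(exist_ok=True)

# Copy the service logs from the default cache location.
shutil.copytree(Path.home() / ".cache" / "farmvibes-ai" / "logs", out / "logs", dirs_exist_ok=True)

# Capture the k3d server container logs (stdout and stderr) and disk usage.
docker = subprocess.run(["docker", "logs", "k3d-farmvibes-ai-server-0"], capture_output=True)
(out / "docker-logs.txt").write_bytes(docker.stdout + docker.stderr)
(out / "df.txt").write_bytes(subprocess.run(["df", "-h"], capture_output=True).stdout)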

That makes sense, because eventually everything I ran was queued pending nothing. Which file exactly?

I will assume "terravibes-orchestrator.log". It's a very large file; if you're looking for a specific error, it will be easier to search for it.

{"app_id": "terravibes-orchestrator", "instance": "terravibes-orchestrator-656cbf6fff-kjb9b", "level": "INFO", "msg": "Changed op spaceeye.spaceeye.group_s2 status to pending. (run id: 76fd498a-c476-492b-9735-d5e9bab49f2c)", "scope": "vibe_server.orchestrator.WorkflowStateUpdate", "time": "2023-02-28 09:21:15,480", "type": "log", "ver": "dev"}
{"app_id": "terravibes-orchestrator", "instance": "terravibes-orchestrator-656cbf6fff-kjb9b", "level": "ERROR", "msg": "Marking op spaceeye.spaceeye.group_s2 as failed, but it didn't have a start time set. (run id: 76fd498a-c476-492b-9735-d5e9bab49f2c)", "scope": "vibe_server.orchestrator.WorkflowStateUpdate", "time": "2023-02-28 09:21:15,480", "type": "log", "ver": "dev"}
{"app_id": "terravibes-orchestrator", "instance": "terravibes-orchestrator-656cbf6fff-kjb9b", "level": "INFO", "msg": "Workflow 76fd498a-c476-492b-9735-d5e9bab49f2c changed status to failed\nReason: WORKFLOW_FAILED status propagated from workflow level", "scope": "vibe_server.orchestrator.WorkflowStateUpdate", "time": "2023-02-28 09:21:15,480", "type": "log", "ver": "dev"}
{"app_id": "terravibes-orchestrator", "instance": "terravibes-orchestrator-656cbf6fff-kjb9b", "level": "DEBUG", "msg": "Starting new HTTP connection (1): 127.0.0.1:3500", "scope": "urllib3.connectionpool", "time": "2023-02-28 09:21:15,481", "type": "log", "ver": "dev"}
{"app_id": "terravibes-orchestrator", "instance": "terravibes-orchestrator-656cbf6fff-kjb9b", "level": "DEBUG", "msg": "http://127.0.0.1:3500 \"GET /v1.0/state/statestore/76fd498a-c476-492b-9735-d5e9bab49f2c?metadata.partitionKey=eywa HTTP/1.1\" 200 2967", "scope": "urllib3.connectionpool", "time": "2023-02-28 09:21:15,482", "type": "log", "ver": "dev"}
{"app_id": "terravibes-orchestrator", "instance": "terravibes-orchestrator-656cbf6fff-kjb9b", "level": "DEBUG", "msg": "Starting new HTTP connection (1): 127.0.0.1:3500", "scope": "urllib3.connectionpool", "time": "2023-02-28 09:21:15,493", "type": "log", "ver": "dev"}
{"app_id": "terravibes-orchestrator", "instance": "terravibes-orchestrator-656cbf6fff-kjb9b", "level": "DEBUG", "msg": "http://127.0.0.1:3500 \"POST /v1.0/state/statestore HTTP/1.1\" 204 0", "scope": "urllib3.connectionpool", "time": "2023-02-28 09:21:15,494", "type": "log", "ver": "dev"}
{"app_id": "terravibes-orchestrator", "instance": "terravibes-orchestrator-656cbf6fff-kjb9b", "level": "INFO", "msg": "Changed op spaceeye.preprocess.cloud.merge status to pending. (run id: 76fd498a-c476-492b-9735-d5e9bab49f2c)", "scope": "vibe_server.orchestrator.WorkflowStateUpdate", "time": "2023-02-28 09:21:15,496", "type": "log", "ver": "dev"}
{"app_id": "terravibes-orchestrator", "instance": "terravibes-orchestrator-656cbf6fff-kjb9b", "level": "ERROR", "msg": "Marking op spaceeye.preprocess.cloud.merge as failed, but it didn't have a start time set. (run id: 76fd498a-c476-492b-9735-d5e9bab49f2c)", "scope": "vibe_server.orchestrator.WorkflowStateUpdate", "time": "2023-02-28 09:21:15,496", "type": "log", "ver": "dev"}

and for df -h:

Filesystem      Size  Used Avail Use% Mounted on
/dev/root       3.9T  2.7T  1.3T  70% /
tmpfs            32G  4.0K   32G   1% /dev/shm
tmpfs            13G  1.9M   13G   1% /run
tmpfs           5.0M     0  5.0M   0% /run/lock
/dev/sda15      105M  6.1M   99M   6% /boot/efi
/dev/sdb1       126G   28K  120G   1% /mnt
tmpfs           6.3G  128K  6.3G   1% /run/user/124
tmpfs           6.3G  140K  6.3G   1% /run/user/1000

This is after I destroyed the cluster & renamed the old cache folder (the only way new workflows would work).
And for docker logs k3d-farmvibes-ai-server-0:

Couldn't collect info from any of the files in "/etc/machine-id,/var/lib/dbus/machine-id"
I0228 22:41:04.152242       7 trace.go:205] Trace[81119723]: "GuaranteedUpdate etcd3" type:*core.ConfigMap (28-Feb-2023 22:41:03.618) (total time: 533ms):
Trace[81119723]: ---"Transaction committed" 533ms (22:41:04.152)
Trace[81119723]: [533.657546ms] [533.657546ms] END
I0228 22:41:04.152421       7 trace.go:205] Trace[956170375]: "Update" url:/api/v1/namespaces/dapr-system/configmaps/operator.dapr.io,user-agent:operator/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election,audit-id:2e418883-fb2b-443c-a6c2-6fcd08f30186,client:10.42.0.62,accept:application/json, */*,protocol:HTTP/2.0 (28-Feb-2023 22:41:03.618) (total time: 534ms):
Trace[956170375]: ---"Object stored in database" 533ms (22:41:04.152)
Trace[956170375]: [534.096151ms] [534.096151ms] END
W0228 22:41:07.369204       7 info.go:53] Couldn't collect info from any of the files in "/etc/machine-id,/var/lib/dbus/machine-id"

Update 4: as a last try, I reran the workflow with a much smaller input region inside the given input region (only ~20 km); luckily it worked & completed successfully in 7 hours.
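For reference, shrinking the input region can be done directly on the notebook's geometry before submitting the workflow. The sketch below is hypothetical: it assumes the region is a shapely polygon (as in the crop segmentation notebook), and the names region, client, workflow, and time_range are placeholders, not values from the notebook.

# Replace the full input polygon with a small square around its centroid.
# A buffer of ~0.1 degree gives a box roughly 20 km across; adjust as needed.
from shapely.geometry import Polygon

def shrink_region(region: Polygon, buffer_deg: float = 0.1) -> Polygon:
    return region.centroid.buffer(buffer_deg, cap_style=3)  # cap_style=3 -> square buffer

# Hypothetical usage inside the notebook:
# small_region = shrink_region(region)
# run = client.run(workflow, "crop-seg-small-area", geometry=small_region, time_range=time_range)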

I think you can attach the files if you drag and drop them to the text area.

For the log files, at least the terravibes-orchestrator.log, but ideally all of them in the directory.

For the output of docker logs, you can save it to a file with docker logs k3d-farmvibes-ai-server-0 > docker-logs.txt.

docker-logs.txt

I already tried that: terravibes-orchestrator.log is a 15.3 MB file, and GitHub refused it even after changing it to .txt; also, terravibes-restapi.log is around 700 MB. If these files are still important, I can try uploading them to a public cloud.
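If uploading elsewhere is inconvenient, compressing the logs may bring them under GitHub's attachment size limit (these JSON-lines logs compress well, and zip archives can be attached). A small sketch, assuming the default log location:

# Zip the whole FarmVibes.AI log directory for attaching to the issue.
import shutil
from pathlib import Path

logs_dir = Path.home() / ".cache" / "farmvibes-ai" / "logs"
shutil.make_archive("farmvibes-logs", "zip", str(logs_dir))  # creates farmvibes-logs.zip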

Update 5: as explained above, I was able to work around the error, especially by working with smaller areas. However, I still get the error below when the resource already exists, which eventually requires the notebook to be run multiple times (in the next run the task is done immediately); this makes it harder to integrate with as an API.

"reason": "RuntimeError: Received unsupported message header=MessageHeader(type=<MessageType.error: 'error'>, run_id=UUID('18e9dbdc-1f9c-4ffa-bd85-4455e2a0518c'), id='00-18e9dbdc1f9c4ffabd854455e2a0518c-140e43094eee796e-01', parent_id='00-18e9dbdc1f9c4ffabd854455e2a0518c-b87c84a62262a97c-01', version='1.0', created_at=datetime.datetime(2023, 3, 8, 23, 38, 16, 110840)) content=ErrorContent(status=<OpStatusType.failed: 'failed'>, ename=\"<class 'RuntimeError'>\", evalue='Traceback (most recent call last):\\n File \"/opt/conda/lib/python3.8/site-packages/vibe_agent/worker.py\", line 123, in run_op\\n return factory.build(spec).run(input, cache_info)\\n File \"/opt/conda/lib/python3.8/site-packages/vibe_agent/ops.py\", line 99, in run\\n items_out = self.storage.store(run_id, stac_results, cache_info)\\n File \"/opt/conda/lib/python3.8/site-packages/vibe_agent/storage/local_storage.py\", line 135, in store\\n raise LocalResourceExistsError(\\nvibe_agent.storage.local_storage.LocalResourceExistsError: Op output already exists in storage for download_sentinel2_from_pc with id 23bdca2b-99ce-498e-bd19-ced731cc545e.\\n', traceback=[' File \"/opt/conda/lib/python3.8/site-packages/vibe_agent/worker.py\", line 309, in run_op_from_message\\n out = self.run_op_with_retry(content, message.run_id)\\n', ' File \"/opt/conda/lib/python3.8/site-packages/vibe_agent/worker.py\", line 402, in run_op_with_retry\\n raise RuntimeError(\"\".join(ret.format()))\\n']). Aborting execution.", "status": "failed"
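Until the LocalResourceExistsError itself is addressed, one way to integrate this behavior behind an API is to resubmit a run that ends in a failed state, since the next submission finds the cached op output and completes almost immediately. The sketch below is only an illustration: it assumes the vibe_core client interface used in the notebooks (get_default_vibe_client, client.run, run.monitor), and the way the failed status is represented may differ between versions.

# Resubmit a workflow run if it finishes in a failed state.
import time
from vibe_core.client import get_default_vibe_client

def run_with_resubmit(workflow, name, geometry, time_range, max_attempts=3):
    client = get_default_vibe_client()
    for attempt in range(1, max_attempts + 1):
        run = client.run(workflow, f"{name}-attempt-{attempt}", geometry=geometry, time_range=time_range)
        run.monitor()  # blocks until the run reaches a terminal state
        if "failed" not in str(run.status).lower():  # status representation is assumed here
            return run
        print(f"Attempt {attempt} failed, resubmitting...")
        time.sleep(30)  # give the cluster a moment before retrying
    raise RuntimeError(f"Workflow {workflow} still failing after {max_attempts} attempts")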