microsoft / farmvibes-ai

FarmVibes.AI: Multi-Modal GeoSpatial ML Models for Agriculture and Sustainability

Home Page:https://microsoft.github.io/farmvibes-ai/

Error installing a Remote Cluster in Azure

gussabina opened this issue · comments

Hello:
I tried to install the remote cluster in Azure, but I always get these errors:

2023-12-27 08:16:58,800 - INFO - Validating that Microsoft.Network is available in the subscription selected
2023-12-27 08:17:08,269 - INFO - Validating that Microsoft.Storage is available in the subscription selected
2023-12-27 08:17:16,585 - INFO - Validating that Microsoft.Compute is available in the subscription selected
2023-12-27 08:17:25,739 - INFO - Validating that Total Regional vCPUs has enough resources in region westus
2023-12-27 08:17:31,791 - INFO - Validating that Standard DSv3 Family vCPUs has enough resources in region westus
2023-12-27 08:17:36,101 - INFO - Validating that Standard BS Family vCPUs has enough resources in region westus
2023-12-27 08:17:41,079 - INFO - Getting current user name...
2023-12-27 08:17:43,020 - INFO - Verifying cluster already exists...
2023-12-27 08:17:47,309 - ERROR - Unable to run command ['C:\Program Files (x86)\Microsoft SDKs\Azure\CLI2\wbin\az.CMD', 'aks', 'show', '-n', 'farmvibes-aks', '-g', 'restapi-rg', '-o', 'tsv'].

2023-12-27 08:17:47,309 - INFO - Will create cluster farmvibes-aks in resource group restapi-rg...
2023-12-27 08:17:49,642 - INFO - Found Signing System with id xxxxxxxx-xxxx-xxxx-xxxx-xxxxx as current subscription
Is this the correct Azure subscription you would like to use? Signing System (y/n): y
2023-12-27 08:18:43,476 - INFO - Initializing terraform in...

And setup continues...

2023-12-27 08:20:34,674 - INFO - terraform.EXE:
2023-12-27 08:20:34,674 - INFO - terraform.EXE: Apply complete! Resources: 2 added, 0 changed, 0 destroyed.
2023-12-27 08:20:34,678 - INFO - terraform.EXE:
2023-12-27 08:20:41,975 - INFO - JSONDecodeError: Expecting value: line 1 column 1 (char 0)
2023-12-27 08:20:41,976 - INFO - Failed to create cluster. Cleaning up...

Do you wish the keep the cluster (Answering 'y' will leave the cluster as is)? (y/n):

At this point, the installation fails...

I have updated the CLI and it works fine. Not sure why the first error...
Regarding the JSONDecodeError, what should I look at?

Thanks
Gus

Hi there.

Can you please share some information regarding this issue?

  1. Are you using Command Prompt or PowerShell?
  2. Please provide the command you used to set up farmvibes-ai.
  3. What is the output of this command?
C:\Program Files (x86)\Microsoft SDKs\Azure\CLI2\wbin\az.CMD aks show -n farmvibes-aks -g restapi-rg -o tsv

Thanks

Hello:

These are the steps I'm using for the remote installation:

  1. In PowerShell, I log in using az login

  2. I run the following command to install it:

farmvibes-ai remote setup --region westus --cert-email myemail@gmail.com --resource-group restapi-rg

2024-01-09 16:56:12,074 - INFO - Validating that Microsoft.DocumentDB is available in the subscription selected
2024-01-09 16:56:15,864 - INFO - Validating that Microsoft.KeyVault is available in the subscription selected
2024-01-09 16:56:19,809 - INFO - Validating that Microsoft.ContainerService is available in the subscription selected
2024-01-09 16:56:23,987 - INFO - Validating that Microsoft.Network is available in the subscription selected
2024-01-09 16:56:28,149 - INFO - Validating that Microsoft.Storage is available in the subscription selected
2024-01-09 16:56:31,989 - INFO - Validating that Microsoft.Compute is available in the subscription selected
2024-01-09 16:56:36,291 - INFO - Validating that Total Regional vCPUs has enough resources in region westus
2024-01-09 16:56:39,364 - INFO - Validating that Standard DSv3 Family vCPUs has enough resources in region westus
2024-01-09 16:56:41,589 - INFO - Validating that Standard BS Family vCPUs has enough resources in region westus
2024-01-09 16:56:43,786 - INFO - Getting current user name...
2024-01-09 16:56:44,804 - INFO - Verifying cluster already exists...
2024-01-09 16:56:46,899 - ERROR - Unable to run command ['C:\Program Files (x86)\Microsoft SDKs\Azure\CLI2\wbin\az.CMD', 'aks', 'show', '-n', 'farmvibes-aks', '-g', 'restapi-rg', '-o', 'tsv'].

2024-01-09 16:56:46,900 - INFO - Will create cluster farmvibes-aks in resource group restapi-rg...
2024-01-09 16:56:47,943 - INFO - Found Signing System with id 6db07b7f-a16a-4fc6-9ecb-ab972fxxxxxx as current subscription
Is this the correct Azure subscription you would like to use? Signing System (y/n): y
2024-01-09 17:01:35,892 - INFO - Initializing terraform in C:\Users\username.config\farmvibes-ai\terraform-user\aks\modules\rg
2024-01-09 17:01:37,137 - INFO - terraform.EXE:
2024-01-09 17:01:37,137 - INFO - terraform.EXE: Initializing the backend...
2024-01-09 17:01:38,700 - INFO - terraform.EXE:
2024-01-09 17:01:38,700 - INFO - terraform.EXE: Initializing provider plugins...
2024-01-09 17:01:38,700 - INFO - terraform.EXE: - Finding hashicorp/azurerm versions matching "3.46.0"...
2024-01-09 17:01:38,939 - INFO - terraform.EXE: - Finding hashicorp/random versions matching "3.1.0"...
2024-01-09 17:01:39,067 - INFO - terraform.EXE: - Using previously-installed hashicorp/random v3.1.0
2024-01-09 17:01:39,628 - INFO - terraform.EXE: - Using previously-installed hashicorp/azurerm v3.46.0
2024-01-09 17:01:39,637 - INFO - terraform.EXE:
2024-01-09 17:01:39,638 - INFO - terraform.EXE: Terraform has been successfully initialized!
2024-01-09 17:01:39,638 - INFO - terraform.EXE:
2024-01-09 17:01:39,638 - INFO - terraform.EXE: You may now begin working with Terraform. Try running "terraform plan" to see
2024-01-09 17:01:39,638 - INFO - terraform.EXE: any changes that are required for your infrastructure. All Terraform commands
2024-01-09 17:01:39,639 - INFO - terraform.EXE: should now work.
2024-01-09 17:01:39,639 - INFO - terraform.EXE:
2024-01-09 17:01:39,639 - INFO - terraform.EXE: If you ever set or change modules or backend configuration for Terraform,
2024-01-09 17:01:39,639 - INFO - terraform.EXE: rerun this command to reinitialize your working directory. If you forget, other
2024-01-09 17:01:39,640 - INFO - terraform.EXE: commands will detect it and remind you to do so if necessary.
2024-01-09 17:01:39,645 - INFO - Creating resource group if necessary...
2024-01-09 17:01:40,560 - INFO - Applying terraform in C:\Users\username.config\farmvibes-ai\terraform-user\aks\modules\rg
2024-01-09 17:01:45,704 - INFO - terraform.EXE: random_string.name_suffix: Refreshing state... [id=veqrk]
2024-01-09 17:02:05,154 - INFO - terraform.EXE: azurerm_resource_group.resourcegroup: Refreshing state... [id=/subscriptions/6db07b7f-a16a-4fc6-9ecb-ab972xxxxx/resourceGroups/restapi-rg]
2024-01-09 17:02:06,165 - INFO - terraform.EXE:
2024-01-09 17:02:06,165 - INFO - terraform.EXE: No changes. Your infrastructure matches the configuration.
2024-01-09 17:02:06,165 - INFO - terraform.EXE:
2024-01-09 17:02:06,165 - INFO - terraform.EXE: Terraform has compared your real infrastructure against your configuration
2024-01-09 17:02:06,166 - INFO - terraform.EXE: and found no differences, so no changes are needed.
2024-01-09 17:02:06,177 - INFO - terraform.EXE:
2024-01-09 17:02:06,177 - INFO - terraform.EXE: Apply complete! Resources: 0 added, 0 changed, 0 destroyed.
2024-01-09 17:02:06,178 - INFO - terraform.EXE:
2024-01-09 17:02:09,215 - INFO - JSONDecodeError: Expecting value: line 1 column 1 (char 0)
2024-01-09 17:02:09,215 - INFO - Failed to create cluster. Cleaning up...

Do you wish the keep the cluster (Answering 'y' will leave the cluster as is)? (y/n): n
2024-01-09 17:02:53,843 - INFO - Destroying cluster...
2024-01-09 17:02:53,844 - INFO - Verifying if group still exists...
2024-01-09 17:02:55,710 - INFO - Group exists. Requesting destruction (this may take some time)...
2024-01-09 17:02:57,544 - INFO - Destroying resource group, as it was created by us...
2024-01-09 17:03:14,688 - INFO - Cluster destroyed.

As you can see above, there are two errors. The first one occurs when the installer runs the command you asked about in your last question.

Here is the output from running that command:

PS C:\Program Files (x86)\Microsoft SDKs\Azure\CLI2\wbin> az aks show -n farmvibes-aks -g restapi-rg -o tsv
(ResourceNotFound) The Resource 'Microsoft.ContainerService/managedClusters/farmvibes-aks' under resource group 'restapi-rg' was not found. For more details please go to https://aka.ms/ARMResourceNotFoundFix
Code: ResourceNotFound
Message: The Resource 'Microsoft.ContainerService/managedClusters/farmvibes-aks' under resource group 'restapi-rg' was not found. For more details please go to https://aka.ms/ARMResourceNotFoundFix
PS C:\Program Files (x86)\Microsoft SDKs\Azure\CLI2\wbin>

Thanks

Hi. The installer should have written more detailed logs in C:\Users\username\.cache\farmvibes-ai\farmvibes-ai-remote.log.

Please check if that file exists. It might have more meaningful data for us to try and figure why your install failed.

If you'd like, please share that file here, otherwise, all we know is that, for some reason, the installer failed to parse the output of the Azure CLI.
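For context, `Expecting value: line 1 column 1 (char 0)` is the message Python's `json` module gives when it is asked to parse an empty string, or text that does not start with a JSON value. The sketch below is not the installer's code; `parse_cli_output` is a hypothetical helper that shows how the raw text could be surfaced when parsing fails.

```python
import json


def parse_cli_output(raw: str):
    """Parse CLI output as JSON, reporting the raw text on failure.

    Hypothetical helper for illustration; not part of farmvibes-ai.
    """
    try:
        return json.loads(raw)
    except json.JSONDecodeError as err:
        # Include a preview of what was actually received, so the failure
        # is easier to diagnose than the bare decoder message.
        preview = raw[:80]
        raise ValueError(
            f"Not JSON ({err.msg} at char {err.pos}): {preview!r}"
        ) from err


# Empty output reproduces the message seen in the log.
try:
    json.loads("")
except json.JSONDecodeError as err:
    print(err)  # Expecting value: line 1 column 1 (char 0)
```

Seeing the raw text alongside the decoder message usually makes it clear whether the command printed nothing, printed an error, or printed something that only looks like JSON.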

Thanks

Hello:
Attached is the requested log.
I look forward to your comments.

Regards;
Gus
farmvibes-ai-remote.log

Hello Renato:

I uninstalled the 32-bit version and installed the 64-bit of Azure CLI2.
I ran the installer and got similar errors.
Please find the attached log for your review.

Thanks
Gus
farmvibes-ai-remote.log

I took a look at the logs, but didn't see any obvious errors in them.

I'm wondering whether this could be due to the kubernetes version.

Can you please update the src/vibe_core/vibe_core/terraform/aks/modules/infra/kubernetes.tf and change the kubernetes_version variable to 1.27.7, and then reinstall vibe_core?

Thanks,
Renato.

Hello Renato:

Just to clarify: I'm running the installation in a Windows machine, and trying to install the remote cluster in Azure.

After updating the version in kubernetes.tf to 1.27.7 and reinstalling vibe_core, I'm not seeing any change (the same errors continue) when running the remote cluster installation.
Quick question: how does reinstalling vibe_core affect the Kubernetes version if Kubernetes runs in the Azure environment? Reinstalling vibe_core did not appear to update any requirement.

Thanks
Gus

Hi Gus,

When you install vibe_core, pip copies the terraform definition files from the repo to your python site packages.

The terraform files contain a declarative specification (with some variables, set at runtime from the command line arguments) of what the target environments should contain once setup is complete.

Changing the terraform files and reinstalling should make the farmvibes-ai script use that particular kubernetes version when building a new AKS cluster.
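One way to check whether a reinstall actually picked up the edited file is to look at the copy inside the installed package. The sketch below makes several assumptions: the package is importable as `vibe_core`, the file sits at the same relative path as in the repo, and the version is written as a plain `kubernetes_version = "..."` assignment. Both function names are hypothetical. The logs in this thread also show Terraform running from a per-user directory, so this only inspects the pip-installed copy.

```python
import importlib.util
import re
from pathlib import Path
from typing import List, Optional

# Relative path taken from the repo layout mentioned in this thread (assumed
# to be preserved inside the installed package).
TF_RELATIVE = Path("terraform/aks/modules/infra/kubernetes.tf")


def installed_tf_file(package: str = "vibe_core") -> Optional[Path]:
    """Return the path of kubernetes.tf inside the installed package, if any."""
    spec = importlib.util.find_spec(package)
    if spec is None or spec.origin is None:
        return None
    candidate = Path(spec.origin).parent / TF_RELATIVE
    return candidate if candidate.is_file() else None


def kubernetes_versions(tf_text: str) -> List[str]:
    """Extract version strings assigned to kubernetes_version in the text."""
    pattern = re.compile(r'kubernetes_version\s*=\s*"(\d+(?:\.\d+)+)"')
    return pattern.findall(tf_text)


if __name__ == "__main__":
    path = installed_tf_file()
    if path is None:
        print("kubernetes.tf not found in the installed package")
    else:
        print(path, kubernetes_versions(path.read_text(encoding="utf-8")))
```

If the printed version still shows the old value, the reinstall did not replace the file that the script reads.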

——

I guessed the kubernetes version needed changing because some Azure regions deprecated the version we were using (that version will be updated in our next PR).

Still, as I mentioned before, I couldn’t spot any obvious errors in the log files you shared. I’ll investigate a bit more.

Thanks for your patience,
Renato.

Hello @gussabina, I've found a couple encoding bugs in the code which prevented the AKS cluster setup code from completing.

I'm working on a PR that will fix them.
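The thread does not say what the encoding bugs were. Purely as an illustration of this class of problem (an assumption, not the actual fix): command output captured as bytes can carry a UTF-8 byte order mark, and decoding it as plain UTF-8 leaves a character at the start that makes JSON parsing fail. `decode_cli_json` is a hypothetical helper.

```python
import json


def decode_cli_json(raw: bytes):
    """Decode bytes from a CLI call and parse them as JSON.

    Hypothetical helper; assumes the tool emits UTF-8, possibly with a BOM.
    """
    # "utf-8-sig" strips a leading byte order mark if present and behaves
    # like plain UTF-8 otherwise.
    text = raw.decode("utf-8-sig").strip()
    if not text:
        raise ValueError("command produced no output to parse")
    return json.loads(text)


payload = '{"name": "farmvibes-aks"}'.encode("utf-8")
with_bom = b"\xef\xbb\xbf" + payload

# Decoding with plain UTF-8 keeps the BOM as U+FEFF, which json rejects.
try:
    json.loads(with_bom.decode("utf-8"))
except json.JSONDecodeError as err:
    print("plain utf-8 failed:", err.msg)

print(decode_cli_json(with_bom))
```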

Thanks,
Renato.

Hello Renato:
This is great! Thanks for letting me know.
My expectation for the Azure AKS cluster is that all the heavy processing will run on demand. Is this right?
Since many of the processes require a lot of time and resources, I would prefer an on-demand approach rather than a big, expensive VM running for a couple of days.
I'm not sure my assumptions are correct, but I'm going to figure it out.

Regards;
Gus

Hello Gus,

Sort of. Right now, we set up two pools of VMs on Azure when setting up AKS:

  1. A node pool for Kubernetes management services
  2. A node pool for our heavyweight workers

For both node pools, we have auto-scaling enabled, which means that we will only allocate VMs that are actually needed for performing work.

The Standard_B4ms VMs are relatively cheap, but since they are required by Kubernetes itself, they will never auto-scale to zero.

The Standard_D8s_v3 VMs are more expensive, and for now we configure AKS to scale down to a minimum of one. (Although you could, in principle, update the cluster definition to allow scaling to zero.)

So, your assumption is true in the sense that we auto-scale VMs to the minimum kubernetes deems necessary.

On the other hand, our pods do not auto-scale, which means that if you want to cut costs, you need to scale them manually. (For example, with kubectl scale deployment terravibes-worker --replicas=0.)

To be able to auto-scale pods to zero, we need to make some changes to the backend that are not on our roadmap right now (however, that might change if we see demand for that feature).
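The manual scaling step mentioned above can be wrapped in a small script. This is a minimal sketch, assuming `kubectl` is on the PATH and already configured for the AKS cluster; the function names are hypothetical, and only the deployment name `terravibes-worker` comes from this thread.

```python
import subprocess
from typing import List


def scale_command(deployment: str, replicas: int) -> List[str]:
    """Build the kubectl argument list for scaling a deployment."""
    if replicas < 0:
        raise ValueError("replicas must be zero or greater")
    return ["kubectl", "scale", "deployment", deployment, f"--replicas={replicas}"]


def scale(deployment: str, replicas: int) -> None:
    """Run kubectl; raises CalledProcessError if kubectl reports failure."""
    subprocess.run(scale_command(deployment, replicas), check=True)


# Stop the workers when idle, then bring one back before submitting work:
# scale("terravibes-worker", 0)
# scale("terravibes-worker", 1)
```

Passing the command as a list avoids shell quoting problems, and keeping the command builder separate from the call makes it possible to inspect what will run before touching the cluster.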

Hello @gussabina, we just merged a new PR that should fix the issues you were having.

It'd be great if you could pull the changes and give them a try.

Thanks!

Hello @renatolfc

With the latest changes, everything worked fine! Thanks for the support!
BTW, what is the entry point for the API? I used to look at
https://farmvibes-aks-xxxxxx-dns.westus.cloudapp.azure.com/v0/docs
to try the API from the browser, but I'm getting the "Failed to load API definition" error now...

Regards;
Gus

We found an issue with the JSON schema that is breaking the generation of the API documentation page (previously reported in #131), @gussabina. We have fixed it internally, and a PR with the fix will be coming in the next few days.

@rafaspadilha @brsilvarec @renatolfc - I'm getting the error below while trying to run the setup. Earlier, I ran the remote setup from a VM that I created for that purpose, because I was unable to do it from my local machine, and that setup succeeded. Due to cost, I had to destroy that VM, and I am now starting fresh, trying to create a remote AKS cluster directly from my laptop.

I ran this command: farmvibes-ai remote setup --region eastus --cert-email chetan.sharma@tavant.com --resource-group rg-farmvibes-ai

I would also like to understand the significance of the --cert-email parameter.

Your documentation says: "To create a new remote cluster, you have to provide some required arguments to farmvibes-ai remote setup: the name of the cluster, the resource group name, and the location of the cluster, as well as an email address to generate the TLS certificates for the REST API."

2024-02-01 15:04:08,999 - INFO - terraform.EXE: │ Error: Get "https://farmvibes-akskbsdns-bwxpe0w1.hcp.eastus.azmk8s.io:443/api/v1/namespaces/default": tls: failed to verify certificate: x509: certificate signed by unknown authority
2024-02-01 15:04:09,001 - INFO - terraform.EXE: │
2024-02-01 15:04:09,001 - INFO - terraform.EXE: │ with data.kubernetes_namespace.kubernetesnamespace,
2024-02-01 15:04:09,002 - INFO - terraform.EXE: │ on init.tf line 8, in data "kubernetes_namespace" "kubernetesnamespace":
2024-02-01 15:04:09,003 - INFO - terraform.EXE: │ 8: data "kubernetes_namespace" "kubernetesnamespace" {
2024-02-01 15:04:09,004 - INFO - terraform.EXE: │
2024-02-01 15:04:09,005 - INFO - terraform.EXE: ╵
2024-02-01 15:04:09,214 - INFO - terraform.EXE: Releasing state lock. This may take a few moments...
2024-02-01 15:04:10,535 - INFO - ValueError: Failed to apply terraform resources in C:\Users\chetan.sharma.config\farmvibes-ai\terraform-user\aks\modules\kubernetes
2024-02-01 15:04:10,536 - INFO - Failed to update cluster.
2024-02-01 15:04:10,537 - INFO - Skipping cluster deletion since this is an update, please try again later if the cluster is misbehaving.

Hi @renatolfc @rafaspadilha @brsilvarec can someone please respond on this thread?

Hello @chetan2309, can you please share the contents of the remote logs file? It should be located in C:\Users\<YOUR_USER_NAME>\.cache\farmvibes-ai-remote.log.

We don't save any sensitive information in that file, but we do save the user name, the subscription name, the subscription id and the tenant id you used to authenticate with Azure. If you consider that information sensitive, please replace those strings before sharing the file.

Or, if you prefer, you can email me the log file. (This will probably take longer to review, though.)

One piece of information that I find odd in the snippet you shared is that the management script is trying to update the cluster, instead of doing a setup.

To answer your question: the --cert-email argument is needed to request a TLS certificate to secure your REST API endpoint once the cluster is set up.

We use letsencrypt.org to generate TLS certificates for you, and we set up an automated process for requesting and renewing any certificates we need to create.

@chetan2309, please, do you have any updates on this?

Closing this issue for now. Feel free to reopen it if you are still experiencing this after the new FarmVibes.AI update.