google-deepmind / xmanager

A platform for managing machine learning experiments

ResourceExhausted: 429 The following quota metrics exceed quota limits

crystina-z opened this issue

Hi! Thanks for building this amazing project. I've recently been running a script with xmanager on Vertex AI using TPU v2 and v3, but I keep getting this error:

google.api_core.exceptions.ResourceExhausted: 429 The following quota metrics exceed quota limits: aiplatform.googleapis.com/custom_model_training_tpu_v2
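For anyone hitting the same error, the metric named in the message is the one to look up on the Quotas page for the region in use. Below is a minimal sketch for pulling that name out of the message text. It assumes the message format shown above, and `quota_metrics` is a hypothetical helper, not part of xmanager or google-api-core:

```python
import re

# Matches metric names such as
# aiplatform.googleapis.com/custom_model_training_tpu_v2
_METRIC_RE = re.compile(r"[\w.-]+\.googleapis\.com/\w+")


def quota_metrics(message: str) -> list:
    """Return the quota metric names found in an error message (hypothetical helper)."""
    return _METRIC_RE.findall(message)


# Message text copied from the error above.
message = (
    "429 The following quota metrics exceed quota limits: "
    "aiplatform.googleapis.com/custom_model_training_tpu_v2"
)
metrics = quota_metrics(message)
print(metrics)
```

The part after the slash (`custom_model_training_tpu_v2`) identifies the specific quota to search for, which is easier than scanning the whole Quotas page by eye.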

The error is thrown at this line - https://github.com/deepmind/xmanager/blob/v0.2.0/xmanager/cloud/vertex.py#L181.

Below are the sanity checks that I've done:

  • The service account loads fine, though it is soon reassigned to `None` because I'm requesting TPU v2 or v3.
  • tensorboard is set to an empty string.
  • self.location, self.project, pools and auth.get_bucket() all look good. The location is us-central1, and pools shows:
[machine_spec {
  machine_type: "cloud-tpu"
  accelerator_type: TPU_V2
  accelerator_count: 8
}

I've enabled the three APIs mentioned in the readme (IAM, Cloud AI Platform, Container Registry), and additionally enabled the Vertex AI API and the Cloud Resource Manager API. I also checked the Quota page on the console, which looks fine as well. It doesn't look like I'm overusing resources in the way the error message ("exceed quota limits") suggests.
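One thing worth ruling out when the Quota page shows little or no usage: a request can be rejected because the limit itself is smaller than the request, even when nothing else is running. The sketch below is purely illustrative arithmetic. The function name and the limit values are assumptions (only the 8 requested accelerators come from the pools dump above), and it is not a description of how Vertex AI evaluates quota internally:

```python
def request_fits(requested: int, in_use: int, limit: int) -> bool:
    """True if a new request fits within the remaining quota (illustrative only)."""
    return in_use + requested <= limit


requested = 8  # accelerator_count from the pools dump above

# Hypothetical limits; read the real value for the metric and region
# from the Quotas page in the console.
fits_with_zero_limit = request_fits(requested, in_use=0, limit=0)
fits_with_limit_8 = request_fits(requested, in_use=0, limit=8)
print(fits_with_zero_limit, fits_with_limit_8)
```

If the limit shown for `custom_model_training_tpu_v2` in us-central1 turns out to be lower than 8, that alone could explain the 429 despite zero current usage.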

It's been bugging me for quite a few days, and I would really appreciate it if anyone could suggest what might be going on. Thanks in advance!

Sorry, I forgot to mention this: I'm using Python 3.9 and xmanager==0.2.0. Let me know if any more info is needed from me.

I have the same issue. Can anyone help?