tensorflow / cloud

The TensorFlow Cloud repository provides APIs that will allow to easily go from debugging and training your Keras and TensorFlow code in a local environment to distributed training in the cloud.

Home Page:https://github.com/tensorflow/cloud

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Tensorflow Cloud Tutorial failed to train model

karlunho opened this issue · comments

When using the TensorFlow Cloud Tutorial https://www.tensorflow.org/cloud/tutorials/overview , the model fail to train on GCP. I've added the log file and a screenshot of the error . I'm a google employee and ldap is alankho - happy to share my GCP project with whoever is working on the bug.

The failure happened in Google AI Platform when training the model after about 5 hours

downloaded-logs-20210618-235625.csv

Tried again with TPUs, but also errored out

downloaded-logs-20210619-002835.csv

I can't tell if AI Platform is using the GPUs properly because when training, the GPU page show that the utilization is 0%, where as the CPU page shows close to 80%. I've uploaded screenshots for the team.

Screen Shot 2021-06-19 at 8 32 57 AM
Screen Shot 2021-06-19 at 8 31 19 AM

Hi my name is Aarushi Soni . I want to contribute to this issue . Is this issue still open ? I am first time contributor . Please guide me through this process.