tensorflow / tpu

Reference models and tools for Cloud TPUs.

Home Page: https://cloud.google.com/tpu/

finetuning on v4-32 dies suddenly

shankerabhigyan opened this issue

Fine-tuning Gemma dies suddenly after about 12 hours. There are no warnings or error messages in the output logs; the process is simply killed.
The script from https://ai.google.dev/gemma/docs/distributed_tuning was being run with nohup.

What are some possible debugging steps, or is this a server-side problem?
I experienced the same behaviour on v3-8 devices.
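
One possible first debugging step, given that the logs show nothing, is to record how the process actually ended. The following is a minimal sketch, not part of the Gemma tutorial: it assumes the fine-tuning script can be launched as a child process on a POSIX host, and the names `describe_exit`, `run_and_log`, `finetune.py`, and `exit_status.log` are hypothetical. A result of `SIGKILL` would be consistent with the Linux out-of-memory killer (the kernel log, e.g. `dmesg`, would show that), but it does not by itself prove it.

```python
import signal
import subprocess
import sys
import time


def describe_exit(returncode: int) -> str:
    """Turn a subprocess return code into a human-readable explanation."""
    if returncode < 0:
        # On POSIX, a negative return code means the child was killed by a signal.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            # Signal number without a named constant on this platform.
            name = f"signal {-returncode}"
        return f"killed by signal {name}"
    return f"exited with status {returncode}"


def run_and_log(cmd: list, log_path: str) -> int:
    """Run cmd to completion and append how it ended to log_path."""
    start = time.time()
    proc = subprocess.run(cmd)
    elapsed = time.time() - start
    with open(log_path, "a") as log:
        log.write(
            f"{' '.join(cmd)}: {describe_exit(proc.returncode)} "
            f"after {elapsed:.0f}s\n"
        )
    return proc.returncode


# Hypothetical usage (finetune.py stands in for the actual training script):
#   run_and_log([sys.executable, "finetune.py"], "exit_status.log")
```

Running the wrapper itself under nohup keeps the original setup while leaving a one-line record of the exit status or terminating signal once the job stops.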