google / compare_gan

Compare GAN code.

Errors training BigGAN on TPU pods

apoorvkh opened this issue

I'm trying to train the BigGAN implementation here on a 128x128 image dataset using a v2-128 pod, but I'm running into several errors that change from run to run (highlights listed below) after the first "Dequeue next (500) batch(es) of data from outfeed" message. They persist even when I reduce the batch size from 2048 to 1024, reduce the iterations per run, etc. They don't occur when training on v2-8 or v3-8 TPUs. Have you encountered these when training on pods, if that is the issue? Thanks!

  • Error recorded from infeed: Unable to enqueue when not opened
  • Caused by op u'input_pipeline_task0/while/InfeedQueue/enqueue/2'
  • Error recorded from outfeed: Step was cancelled by an explicit call to Session::Close().

I most likely hit these issues because of malformed input data, and the TPU errors were just not informative. Please ignore.
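
For anyone who lands here with the same infeed/outfeed errors, a cheap check of the input data on CPU before launching on a pod may save some time. Below is a minimal sketch, not part of compare_gan: the function name `find_malformed_examples`, the `(image_bytes, label)` record layout (raw uint8 pixels), and the default `num_classes` are all assumptions made for illustration, so adapt them to however your dataset is actually stored.

```python
from typing import Iterable, List, Tuple


def find_malformed_examples(
    examples: Iterable[Tuple[bytes, int]],
    height: int = 128,
    width: int = 128,
    channels: int = 3,
    num_classes: int = 1000,  # hypothetical default; set to your dataset's class count
) -> List[Tuple[int, str]]:
    """Return (index, reason) for every example that fails a basic sanity check.

    Assumes each example is (raw uint8 image bytes, integer label). This is an
    illustrative layout, not the format compare_gan itself reads.
    """
    expected_size = height * width * channels
    problems = []
    for index, (image_bytes, label) in enumerate(examples):
        if len(image_bytes) != expected_size:
            # Wrong byte count means the image cannot be reshaped to HxWxC.
            problems.append(
                (index, f"image has {len(image_bytes)} bytes, expected {expected_size}")
            )
        elif not 0 <= label < num_classes:
            problems.append((index, f"label {label} outside [0, {num_classes})"))
    return problems
```

Running this over the whole dataset once and fixing or dropping whatever it reports should surface a readable error on the host instead of an opaque infeed failure on the TPU.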

Thank you for investigating this.
While we don't test all TPU configurations, it should work with v2 (provided that the model fits in memory).

Please file a bug against TensorFlow for a more informative error message.