Errors training BigGAN on TPU pods

Question

apoorvkh opened this issue 5 years ago · comments

I'm trying to train BigGAN on a 128x128 image dataset with the implementation here on a v2-128 pod, but I keep hitting errors that vary from run to run (highlights listed below) after the first "Dequeue next (500) batch(es) of data from outfeed" message. They persist even when I reduce the batch size from 2048 to 1024, reduce the iterations per run, etc. They don't occur when training on v2-8 or v3-8 TPUs. Have you ever encountered these when training on pods, if that is indeed the cause? Thanks!

Error recorded from infeed: Unable to enqueue when not opened
Caused by op u'input_pipeline_task0/while/InfeedQueue/enqueue/2'
Error recorded from outfeed: Step was cancelled by an explicit call to Session::Close().

Apoorv Khandelwal · Answer 1 · Tue Mar 19 2019 12:12:51 GMT+0800 (China Standard Time)

I most likely hit these issues because of malformed input data; the TPU errors just weren't informative. Please ignore.
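
For anyone who lands here with the same infeed/outfeed errors: since malformed input can apparently surface as these opaque messages, it may help to sanity-check examples on CPU before launching on a pod. The sketch below is only an illustration, not the check that was actually run here. It assumes each decoded example is a dict with a raw uint8 RGB `image` (128x128x3 bytes) and an integer `label`; the names `validate_example`, `find_bad_examples`, and `NUM_CLASSES` are hypothetical and should be adapted to your own dataset format.

```python
from typing import Iterable, List, Mapping, Tuple

# Assumptions for illustration only: raw uint8 RGB images at 128x128
# and integer class labels in [0, NUM_CLASSES).
IMAGE_SHAPE = (128, 128, 3)
NUM_CLASSES = 1000  # hypothetical; set this to your dataset's class count


def validate_example(example: Mapping[str, object]) -> List[str]:
    """Return a list of problems found in one decoded example (empty if none)."""
    problems = []
    expected_bytes = IMAGE_SHAPE[0] * IMAGE_SHAPE[1] * IMAGE_SHAPE[2]

    image = example.get("image")
    if not isinstance(image, (bytes, bytearray)):
        problems.append("image is missing or is not raw bytes")
    elif len(image) != expected_bytes:
        problems.append(
            "image has %d bytes, expected %d" % (len(image), expected_bytes)
        )

    label = example.get("label")
    # bool is a subclass of int in Python, so reject it explicitly.
    if not isinstance(label, int) or isinstance(label, bool):
        problems.append("label is missing or is not an integer")
    elif not 0 <= label < NUM_CLASSES:
        problems.append("label %d is outside [0, %d)" % (label, NUM_CLASSES))

    return problems


def find_bad_examples(
    examples: Iterable[Mapping[str, object]],
) -> List[Tuple[int, List[str]]]:
    """Return (index, problems) for every example that fails validation."""
    bad = []
    for index, example in enumerate(examples):
        problems = validate_example(example)
        if problems:
            bad.append((index, problems))
    return bad
```

Running `find_bad_examples` over a sample of the dataset on a local machine reports the index and reason for each failing example, which is usually easier to act on than the TPU-side messages.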

Marvin · Answer 2 · Tue Mar 19 2019 19:59:08 GMT+0800 (China Standard Time)

Thank you for investigating this.
While we don't test all TPU configurations, it should work with v2 (provided the model fits in memory).

Please file a bug against TensorFlow for a more informative error message.