Running out of RAM on cloud TPU when reading data from Cloud Storage
izmailovpavel opened this issue · comments
Hi! I am trying to run the vit_s16_i1k.py script on a TPU-v3-8 machine. I put the data in a Google Cloud Storage bucket and am running the following command:
TFDS_DATA_DIR=gs://bucket-name/ python3 -m big_vision.train --config big_vision/configs/vit_s16_i1k.py --workdir workdirs/i1k_training_`date '+%m-%d_%H%M'`
The training runs for a few iterations and then fails with a killed message. Watching htop, the memory used by the process grows all the way up to the 335G available before the process crashes.
I have been able to work around this issue by creating a data disk, mounting it on the TPU VM and putting the data there. In that case the same process only uses 205G of RAM and runs normally.
Try reducing your batch size. Your workaround of mounting a data disk on the TPU VM seems to have alleviated the issue by reducing memory usage, so lowering the batch size may help for the same reason.
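For reference, a minimal sketch of that suggestion, assuming the config structure used in big_vision configs (the exact field name `batch_size` and default value should be checked against your copy of `vit_s16_i1k.py`):

```python
# Inside big_vision/configs/vit_s16_i1k.py (illustrative; verify the
# field name and default against your version of the config):
config.batch_size = 512  # e.g. halve the batch size to reduce memory pressure
```

With ml_collections config flags you can usually also override this at launch time without editing the file, e.g. appending `--config.batch_size=512` to the training command.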
Only saw this now. Knowing Pavel, I think it's not relevant to him anymore, but for reference, here are two more options, at the slight expense of a little speed:
- Set `cache_raw` to False in the config.
- For all evaluators, set `cache_final` to False in their config.
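A sketch of what those two changes look like in the config file, assuming the field layout used by big_vision configs (the names `cache_raw`, `evals`, and `cache_final` follow the repo's conventions, but treat the exact paths as illustrative and check them against your config version):

```python
# Inside big_vision/configs/vit_s16_i1k.py (illustrative field paths):

# Don't hold the raw decoded training dataset in host RAM;
# data is then re-read/decoded from storage each epoch (slower, less RAM).
config.cache_raw = False

# Likewise, stop each evaluator from caching its final preprocessed data.
for eval_name in config.evals:
    config.evals[eval_name].cache_final = False
```

Both switches trade a little input-pipeline speed for a much smaller host-memory footprint, which matches the slowdown caveat mentioned above.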