google-research / big_vision

Official codebase used to develop Vision Transformer, SigLIP, MLP-Mixer, LiT and more.


Running out of RAM on cloud TPU when reading data from Cloud Storage

izmailovpavel opened this issue · comments

Hi! I am trying to run the vit_s16_i1k.py script on a TPU-v3-8 machine. I put the data in a Google Cloud Storage bucket and I am running the following command:

TFDS_DATA_DIR=gs://bucket-name/ python3 -m big_vision.train --config big_vision/configs/vit_s16_i1k.py --workdir workdirs/i1k_training_`date '+%m-%d_%H%M'`

The training runs for a few iterations and then fails with a Killed message. When I look at the htop output, the memory used by the process grows all the way to the 335G available before the process crashes.

I have been able to work around this issue by creating a data disk, mounting it on the TPU VM, and putting the data there. In that case the same process uses only 205G of RAM and runs normally.

Try reducing your batch size: your workaround of using a data disk mounted on the TPU VM seems to have alleviated the issue by reducing memory usage.
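To see why a smaller batch reduces host memory pressure, here is a rough, hedged back-of-the-envelope sketch of the RAM taken by one batch of decoded ImageNet images (the 224x224x3 float32 shape and the batch sizes are assumptions for illustration, not values read from the config):

```python
# Rough estimate of host RAM held by one batch of decoded images
# (224x224x3 float32). Shapes and batch sizes are illustrative assumptions.
def batch_bytes(batch_size, h=224, w=224, c=3, dtype_bytes=4):
    return batch_size * h * w * c * dtype_bytes

gib = 1024 ** 3
print(f"batch=1024: {batch_bytes(1024) / gib:.2f} GiB")
print(f"batch=512:  {batch_bytes(512) / gib:.2f} GiB")
```

Halving the batch size halves this per-batch footprint; with prefetching and caching on top, the savings multiply.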

Only saw this now. Knowing Pavel, I think it's not relevant to him anymore, but for reference, here are two more options, at a slight cost in speed:

  1. Set cache_raw to False in the config:
    config.input.cache_raw = False # the default True needs up to 120GB of RAM!
  2. For all evaluators, in their config, set cache_final to False:
    cache_final=False, cache_raw=False, prefetch=1,
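Putting the two options together, here is a hedged sketch of applying both overrides. Real big_vision configs are ml_collections.ConfigDict objects; a plain dict stands in here so the snippet is self-contained, and the exact field layout is an assumption against your checkout of the repo:

```python
# Sketch: memory-saving overrides for a big_vision-style config.
# A plain dict stands in for ml_collections.ConfigDict; field names
# mirror the lines quoted above but should be checked against your config.

def apply_memory_savers(config):
    # 1. Don't cache raw (undecoded) examples in host RAM.
    config["input"]["cache_raw"] = False  # True can need up to 120GB of RAM
    # 2. Don't cache the final preprocessed data for any evaluator.
    for ev in config["evals"].values():
        ev["cache_final"] = False
    return config

config = {
    "input": {"cache_raw": True},
    "evals": {"val": {"cache_final": True, "cache_raw": False, "prefetch": 1}},
}
config = apply_memory_savers(config)
```

Both switches trade a bit of input-pipeline speed for a much smaller resident-memory footprint, which is usually the right trade when the host is being OOM-killed.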