google-research / big_vision

Official codebase used to develop Vision Transformer, SigLIP, MLP-Mixer, LiT and more.


Running out of RAM on cloud TPU when reading data from Cloud Storage

izmailovpavel opened this issue · comments

Hi! I am trying to run the vit_s16_i1k.py script on a TPU-v3-8 machine. I put the data in a Google Cloud Storage bucket and I am running the following command:

TFDS_DATA_DIR=gs://bucket-name/ python3 -m big_vision.train --config big_vision/configs/vit_s16_i1k.py --workdir workdirs/i1k_training_`date '+%m-%d_%H%M'`

The training runs for a few iterations and then fails with a Killed message. When I look at the htop output, the memory used by the process grows all the way to the 335G available before the process crashes.

I have been able to work around this issue by creating a data disk, mounting it on the TPU VM, and putting the data there. In that case the same process uses only 205G of RAM and runs normally.

Try reducing your batch size: your workaround of using a data disk mounted on the TPU VM seems to have alleviated the issue by reducing memory usage.
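To see why a smaller batch reduces host memory pressure, here is a rough, hedged back-of-the-envelope sketch of the RAM taken by one batch of decoded ImageNet images (the 224x224x3 float32 shape and the batch sizes are assumptions for illustration, not values read from the config):

```python
# Rough estimate of host RAM held by one batch of decoded images
# (224x224x3 float32). Shapes and batch sizes are illustrative assumptions.
def batch_bytes(batch_size, h=224, w=224, c=3, dtype_bytes=4):
    return batch_size * h * w * c * dtype_bytes

gib = 1024 ** 3
print(f"batch=1024: {batch_bytes(1024) / gib:.2f} GiB")
print(f"batch=512:  {batch_bytes(512) / gib:.2f} GiB")
```

Halving the batch size halves this per-batch footprint; with prefetching and caching on top, the savings multiply.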

Only saw this now. Knowing Pavel, I think it's not relevant to him anymore, but for reference, here are two more options, at a slight cost in speed:

  1. Set cache_raw to False in the config:
    config.input.cache_raw = False # the default True needs up to 120GB of RAM!
  2. For all evaluators, in their config, set cache_final to False:
    cache_final=False, cache_raw=False, prefetch=1,
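Putting the two options together, here is a hedged sketch of applying both overrides. Real big_vision configs are ml_collections.ConfigDict objects; a plain dict stands in here so the snippet is self-contained, and the exact field layout is an assumption against your checkout of the repo:

```python
# Sketch: memory-saving overrides for a big_vision-style config.
# A plain dict stands in for ml_collections.ConfigDict; field names
# mirror the lines quoted above but should be checked against your config.

def apply_memory_savers(config):
    # 1. Don't cache raw (undecoded) examples in host RAM.
    config["input"]["cache_raw"] = False  # True can need up to 120GB of RAM
    # 2. Don't cache the final preprocessed data for any evaluator.
    for ev in config["evals"].values():
        ev["cache_final"] = False
    return config

config = {
    "input": {"cache_raw": True},
    "evals": {"val": {"cache_final": True, "cache_raw": False, "prefetch": 1}},
}
config = apply_memory_savers(config)
```

Both switches trade a bit of input-pipeline speed for a much smaller resident-memory footprint, which is usually the right trade when the host is being OOM-killed.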