hardmaru / WorldModelsExperiments

World Models Experiments

MemoryError in vae_train.py

Chazzz opened this issue

Running python vae_train.py raises a MemoryError on my system. I felt bad about this at first, but after running the numbers, it turns out vae_train.py needs to allocate roughly 123 GB for this one array!

>>> import numpy as np
>>> M = 1000
>>> N = 10000
>>> data = np.zeros((M*N, 64, 64, 3), dtype=np.uint8)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
MemoryError
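The figure checks out: the array is M*N frames of 64x64x3 uint8, one byte per element.

>>> (1000 * 10000 * 64 * 64 * 3) / 1e9  # frames x height x width x channels, 1 byte each
122.88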

Hmm, this looks like #19. I am trying the solution suggested there.

Thanks for crunching the numbers; I had a measly 16 GB when it happened to me.

@zuoanqh Not tremendously surprising that the memory limitation shows up in both experiments. A more dynamic loading scheme would probably fix both issues, as sketched below.
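A minimal sketch of what that could look like for vae_train.py, assuming the extract step wrote one .npz file per episode with the frames under an 'obs' key (both the directory name and the key are assumptions about the data layout):

import os
import numpy as np

def episode_batches(record_dir, batch_size=100):
    # Stream batches one episode at a time instead of preallocating a
    # single (M*N, 64, 64, 3) array for the whole dataset.
    for fname in sorted(os.listdir(record_dir)):
        obs = np.load(os.path.join(record_dir, fname))['obs']
        for i in range(0, len(obs), batch_size):
            yield obs[i:i + batch_size].astype(np.float32) / 255.0

The training loop would then iterate over episode_batches('record') instead of indexing into one giant array, so peak memory stays at roughly one episode plus one batch.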

@zuoanqh Not sure how far you got on this, but I have memory-free loading (not including training) down to 1.25 hours (8 minutes per epoch) in my fork's atari/vae_train.py. I convert the episodes into uncompressed (10x speedup), individual images (100x speedup), which are then loaded in parallel (another 10x) before being fed into TensorFlow. Working in black and white (Atari only) is a further 3x improvement, but that doesn't carry over to Doom or CarRacing. The only faster alternative I can think of is to convert the frames to BMP and have TensorFlow manage the entire batching process with parallel prefetching, as sketched below.
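A rough sketch of that BMP-plus-prefetching idea using the tf.data API (this assumes a reasonably recent TensorFlow; the frames/*.bmp pattern, the 64x64 frame size, and the batch size of 100 are placeholders, not anything from the fork):

import tensorflow as tf

AUTOTUNE = tf.data.experimental.AUTOTUNE

def decode_frame(path):
    # Decoding uncompressed BMPs is nearly free compared to PNG/JPEG,
    # which is the point of converting the episodes first.
    img = tf.io.decode_bmp(tf.io.read_file(path), channels=3)
    return tf.cast(img, tf.float32) / 255.0

dataset = (tf.data.Dataset.list_files('frames/*.bmp')
           .map(decode_frame, num_parallel_calls=AUTOTUNE)  # parallel decode
           .batch(100)
           .prefetch(AUTOTUNE))  # overlap file I/O with training

Each element is then a (100, 64, 64, 3) float batch, with TensorFlow doing the batching and prefetching in background threads.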

Note that 10M uncompressed frames come to about 80 GB for single-channel and 240 GB for tri-channel images, and the conversion takes several hours. VAE training (not including loading) takes about 5 hours on my system.

@Chazzz My experiment requires transitions rather than frames, so it's taking a bit more work to upgrade without doubling disk/memory usage; I got it to work with about 1k episodes though...
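One way to get transitions without doubling storage is to keep each episode's frames as a single array and build (frame_t, action_t, frame_t+1) tuples by indexing adjacent rows; a sketch, with hypothetical per-episode obs/actions arrays:

import numpy as np

def transition_batches(obs, actions, batch_size=100):
    # Each frame is stored once; transitions are just paired slices of
    # adjacent rows, so disk/memory usage does not double.
    last = len(obs) - 1  # the final frame has no successor
    for i in range(0, last, batch_size):
        j = np.arange(i, min(i + batch_size, last))
        yield obs[j], actions[j], obs[j + 1]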

@zuoanqh Yikes, that's a lot of channels. Then again, you don't really need 10k episodes unless you're creating a baseline.