When running `/scripts/gtav_multi.sh` I get `RuntimeError`
jakubLangr opened this issue · comments
When running `/scripts/gtav_multi.sh` I get `RuntimeError`:
[...]
optG_graphic, Include: graphics_renderer.output_layer.0.1.bias
setting up dataset
Start epoch 0...
Traceback (most recent call last):
File "main_parallel.py", line 284, in <module>
train_gamegan(opts.gpu, opts)
File "main_parallel.py", line 190, in train_gamegan
trainer.generator_trainstep(states, actions, warm_up=warm_up, epoch=epoch)
File "/home/jupyter/GANTheftAuto/trainer.py", line 129, in generator_trainstep
gout = self.netG(self.zdist, states, gen_actions, warm_up, train=train, epoch=epoch)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/jupyter/GANTheftAuto/simulator_model/dynamics_engine.py", line 256, in forward
hiddens, init_maps= self.run_warmup(zdist, states, actions, warm_up, train=train)
File "/home/jupyter/GANTheftAuto/simulator_model/dynamics_engine.py", line 235, in run_warmup
batch_size, prev_read_v, prev_alpha, M, zdist, step=i, force_sharp=force_sharp)
File "/home/jupyter/GANTheftAuto/simulator_model/dynamics_engine.py", line 175, in run_step
s = self.simple_enc(state)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/container.py", line 119, in forward
input = module(input)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/jupyter/GANTheftAuto/simulator_model/model_utils.py", line 30, in forward
return tensor.view(self.size)
RuntimeError: shape '[-1, 3136]' is invalid for input of size 184320
This is after:
- Removing `logdir`, which was not recognized as an argument
- Installing and validating PyTorch 1.8.0, which seems to be the choice of this repo, so I doubt it is that
- Trying to adjust the resolution to match the input size (184320 = e.g. 640x288)
- Searching through the codebase for either number
What is strange is not that the search turns up nothing, but that there is a shape mismatch despite using the default parameters. The number we are reshaping to (3136) does not clearly relate to anything, although it is roughly on the same order as the resolution (80x48 = 3840).
This is made somewhat worse by the fact that
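For reference, the error can be reproduced with plain arithmetic: `tensor.view(-1, 3136)` only succeeds when the total number of elements is a multiple of 3136. A minimal sketch using the two numbers from the traceback (the helper name `can_view_as` is made up for illustration and is not part of the repo):

```python
def can_view_as(total_elements: int, row_width: int) -> bool:
    """Return True if a tensor with `total_elements` entries can be
    reshaped to (-1, row_width), i.e. row_width divides the total."""
    return total_elements % row_width == 0


# Numbers taken from the RuntimeError message.
total = 184320
width = 3136

print(can_view_as(total, width))  # False
print(divmod(total, width))       # (58, 2432): non-zero remainder
print(width == 64 * 7 * 7)        # True
```

3136 factors as 64 x 7 x 7, which would be consistent with flattening a 64-channel 7x7 feature map, but that is only a guess and is not confirmed anywhere in this thread.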
Did you make any changes to this script?
Can you check `gtav_multi_demo.sh` and let us know if it works or not?
General note: this is Python 3.7 with Torch from Conda.
We may wind up needing version info for at least Torch, and we might need to add a minimum version for it.
It should work with Torch 1.4.0+, not sure about earlier versions.
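If a minimum version does get added, a check along these lines could run at startup. This is only a sketch: `parse_version` and `meets_minimum` are hypothetical helpers, and the `1.4.0` default simply mirrors the figure mentioned above rather than a tested lower bound.

```python
import re


def parse_version(version: str) -> tuple:
    """Turn '1.8.0' or '1.8.0+cu111' into a tuple of ints such as (1, 8, 0)."""
    core = version.split("+", 1)[0]  # drop a local build suffix like '+cu111'
    parts = []
    for piece in core.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = "1.4.0") -> bool:
    """Compare versions numerically, so that 1.10.0 counts as newer than 1.4.0."""
    return parse_version(installed) >= parse_version(minimum)


print(meets_minimum("1.8.0"))        # True
print(meets_minimum("1.3.1"))        # False
print(meets_minimum("1.8.0+cu111"))  # True
```

In a real script the installed version would come from `torch.__version__`.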
Thanks for your responses.
Torch was 1.8.0, installed via conda (I didn't mention that part).
No changes to the python code.
Yes, `gtav_multi_demo.sh` seems to work; I think it is an issue with `data/gtav/gtagan_1`. Is there any chance I can get a subset of the dataset uploaded somewhere? Happy to give you an S3 destination you can just push to.
Can you also:
- clarify whether the parameters in `gtav_multi.sh` were the ones used to reproduce the demo?
- clarify whether you ever managed to get the code running in a distributed way across multiple GPUs? It seems that the `--bs` arg only pushes the data onto the first GPU
Well, you are supposed to update this path to point to where your dataset is, as well as the other settings. There is no documentation from Nvidia for GameGAN, so you're on your own, as are we, but we can help as much as we can. You can also look at the config file, which should give you some hints. Beyond that, we are just reading the code to work out which settings do what.
As for training data, there is a whole section about it in the README, please take a look. The demo script is prepared to work out of the box; the others you need to configure.
No, this file is not what we used, but this demo is very close. Yes, we trained on multiple GPUs.
It looks like you're asking for some data, @jakubLangr.
We included sample data here: https://github.com/Sentdex/GANTheftAuto/tree/main/data/gtav/gtagan_2_sample
So you could just change `gtagan_1` to `gtagan_2_sample` for this problem.
> clarify whether the parameters in `gtav_multi.sh` were the ones used to reproduce the demo?

The batch size should be roughly as big as you can fit per GPU (more research needs to be done to find the optimal batch size here), and `num_gpu` should be however many GPUs you have. For the rest of the settings, yes, those are the ones chosen for the demos in the video.
> clarify whether you ever managed to get the code running in a distributed way across multiple GPUs? It seems that the `--bs` arg only pushes the data onto the first GPU

Did you change `--num_gpu` to 4? If it was left at 1, that would explain why `--bs` only went to a single GPU.
These models were trained across 4x A100 80GB GPUs on the same machine. I had to add some code to disregard the display GPU on the DGX station, otherwise yep, same settings/code.
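One common way to hide a display GPU from a training process is the `CUDA_VISIBLE_DEVICES` environment variable. The sketch below is not the code used for these runs. The helper `visible_devices` and the assumption that the display card is index 4 out of 5 are purely illustrative, so check `nvidia-smi` for the real layout.

```python
import os


def visible_devices(total_gpus: int, skip: set) -> str:
    """Build a CUDA_VISIBLE_DEVICES value listing every GPU index
    from 0 to total_gpus - 1 except those in `skip`."""
    return ",".join(str(i) for i in range(total_gpus) if i not in skip)


# Hypothetical layout: 5 GPUs in total, index 4 drives the display.
os.environ["CUDA_VISIBLE_DEVICES"] = visible_devices(5, skip={4})
print(os.environ["CUDA_VISIBLE_DEVICES"])  # 0,1,2,3
```

The variable has to be set before CUDA is initialized in the process, so this would go at the very top of the entry script or in the shell that launches it.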
Hi both, thank you for your responses, very helpful.
I think it is now all working and I got it to train in some minimal version. I will dig deeper for what else I can do with it. The YT video looked very cool!
Yes, I was wondering whether, if I gave you an S3 link to push to, you would be so kind as to upload a bit more of the data (~10%). But no worries if not; I understand you don't want to self-host it.