When running `/scripts/gtav_multi.sh` I get `RuntimeError`

jakubLangr opened this issue 3 years ago · comments

When running /scripts/gtav_multi.sh I get RuntimeError:

[...]
optG_graphic, Include: graphics_renderer.output_layer.0.1.bias
setting up dataset
Start epoch 0...
Traceback (most recent call last):
  File "main_parallel.py", line 284, in <module>
    train_gamegan(opts.gpu, opts)
  File "main_parallel.py", line 190, in train_gamegan
    trainer.generator_trainstep(states, actions, warm_up=warm_up, epoch=epoch)
  File "/home/jupyter/GANTheftAuto/trainer.py", line 129, in generator_trainstep
    gout = self.netG(self.zdist, states, gen_actions, warm_up, train=train, epoch=epoch)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/jupyter/GANTheftAuto/simulator_model/dynamics_engine.py", line 256, in forward
    hiddens, init_maps= self.run_warmup(zdist, states, actions, warm_up, train=train)
  File "/home/jupyter/GANTheftAuto/simulator_model/dynamics_engine.py", line 235, in run_warmup
    batch_size, prev_read_v, prev_alpha, M, zdist, step=i, force_sharp=force_sharp)
  File "/home/jupyter/GANTheftAuto/simulator_model/dynamics_engine.py", line 175, in run_step
    s = self.simple_enc(state)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/container.py", line 119, in forward
    input = module(input)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/jupyter/GANTheftAuto/simulator_model/model_utils.py", line 30, in forward
    return tensor.view(self.size)
RuntimeError: shape '[-1, 3136]' is invalid for input of size 184320

This is after:

Removing logdir, which was not recognized as an argument
Installing and validating PyTorch 1.8.0, which seems to be the version this repo uses, so I doubt that is the cause
Trying to adjust the resolution to match the input size (184320 = e.g. 640x288)
Searching through the codebase for either number

What is strange is not that the search found nothing, but that there is a shape mismatch despite using the default parameters. The number we are reshaping to (3136) does not clearly relate to anything, although it is roughly on the same order as the resolution (80x48 = 3840).
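For anyone trying to reason about the two numbers, here is a minimal sketch of the arithmetic in plain Python (no Torch needed). It only shows why `view(-1, 3136)` must fail for 184320 elements and which square feature-map shapes multiply out to 3136. The helper names `view_is_valid` and `square_maps` are made up for illustration and are not part of the repo. Reading 3136 as 64 channels of 7x7 is an assumption about the encoder, not something confirmed from the code.

```python
TOTAL = 184320   # element count reported in the RuntimeError
TARGET = 3136    # row width requested by tensor.view([-1, 3136])


def view_is_valid(total, width):
    """A (-1, width) view only works if width divides the element count."""
    return total % width == 0


def square_maps(width):
    """List (channels, side) pairs with channels * side * side == width."""
    pairs = []
    for side in range(1, int(width ** 0.5) + 1):
        if width % (side * side) == 0:
            pairs.append((width // (side * side), side))
    return pairs


if __name__ == "__main__":
    print(view_is_valid(TOTAL, TARGET))   # is the reshape possible at all?
    print(TOTAL % TARGET)                 # leftover elements
    print(square_maps(TARGET))            # candidate (channels, side) shapes
    print(TOTAL % (80 * 48), TOTAL // (80 * 48))
```

184320 is not a multiple of 3136, so no batch size can make the reshape valid. It is, however, an exact multiple of 80x48 = 3840 (48 times). That would be consistent with the element count covering batch and channel dimensions rather than a single 640x288 frame, but that interpretation is a guess.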

This is made somewhat worse by the fact that

Daniel Kukieła · Answer 1 · Sat Jun 19 2021 05:09:30 GMT+0800 (China Standard Time)

Any changes that you made to this script?

Can you check gtav_multi_demo.sh and let us know if it works or not?

Harrison · Answer 2 · Sat Jun 19 2021 05:12:27 GMT+0800 (China Standard Time)

General note: Python 3.7 on Conda Torch.

May wind up needing some version info for at least Torch and we might need to add a minimum for it.

Daniel Kukieła · Answer 3 · Sat Jun 19 2021 05:23:00 GMT+0800 (China Standard Time)

It should work with Torch 1.4.0+, not sure about earlier versions.

Jakub Langr · Answer 4 · Sat Jun 19 2021 20:00:03 GMT+0800 (China Standard Time)

Thanks for your responses.

The torch was 1.8.0 installed via conda (didn't mention that part).

No changes to the python code.

Yes. gtav_multi_demo.sh seems to work; I think it is an issue with data/gtav/gtagan_1. Is there any chance I can get a subset of the dataset uploaded somewhere? Happy to give you an S3 destination you can just push to.

Can you also:

clarify whether the gtav_multi.sh were the parameters to reproduce the demo?
clarify whether you ever managed to get the code running in a distributed way across multiple GPUs? It seems that the --bs arg only pushes the data onto first GPU

Daniel Kukieła · Answer 5 · Sat Jun 19 2021 20:09:21 GMT+0800 (China Standard Time)

Well, you are supposed to update this path to point to where your dataset is, as well as the other settings. There is no documentation from Nvidia for GameGAN; you're on your own, as we are, but we will help as much as we can. You can also look at the config file, which should give you some hints. Beyond that, we are just reading the code to work out which settings do what.

As for training data, there is a whole section about it in the README; please take a look. The demo script is prepared to work out of the box; the others you need to configure.

No, this file is not what we used, but this demo is very close. Yes, we trained on multiple GPUs.

Harrison · Answer 6 · Sat Jun 19 2021 22:22:45 GMT+0800 (China Standard Time)

It looks like you're asking for some data @jakubLangr

We included sample data here: https://github.com/Sentdex/GANTheftAuto/tree/main/data/gtav/gtagan_2_sample

So you could just change gtagan_1 to gtagan_2_sample for this problem.
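If you would rather script that change than edit by hand, a small sketch along these lines should do it. The two directory names come from this thread; the helper name `use_sample_dataset` and the example line `DATASET=...` are hypothetical, since the actual contents of the script are not shown here.

```python
import re

OLD_DIR = "data/gtav/gtagan_1"          # path mentioned earlier in the thread
NEW_DIR = "data/gtav/gtagan_2_sample"   # sample data linked above


def use_sample_dataset(script_text):
    """Return (new_text, replacements) with OLD_DIR swapped for NEW_DIR.

    The trailing word boundary keeps longer names such as
    data/gtav/gtagan_10 from being rewritten by accident.
    """
    pattern = re.compile(re.escape(OLD_DIR) + r"\b")
    return pattern.subn(NEW_DIR, script_text)


if __name__ == "__main__":
    # Hypothetical line; the real script may reference the path differently.
    example = "DATASET=data/gtav/gtagan_1\n"
    new_text, count = use_sample_dataset(example)
    print(count, new_text, end="")
```

To apply it to the real file, read `scripts/gtav_multi.sh`, pass its text through the function, and write the result back only if the replacement count is what you expect.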

> clarify whether the gtav_multi.sh were the parameters to reproduce the demo?

The batch size should be ~ as big as you can fit per GPU (more research needs to be done to find optimal batch size here), and num_gpu should be however many gpus you have. For the rest of the settings, yes, those are the chosen ones for the demos in the video.

> clarify whether you ever managed to get the code running in a distributed way across multiple GPUs? It seems that the --bs arg only pushes the data onto first GPU

Did you change --num_gpu to 4? If it was left at 1, that would explain why --bs only went to a single GPU.

These models were trained across 4x A100 80GB GPUs on the same machine. I had to add some code to disregard the display GPU on the DGX station, otherwise yep, same settings/code.

Jakub Langr · Answer 7 · Mon Jun 21 2021 01:28:06 GMT+0800 (China Standard Time)

Hi both, thank you for your responses, very helpful.

I think it is now all working and I got it to train in some minimal version. I will dig deeper for what else I can do with it. The YT video looked very cool!

Yes, I was wondering whether, if I gave you an S3 link, you would be so kind as to upload a bit more of the data (~10%). But no worries if not; I understand you don't want to self-host it.