CUDA out of memory

Question

khcf123 opened this issue 2 years ago · comments

Hi, I'm running Windows 10 with the latest StarCraft II. The error below appears when I run "python run.py":

(torch_1_5) PS C:\Users\alexa\Downloads\mini-AlphaStar-main> python run.py
pygame 2.1.2 (SDL 2.0.18, Python 3.7.11)
Hello from the pygame community. https://www.pygame.org/contribute.html
run init
cudnn available
cudnn version 7604
initialed player
initialed teacher
start_time before training: 2022-01-01 18:11:32
map name: Simple64
player.name: MainPlayer
player.race: Race.protoss
start_time before reset: 2022-01-01 18:13:12
total_episodes: 1
start_episode_time before is_final: 2022-01-01 18:13:13
ActorLoop.run() Exception cause return, Detials of the Exception: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 2.00 GiB total capacity; 1.27 GiB already allocated; 0 bytes free; 1.33 GiB reserved in total by PyTorch)
Traceback (most recent call last):
File "C:\Users\alexa\Downloads\mini-AlphaStar-main\alphastarmini\core\rl\rl_vs_computer_wo_replay.py", line 213, in run
player_step = self.player.agent.step_from_state(state, player_memory)
File "C:\Users\alexa\Downloads\mini-AlphaStar-main\alphastarmini\core\rl\alphastar_agent.py", line 235, in step_from_state
hidden_state=hidden_state)
File "C:\Users\alexa\Downloads\mini-AlphaStar-main\alphastarmini\core\arch\agent.py", line 299, in action_logits_by_state
return_logits = True)
File "C:\Users\alexa\Downloads\mini-AlphaStar-main\alphastarmini\core\arch\arch_model.py", line 134, in forward
entity_embeddings, embedded_entity, entity_nums = self.entity_encoder(state.entity_state)
File "C:\Users\alexa\anaconda3\envs\torch_1_5\lib\site-packages\torch\nn\modules\module.py", line 550, in call
result = self.forward(*input, **kwargs)
File "C:\Users\alexa\Downloads\mini-AlphaStar-main\alphastarmini\core\arch\entity_encoder.py", line 390, in forward
unit_types_one = torch.nonzero(batch, as_tuple=True)[-1]
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 2.00 GiB total capacity; 1.27 GiB already allocated; 0 bytes free; 1.33 GiB reserved in total by PyTorch)

run over

Can I know what the problem is here and what the solution is? Thanks.

liuruoze · Answer 1 · Sat Jan 01 2022 20:12:38 GMT+0800 (China Standard Time)

Yes, this is because your GPU does not have enough memory.

To fix the problem, try decreasing the numbers on line 199 of

alphastarmini/lib/hyper_parameters.py

MiniStar_Arch_Hyper_Parameters = ArchHyperParameters(batch_size=int(32 * 1.5 / P.Batch_Scale), sequence_length=int(32 * 8 / P.Seq_Scale),

batch_size and sequence_length can be set to smaller values so that the model fits in your GPU memory (you should also check the values of Batch_Scale and Seq_Scale defined in param.py).
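To see how the two scale factors shrink these values, here is a minimal sketch that reproduces only the arithmetic from the line quoted above. The function name and the scale values in the loop are hypothetical and chosen for illustration; the actual defaults of Batch_Scale and Seq_Scale live in param.py and are not shown in this thread.

```python
def scaled_hyper_parameters(batch_scale, seq_scale):
    """Reproduce the arithmetic from the quoted hyper_parameters.py line.

    batch_scale and seq_scale stand in for P.Batch_Scale and P.Seq_Scale.
    """
    batch_size = int(32 * 1.5 / batch_scale)
    sequence_length = int(32 * 8 / seq_scale)
    return batch_size, sequence_length


# Hypothetical scale values, not the defaults from param.py.
for scale in (1, 2, 4, 8, 16):
    batch_size, sequence_length = scaled_hyper_parameters(scale, scale)
    print(f"scale={scale}: batch_size={batch_size}, sequence_length={sequence_length}")
```

Larger scale factors give smaller batches and shorter sequences, which should lower memory use. This sketch does not predict which setting fits in a given amount of GPU memory; that has to be found by trial.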

Alternatively, you can run the program on the CPU on your laptop and switch to the GPU when you move to a server.

To switch from GPU to CPU, change the value on line 2 of

run.py

USED_DEVICES = "0"

to

USED_DEVICES = "-1"
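As background on why "-1" would select the CPU: a common convention, assumed here and not confirmed from run.py itself, is to write the string into the CUDA_VISIBLE_DEVICES environment variable before CUDA is initialized. With "-1" no GPU is visible, so PyTorch falls back to the CPU. The helper name resolve_device below is hypothetical.

```python
import os

USED_DEVICES = "-1"  # value from this thread; "0" would expose the first GPU

# Assumption: the setting is forwarded to CUDA_VISIBLE_DEVICES.
# This must happen before CUDA is initialized to take effect.
os.environ["CUDA_VISIBLE_DEVICES"] = USED_DEVICES


def resolve_device():
    """Return "cuda:0" if PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment.
        return "cpu"
    return "cuda:0" if torch.cuda.is_available() else "cpu"


device = resolve_device()
print("running on:", device)
```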

Hope this solves your problem.

khcf123 · Answer 2 · Sat Jan 01 2022 21:22:57 GMT+0800 (China Standard Time)

Thanks for your reply

My laptop GPU has 2 GB of memory.
What values of batch_size and sequence_length (and of Batch_Scale and Seq_Scale in param.py) would be small enough to fit in my GPU memory?

Can you please advise me? Thanks!!

khcf123 · Answer 3 · Sat Jan 01 2022 21:52:10 GMT+0800 (China Standard Time)

Yes, I got it to work after restarting the laptop and changing USED_DEVICES = "0" to USED_DEVICES = "-1".

But now another error comes up:

ActorLoop.run() Exception cause return, Detials of the Exception: The game didn't advance to the expected game loop. Expected: 2712, got: 2709
Traceback (most recent call last):
File "C:\Users\alexa\Downloads\mini-AlphaStar-main\alphastarmini\core\rl\rl_vs_computer_wo_replay.py", line 253, in run
timesteps = env.step(env_actions, step_mul=STEP_MUL) # STEP_MUL step_mul
File "C:\Users\alexa\anaconda3\envs\torch_1_5\lib\site-packages\pysc2\lib\stopwatch.py", line 212, in _stopwatch
return func(*args, **kwargs)
File "C:\Users\alexa\anaconda3\envs\torch_1_5\lib\site-packages\pysc2\env\sc2_env.py", line 548, in step
return self._step(step_mul)
File "C:\Users\alexa\anaconda3\envs\torch_1_5\lib\site-packages\pysc2\env\sc2_env.py", line 565, in _step
return self._observe(target_game_loop=target_game_loop)
File "C:\Users\alexa\anaconda3\envs\torch_1_5\lib\site-packages\pysc2\env\sc2_env.py", line 670, in _observe
self._get_observations(target_game_loop)
File "C:\Users\alexa\anaconda3\envs\torch_1_5\lib\site-packages\pysc2\env\sc2_env.py", line 645, in _get_observations
"Expected: %s, got: %s") % (target_game_loop, game_loop))
ValueError: The game didn't advance to the expected game loop. Expected: 2712, got: 2709

run over

liuruoze · Answer 4 · Sun Jan 02 2022 08:09:22 GMT+0800 (China Standard Time)

Yes, this problem occasionally happens with SC2 on Windows (it is rare, and I actually don't know the cause). However, it is unrelated to the current issue and should be discussed separately. Please open a new issue; I will close this one for you.