RuntimeError: CUDA out of memory with cifar10 in data_parallel example
fuhengwu2021 opened this issue
Fuheng Wu commented
🐛 Describe the bug
I am trying to run train_with_cifar10.py from https://github.com/hpcaitech/ColossalAI-Examples/tree/main/image/vision_transformer/data_parallel
My command:
colossalai run --nproc_per_node 2 train_with_cifar10.py --config config.py
I have 7 GPUs, each with 16 GB of memory.
The error traceback is:
...
RuntimeError: CUDA out of memory. Tried to allocate 296.00 MiB (GPU 1; 15.78 GiB total capacity; 13.75 GiB already
allocated; 232.19 MiB free; 13.88 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try
setting max_split_size_mb to avoid fragmentation.
See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
File "train_with_cifar10.py", line 71, in <module>
main()
File "train_with_cifar10.py", line 62, in main
trainer.fit(train_dataloader=train_dataloader,
File "/home/wfh/.local/lib/python3.8/site-packages/colossalai/trainer/_trainer.py", line 321, in fit
self._train_epoch(
File "/home/wfh/.local/lib/python3.8/site-packages/colossalai/trainer/_trainer.py", line 181, in _train_epoch
logits, label, loss = self.engine.execute_schedule(
File "/home/wfh/.local/lib/python3.8/site-packages/colossalai/engine/_base_engine.py", line 201, in execute_schedule
output, label, loss = self._schedule.forward_backward_step(self, data_iter, **kwargs)
File "/home/wfh/.local/lib/python3.8/site-packages/colossalai/engine/schedule/_non_pipeline_schedule.py", line 78, in forward_backward_step
output = self._call_engine(engine, data)
File "/home/wfh/.local/lib/python3.8/site-packages/colossalai/engine/schedule/_base_schedule.py", line 109, in _call_engine
return engine(inputs)
File "/home/wfh/.local/lib/python3.8/site-packages/colossalai/engine/_base_engine.py", line 186, in __call__
return self.model(*args, **kwargs)
File "/home/wfh/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/wfh/.local/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1008, in forward
output = self._run_ddp_forward(*inputs, **kwargs)
File "/home/wfh/.local/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 969, in _run_ddp_forward
return module_to_run(*inputs[0], **kwargs[0])
File "/home/wfh/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/wfh/.local/lib/python3.8/site-packages/torch/amp/autocast_mode.py", line 12, in decorate_autocast
return func(*args, **kwargs)
File "/home/wfh/.local/lib/python3.8/site-packages/colossalai/amp/torch_amp/torch_amp.py", line 79, in forward
return self.model(*args, **kwargs)
File "/home/wfh/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/wfh/.local/lib/python3.8/site-packages/timm/models/vision_transformer.py", line 465, in forward
x = self.forward_features(x)
File "/home/wfh/.local/lib/python3.8/site-packages/timm/models/vision_transformer.py", line 454, in forward_features
x = self.blocks(x)
File "/home/wfh/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/wfh/.local/lib/python3.8/site-packages/torch/nn/modules/container.py", line 139, in forward
input = module(input)
File "/home/wfh/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/wfh/.local/lib/python3.8/site-packages/timm/models/vision_transformer.py", line 243, in forward
x = x + self.drop_path2(self.ls2(self.mlp(self.norm2(x))))
File "/home/wfh/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/wfh/.local/lib/python3.8/site-packages/timm/models/layers/mlp.py", line 29, in forward
x = self.drop1(x)
File "/home/wfh/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/wfh/.local/lib/python3.8/site-packages/torch/nn/modules/dropout.py", line 58, in forward
return F.dropout(input, self.p, self.training, self.inplace)
File "/home/wfh/.local/lib/python3.8/site-packages/torch/nn/functional.py", line 1252, in dropout
return _VF.dropout_(input, p, training) if inplace else _VF.dropout(input, p, training)
...
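The error message itself points at one mitigation: when reserved memory is much larger than allocated memory, fragmentation is the likely culprit, and the allocator's max_split_size_mb option can help. A first thing to try (the 128 MiB value below is an arbitrary starting point, not taken from this issue) is to pass the setting through the environment at launch:

PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 colossalai run --nproc_per_node 2 train_with_cifar10.py --config config.py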
Environment
>>> import colossalai
>>> colossalai.__version__
'0.1.9'
>>> import torch
>>> torch.__version__
'1.12.1+cu113'
GPU:
$ nvidia-smi
Wed Sep 28 16:54:08 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.73.08 Driver Version: 510.73.08 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:61:00.0 Off | 0 |
| N/A 33C P0 43W / 300W | 3MiB / 16384MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... On | 00000000:62:00.0 Off | 0 |
| N/A 31C P0 41W / 300W | 3MiB / 16384MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... On | 00000000:67:00.0 Off | 0 |
| N/A 33C P0 41W / 300W | 3MiB / 16384MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2... On | 00000000:69:00.0 Off | 0 |
| N/A 33C P0 42W / 300W | 3MiB / 16384MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 4 Tesla V100-SXM2... On | 00000000:89:00.0 Off | 0 |
| N/A 34C P0 53W / 300W | 2360MiB / 16384MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 5 Tesla V100-SXM2... On | 00000000:8A:00.0 Off | 0 |
| N/A 34C P0 56W / 300W | 4172MiB / 16384MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 6 Tesla V100-SXM2... On | 00000000:8F:00.0 Off | 0 |
| N/A 32C P0 54W / 300W | 7307MiB / 16384MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
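Note that in the snapshot above GPUs 4-6 already have memory in use by other processes, while GPUs 0-3 are idle. Since this run only needs two of the seven GPUs, explicitly pinning the job to idle devices rules out contention (CUDA_VISIBLE_DEVICES is the standard CUDA mechanism for this; whether the launcher already picked GPUs 0 and 1 is not shown in the issue):

CUDA_VISIBLE_DEVICES=0,1 colossalai run --nproc_per_node 2 train_with_cifar10.py --config config.py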
Fuheng Wu commented
Fixed it by myself.
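The author did not share the fix. For readers hitting the same error with this example, the usual remedy is to shrink the per-GPU batch size in config.py. A minimal sketch, assuming the example's config exposes a module-level BATCH_SIZE constant (the actual names and values in the repository may differ):

# config.py (hypothetical values; adjust until the per-GPU footprint fits in 16 GiB)
BATCH_SIZE = 128           # halve this (e.g. from 256) on OOM
gradient_accumulation = 2  # ColossalAI 0.1.x reads this config key to preserve the
                           # effective batch size; check the docs for your version

Halving BATCH_SIZE roughly halves activation memory in the forward pass, which is where this traceback runs out (inside the ViT MLP block).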