hpcaitech / ColossalAI-Examples

Examples of training models with hybrid parallelism using ColossalAI

Vision Transformer cifar10 bug

gaow0007 opened this issue · comments

πŸ› Describe the bug

When I run a ViT experiment with the following command

node=76
prefix="srun --nodes=1 --gres=gpu:4 --cpus-per-task=4 --ntasks=1 -w SG-IDC1-10-51-2-$node"
$prefix colossalai run --nproc_per_node 4  train_with_cifar10.py --config configs/vit_1d_tp2_pp2.py --host=10.51.2.$node

I got:

tensor shape 128
Traceback (most recent call last):
  File "train_with_cifar10.py", line 122, in <module>
    main()
  File "train_with_cifar10.py", line 116, in main
    engine.execute_schedule(data_iter, return_output_label=False)
  File "/mnt/lustre/wgao/miniconda3/envs/ColossalAI/lib/python3.8/site-packages/colossalai/engine/_base_engine.py", line 198, in execute_schedule
    output, label, loss = self._schedule.forward_backward_step(self, data_iter, **kwargs)
  File "/mnt/lustre/wgao/miniconda3/envs/ColossalAI/lib/python3.8/site-packages/colossalai/engine/schedule/_pipeline_schedule.py", line 303, in forward_backward_step
    input_tensor = comm.recv_forward(ft_shape,
  File "/mnt/lustre/wgao/miniconda3/envs/ColossalAI/lib/python3.8/site-packages/colossalai/communication/p2p.py", line 194, in recv_forward
    input_tensor, _ = _communicate(recv_prev=True,
  File "/mnt/lustre/wgao/miniconda3/envs/ColossalAI/lib/python3.8/site-packages/colossalai/communication/p2p.py", line 119, in _communicate
    tensor_recv_prev, recv_prev_split = create_recv_buffer_with_shapes(recv_prev_shape, dtype,
  File "/mnt/lustre/wgao/miniconda3/envs/ColossalAI/lib/python3.8/site-packages/colossalai/communication/p2p.py", line 49, in create_recv_buffer_with_shapes
    recv_chunk_shape, recv_split = _get_tensor_shape(recv_shape, scatter_gather_tensors)
  File "/mnt/lustre/wgao/miniconda3/envs/ColossalAI/lib/python3.8/site-packages/colossalai/communication/p2p.py", line 30, in _get_tensor_shape
    tensor_chunk_shape = reduce(operator.mul, tensor_shape, 1)
TypeError: reduce() arg 2 must support iteration
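
For context, p2p.py multiplies the elements of the expected tensor shape with reduce(); the error suggests the shape reached that code as a bare int (consistent with the "tensor shape 128" print above) rather than a tuple or torch.Size. A minimal sketch of that failure mode, not the actual ColossalAI code path:

import operator
from functools import reduce

# Assumption for illustration: the shape arrives as a plain int instead of a sequence.
tensor_shape = 128
# reduce(operator.mul, tensor_shape, 1)        # TypeError: reduce() arg 2 must support iteration

tensor_shape = (128,)                           # what the receive-buffer code expects
print(reduce(operator.mul, tensor_shape, 1))    # 128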

Environment

I installed ColossalAI via

pip install colossalai==0.1.6+torch1.10cu10.2 -f https://release.colossalai.org

Other environment information was collected with python -m torch.utils.collect_env:

PyTorch version: 1.11.0+cu102
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: CentOS Linux 7 (Core) (x86_64)
GCC version: (GCC) 5.3.0
Clang version: Could not collect
CMake version: version 3.19.3
Libc version: glibc-2.17

Python version: 3.8.13 (default, Mar 28 2022, 11:38:47)  [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-3.10.0-693.el7.x86_64-x86_64-with-glibc2.17
Is CUDA available: True
CUDA runtime version: 10.1.243
GPU models and configuration: 
GPU 0: Tesla V100-PCIE-32GB
GPU 1: Tesla V100-PCIE-32GB
GPU 2: Tesla V100-PCIE-32GB
GPU 3: Tesla V100-PCIE-32GB
GPU 4: Tesla V100-PCIE-32GB
GPU 5: Tesla V100-PCIE-32GB
GPU 6: Tesla V100-PCIE-32GB
GPU 7: Tesla V100-PCIE-32GB

Nvidia driver version: 470.63.01
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] colossalai==0.1.6+torch1.10cu10.2
[pip3] numpy==1.22.4
[pip3] torch==1.11.0
[pip3] torchvision==0.12.0
[conda] colossalai                0.1.6+torch1.10cu10.2          pypi_0    pypi
[conda] numpy                     1.22.4                   pypi_0    pypi
[conda] torch                     1.11.0                   pypi_0    pypi
[conda] torchvision               0.12.0                   pypi_0    pypi

I got the same problem. If I change the config file to vit_pipeline.py, the error becomes:

TypeError: layer_norm(): argument 'input' (position 1) must be Tensor, not list
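
That message usually means a list of tensors reached a LayerNorm that expects a single Tensor (for example, a pipeline stage passing its outputs along unwrapped). A minimal sketch with made-up shapes, not the example's actual model code:

import torch
import torch.nn.functional as F

x = torch.randn(2, 4)
print(F.layer_norm(x, (4,)).shape)   # works: torch.Size([2, 4])
# F.layer_norm([x], (4,))            # TypeError: layer_norm(): argument 'input'
#                                    # (position 1) must be Tensor, not list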

hpcaitech/ColossalAI#1100
This PR resolved the related bugs. You can try again with the latest main branch code.

Thanks, Liu. I pulled the latest code of ColossalAI and ColossalAI-Examples, and then I got another error from Titans:

Traceback (most recent call last):
  File "train_with_cifar10.py", line 13, in <module>
    from titans.model.vit.vit import _create_vit_model
  File "/root/conda/envs/colossalai/lib/python3.8/site-packages/titans/__init__.py", line 3, in <module>
    from . import model
  File "/root/conda/envs/colossalai/lib/python3.8/site-packages/titans/model/__init__.py", line 2, in <module>
    from . import gpt
  File "/root/conda/envs/colossalai/lib/python3.8/site-packages/titans/model/gpt/__init__.py", line 1, in <module>
    from .gpt import *
  File "/root/conda/envs/colossalai/lib/python3.8/site-packages/titans/model/gpt/gpt.py", line 6, in <module>
    from colossalai.builder.pipeline import partition_uniform
ModuleNotFoundError: No module named 'colossalai.builder.pipeline'
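
A hypothetical guard (not a fix) that makes the version mismatch explicit instead of failing deep inside Titans; the import path below is simply the one taken from the traceback:

# Hypothetical compatibility guard around the import that fails above.
try:
    from colossalai.builder.pipeline import partition_uniform   # old ColossalAI location
except ModuleNotFoundError as exc:
    raise RuntimeError(
        "This colossalai build no longer provides colossalai.builder.pipeline; "
        "update titans to a release that matches it."
    ) from exc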

Even after I solved this problem, I got another error from Titans:

Traceback (most recent call last):
  File "train_with_cifar10.py", line 119, in <module>
    main()
  File "train_with_cifar10.py", line 54, in main
    model = _create_vit_model(**model_kwargs)
  File "/root/conda/envs/colossalai/lib/python3.8/site-packages/titans/model/vit/vit.py", line 103, in _create_vit_model
    model = VisionTransformer(**model_kwargs)
  File "/root/conda/envs/colossalai/lib/python3.8/site-packages/colossalai/utils/model/utils.py", line 52, in wrapper
    f(module, *args, **kwargs)
  File "/root/conda/envs/colossalai/lib/python3.8/site-packages/titans/decorator/no_support.py", line 57, in new_init
    origin_init(*args, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'hidden_size'
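
Before retrying, it can help to confirm which Titans build is actually being imported, since a stale install is the usual cause of keyword-argument mismatches like this. A small check, assuming the distribution is published under the name "titans":

import inspect
from importlib.metadata import PackageNotFoundError, version

import titans

print("titans imported from:", inspect.getfile(titans))
try:
    print("titans version:", version("titans"))   # distribution name assumed to be "titans"
except PackageNotFoundError:
    print("titans not installed via pip (source/editable checkout?)")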

I think your problem will be resolved by pulling the latest code of Titans as well. Sorry about the unstable APIs; we will improve this in future releases.

Thanks, Liu. The problem was solved by reinstalling Titans, but the training process gets stuck at step 86/196.

I used 4 A6000 GPUs with colossalai run --nproc_per_node 4 train_with_cifar10.py --config configs/vit_1d_tp2_pp2.py

Hi @edwardhorp, thank you for your feedback. We have located the cause and are working on it. We will let you know once it is fixed!

The training process hangs because different pipeline stages can end up with different overflow statuses: if the rank that detected overflow does not join the clip-grad-norm step, the all-reduce inside it blocks the other ranks forever. This bug has been fixed in hpcaitech/ColossalAI#1175.
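
For anyone debugging a similar hang: the essence of the fix is that all ranks must agree on the overflow decision before entering the collective gradient-clipping step. A rough sketch of that pattern (not the actual code from the PR), assuming an initialized torch.distributed process group:

import torch
import torch.distributed as dist

def overflow_anywhere(local_overflow: bool) -> bool:
    # Share the local overflow flag across all ranks so every pipeline stage
    # takes the same branch; otherwise ranks that skip clip-grad-norm leave
    # the others blocked inside its all-reduce.
    flag = torch.tensor([1.0 if local_overflow else 0.0], device="cuda")
    dist.all_reduce(flag, op=dist.ReduceOp.MAX)
    return bool(flag.item())

# Usage: if overflow_anywhere(found_overflow) is True, every rank skips the
# optimizer step together; otherwise every rank clips and steps together.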