hpcaitech / ColossalAI-Examples

Examples of training models with hybrid parallelism using ColossalAI

Vision Transformer cifar10 bug

gaow0007 opened this issue · comments

πŸ› Describe the bug

When I run a ViT experiment with the following command

node=76
prefix="srun --nodes=1 --gres=gpu:4 --cpus-per-task=4 --ntasks=1 -w SG-IDC1-10-51-2-$node"
$prefix colossalai run --nproc_per_node 4  train_with_cifar10.py --config configs/vit_1d_tp2_pp2.py --host=10.51.2.$node

I got:

tensor shape 128
Traceback (most recent call last):
  File "train_with_cifar10.py", line 122, in <module>
    main()
  File "train_with_cifar10.py", line 116, in main
    engine.execute_schedule(data_iter, return_output_label=False)
  File "/mnt/lustre/wgao/miniconda3/envs/ColossalAI/lib/python3.8/site-packages/colossalai/engine/_base_engine.py", line 198, in execute_schedule
    output, label, loss = self._schedule.forward_backward_step(self, data_iter, **kwargs)
  File "/mnt/lustre/wgao/miniconda3/envs/ColossalAI/lib/python3.8/site-packages/colossalai/engine/schedule/_pipeline_schedule.py", line 303, in forward_backward_step
    input_tensor = comm.recv_forward(ft_shape,
  File "/mnt/lustre/wgao/miniconda3/envs/ColossalAI/lib/python3.8/site-packages/colossalai/communication/p2p.py", line 194, in recv_forward
    input_tensor, _ = _communicate(recv_prev=True,
  File "/mnt/lustre/wgao/miniconda3/envs/ColossalAI/lib/python3.8/site-packages/colossalai/communication/p2p.py", line 119, in _communicate
    tensor_recv_prev, recv_prev_split = create_recv_buffer_with_shapes(recv_prev_shape, dtype,
  File "/mnt/lustre/wgao/miniconda3/envs/ColossalAI/lib/python3.8/site-packages/colossalai/communication/p2p.py", line 49, in create_recv_buffer_with_shapes
    recv_chunk_shape, recv_split = _get_tensor_shape(recv_shape, scatter_gather_tensors)
  File "/mnt/lustre/wgao/miniconda3/envs/ColossalAI/lib/python3.8/site-packages/colossalai/communication/p2p.py", line 30, in _get_tensor_shape
    tensor_chunk_shape = reduce(operator.mul, tensor_shape, 1)
TypeError: reduce() arg 2 must support iteration
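
For context, p2p.py multiplies the elements of the expected tensor shape with reduce(); the error suggests the shape reached that code as a bare int (consistent with the "tensor shape 128" print above) rather than a tuple or torch.Size. A minimal sketch of that failure mode, not the actual ColossalAI code path:

import operator
from functools import reduce

# Assumption for illustration: the shape arrives as a plain int instead of a sequence.
tensor_shape = 128
# reduce(operator.mul, tensor_shape, 1)        # TypeError: reduce() arg 2 must support iteration

tensor_shape = (128,)                           # what the receive-buffer code expects
print(reduce(operator.mul, tensor_shape, 1))    # 128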

Environment

I installed ColossalAI via

pip install colossalai==0.1.6+torch1.10cu10.2 -f https://release.colossalai.org

Other environment information was collected with python -m torch.utils.collect_env:

PyTorch version: 1.11.0+cu102
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: CentOS Linux 7 (Core) (x86_64)
GCC version: (GCC) 5.3.0
Clang version: Could not collect
CMake version: version 3.19.3
Libc version: glibc-2.17

Python version: 3.8.13 (default, Mar 28 2022, 11:38:47)  [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-3.10.0-693.el7.x86_64-x86_64-with-glibc2.17
Is CUDA available: True
CUDA runtime version: 10.1.243
GPU models and configuration: 
GPU 0: Tesla V100-PCIE-32GB
GPU 1: Tesla V100-PCIE-32GB
GPU 2: Tesla V100-PCIE-32GB
GPU 3: Tesla V100-PCIE-32GB
GPU 4: Tesla V100-PCIE-32GB
GPU 5: Tesla V100-PCIE-32GB
GPU 6: Tesla V100-PCIE-32GB
GPU 7: Tesla V100-PCIE-32GB

Nvidia driver version: 470.63.01
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] colossalai==0.1.6+torch1.10cu10.2
[pip3] numpy==1.22.4
[pip3] torch==1.11.0
[pip3] torchvision==0.12.0
[conda] colossalai                0.1.6+torch1.10cu10.2          pypi_0    pypi
[conda] numpy                     1.22.4                   pypi_0    pypi
[conda] torch                     1.11.0                   pypi_0    pypi
[conda] torchvision               0.12.0                   pypi_0    pypi

I got the same problem. If I change the config file to vit_pipeline.py, the error becomes:

TypeError: layer_norm(): argument 'input' (position 1) must be Tensor, not list
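
That message usually means a list of tensors reached a LayerNorm that expects a single Tensor (for example, a pipeline stage passing its outputs along unwrapped). A minimal sketch with made-up shapes, not the example's actual model code:

import torch
import torch.nn.functional as F

x = torch.randn(2, 4)
print(F.layer_norm(x, (4,)).shape)   # works: torch.Size([2, 4])
# F.layer_norm([x], (4,))            # TypeError: layer_norm(): argument 'input'
#                                    # (position 1) must be Tensor, not list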

hpcaitech/ColossalAI#1100
This PR resolved the related bugs. You can try again with the latest main branch code.

Thanks, Liu. I pulled the latest code of ColossalAI and ColossalAI-Examples, and then I got another error from Titans:

Traceback (most recent call last):
  File "train_with_cifar10.py", line 13, in <module>
    from titans.model.vit.vit import _create_vit_model
  File "/root/conda/envs/colossalai/lib/python3.8/site-packages/titans/__init__.py", line 3, in <module>
    from . import model
  File "/root/conda/envs/colossalai/lib/python3.8/site-packages/titans/model/__init__.py", line 2, in <module>
    from . import gpt
  File "/root/conda/envs/colossalai/lib/python3.8/site-packages/titans/model/gpt/__init__.py", line 1, in <module>
    from .gpt import *
  File "/root/conda/envs/colossalai/lib/python3.8/site-packages/titans/model/gpt/gpt.py", line 6, in <module>
    from colossalai.builder.pipeline import partition_uniform
ModuleNotFoundError: No module named 'colossalai.builder.pipeline'
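
A hypothetical guard (not a fix) that makes the version mismatch explicit instead of failing deep inside Titans; the import path below is simply the one taken from the traceback:

# Hypothetical compatibility guard around the import that fails above.
try:
    from colossalai.builder.pipeline import partition_uniform   # old ColossalAI location
except ModuleNotFoundError as exc:
    raise RuntimeError(
        "This colossalai build no longer provides colossalai.builder.pipeline; "
        "update titans to a release that matches it."
    ) from exc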

Even after I solved this problem, I got another error from Titans:

Traceback (most recent call last):
  File "train_with_cifar10.py", line 119, in <module>
    main()
  File "train_with_cifar10.py", line 54, in main
    model = _create_vit_model(**model_kwargs)
  File "/root/conda/envs/colossalai/lib/python3.8/site-packages/titans/model/vit/vit.py", line 103, in _create_vit_model
    model = VisionTransformer(**model_kwargs)
  File "/root/conda/envs/colossalai/lib/python3.8/site-packages/colossalai/utils/model/utils.py", line 52, in wrapper
    f(module, *args, **kwargs)
  File "/root/conda/envs/colossalai/lib/python3.8/site-packages/titans/decorator/no_support.py", line 57, in new_init
    origin_init(*args, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'hidden_size'
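
Before retrying, it can help to confirm which Titans build is actually being imported, since a stale install is the usual cause of keyword-argument mismatches like this. A small check, assuming the distribution is published under the name "titans":

import inspect
from importlib.metadata import PackageNotFoundError, version

import titans

print("titans imported from:", inspect.getfile(titans))
try:
    print("titans version:", version("titans"))   # distribution name assumed to be "titans"
except PackageNotFoundError:
    print("titans not installed via pip (source/editable checkout?)")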

I think your problem will be resolved by pulling the latest code of Titans as well. Sorry about the unstable APIs; we will improve this in future releases.

Thanks, Liu. The problem was solved by reinstalling Titans, but the training process gets stuck at step 86/196.

I used 4 A6000 GPUs with colossalai run --nproc_per_node 4 train_with_cifar10.py --config configs/vit_1d_tp2_pp2.py

Hi @edwardhorp, thank you for your feedback. We have located the cause and are working on it. We will let you know once it is fixed!

The training process hangs because different pipeline stages can end up with different overflow statuses: if the rank that detected overflow does not join the clip-grad-norm step, the all-reduce inside it blocks the other ranks forever. This bug has been fixed in hpcaitech/ColossalAI#1175.
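
For anyone debugging a similar hang: the essence of the fix is that all ranks must agree on the overflow decision before entering the collective gradient-clipping step. A rough sketch of that pattern (not the actual code from the PR), assuming an initialized torch.distributed process group:

import torch
import torch.distributed as dist

def overflow_anywhere(local_overflow: bool) -> bool:
    # Share the local overflow flag across all ranks so every pipeline stage
    # takes the same branch; otherwise ranks that skip clip-grad-norm leave
    # the others blocked inside its all-reduce.
    flag = torch.tensor([1.0 if local_overflow else 0.0], device="cuda")
    dist.all_reduce(flag, op=dist.ReduceOp.MAX)
    return bool(flag.item())

# Usage: if overflow_anywhere(found_overflow) is True, every rank skips the
# optimizer step together; otherwise every rank clips and steps together.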