hpcaitech / ColossalAI-Examples

Examples of training models with hybrid parallelism using ColossalAI


failed to run gpt2 zero3 example

CHN-ChenYi opened this issue · comments

πŸ› Describe the bug

Command:

OMP_NUM_THREADS=32 torchrun --standalone --nnodes=1 --nproc_per_node 2 train_gpt.py --config=gpt2_configs/gpt2_zero3.py --from_torch

Result:

Traceback (most recent call last):
  File "train_gpt.py", line 130, in <module>
    main()
  File "train_gpt.py", line 56, in main
    ctx = ZeroInitContext(target_device=torch.cuda.current_device(),
TypeError: __init__() missing 1 required positional argument: 'convert_fp16'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 38441) of binary: /home/toga/.conda/envs/ColAI/bin/python
Traceback (most recent call last):
  File "/home/toga/.conda/envs/ColAI/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.10.1', 'console_scripts', 'torchrun')())
  File "/home/toga/.conda/envs/ColAI/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/toga/.conda/envs/ColAI/lib/python3.8/site-packages/torch/distributed/run.py", line 719, in main
    run(args)
  File "/home/toga/.conda/envs/ColAI/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/home/toga/.conda/envs/ColAI/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/toga/.conda/envs/ColAI/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train_gpt.py FAILED
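The `TypeError` comes from a signature mismatch: the installed colossalai 0.1.1 release declares `convert_fp16` as a required positional parameter of `ZeroInitContext.__init__`, while the example script was written against a newer API that no longer passes it. A minimal stand-in (not the real `ZeroInitContext` class) reproduces the failure mode:

```python
# Simplified stand-in for the mismatch -- NOT the real ZeroInitContext API.
# The 0.1.1 release requires an extra positional argument ('convert_fp16')
# that the newer example script no longer supplies.
class ZeroInitContextDemo:
    def __init__(self, target_device, convert_fp16):
        self.target_device = target_device
        self.convert_fp16 = convert_fp16

try:
    # Call shaped like the newer script: 'convert_fp16' is never passed.
    ZeroInitContextDemo(target_device="cuda:0")
except TypeError as e:
    # e.g. "__init__() missing 1 required positional argument: 'convert_fp16'"
    print(e)
```

Upgrading colossalai so the installed signature matches the script (as suggested below) removes the mismatch.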

Environment

colossalai

colossalai               0.1.1

nvcc:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Mon_May__3_19:15:13_PDT_2021
Cuda compilation tools, release 11.3, V11.3.109
Build cuda_11.3.r11.3/compiler.29920130_0

Python

Python 3.8.12

PyTorch

torch                    1.10.1

Hi @CHN-ChenYi , we have updated the API for the zero init context. Could you install colossalai from source and try again? The installation instructions are as follows:

git clone https://github.com/hpcaitech/ColossalAI.git
cd ColossalAI

# install dependency
pip install -r requirements/requirements.txt

# install colossalai
pip install .
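After installing from source, it is worth confirming that the environment actually picks up a build newer than the broken 0.1.1 wheel. A hedged sketch of a simple version check (the plain tuple compare avoids extra dependencies; `colossalai.__version__` is an assumption about the package's version attribute, so it is left commented out):

```python
def version_tuple(v: str):
    """Turn a version string like '0.1.1' into (0, 1, 1) for comparison."""
    return tuple(int(part) for part in v.split(".")[:3])

# Release whose ZeroInitContext signature still requires 'convert_fp16':
broken = version_tuple("0.1.1")

# After installing from source, uncomment to check the live install:
# import colossalai
# assert version_tuple(colossalai.__version__) > broken

# Any later release should compare greater than the broken one:
assert version_tuple("0.1.2") > broken
```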


Thanks! The problem was solved after I switched to the latest version and rebuilt colossalai.