failed to run gpt2 zero3 example
CHN-ChenYi opened this issue · comments
Yi Chen commented
🐛 Describe the bug
Command:
OMP_NUM_THREADS=32 torchrun --standalone --nnodes=1 --nproc_per_node 2 train_gpt.py --config=gpt2_configs/gpt2_zero3.py --from_torch
Result:
Traceback (most recent call last):
File "train_gpt.py", line 130, in <module>
main()
File "train_gpt.py", line 56, in main
ctx = ZeroInitContext(target_device=torch.cuda.current_device(),
TypeError: __init__() missing 1 required positional argument: 'convert_fp16'
(The same traceback is printed by the second of the two worker processes.)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 38441) of binary: /home/toga/.conda/envs/ColAI/bin/python
Traceback (most recent call last):
File "/home/toga/.conda/envs/ColAI/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==1.10.1', 'console_scripts', 'torchrun')())
File "/home/toga/.conda/envs/ColAI/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
return f(*args, **kwargs)
File "/home/toga/.conda/envs/ColAI/lib/python3.8/site-packages/torch/distributed/run.py", line 719, in main
run(args)
File "/home/toga/.conda/envs/ColAI/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
elastic_launch(
File "/home/toga/.conda/envs/ColAI/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/toga/.conda/envs/ColAI/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train_gpt.py FAILED
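The failure is a plain API mismatch: the example script was written against an older `ZeroInitContext` signature, while the installed colossalai 0.1.1 added a required `convert_fp16` argument (as the traceback states). The stand-in classes below are NOT colossalai code; they are a minimal, self-contained sketch of how this class of `TypeError` arises when a library's `__init__` gains a new required parameter:

```python
# Hypothetical stand-ins (not colossalai's real classes) illustrating the
# signature mismatch reported in the traceback above.

class ZeroInitContextV1:
    """Stands in for the older API the example script was written against."""
    def __init__(self, target_device):
        self.target_device = target_device

class ZeroInitContextV2:
    """Stands in for an API that added a required convert_fp16 argument."""
    def __init__(self, target_device, convert_fp16):
        self.target_device = target_device
        self.convert_fp16 = convert_fp16

# Calling the newer class with the old call pattern raises a TypeError
# whose message names the missing argument, 'convert_fp16':
try:
    ZeroInitContextV2(target_device="cuda:0")
except TypeError as exc:
    print(exc)
```

The fix is to make the script and the installed library agree on one signature, which is why upgrading colossalai (below) resolves the issue.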
Environment
colossalai
colossalai 0.1.1
nvcc:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Mon_May__3_19:15:13_PDT_2021
Cuda compilation tools, release 11.3, V11.3.109
Build cuda_11.3.r11.3/compiler.29920130_0
Python
Python 3.8.12
PyTorch
torch 1.10.1
Frank Lee commented
Hi @CHN-ChenYi, we have updated the API for the zero init context. Could you install colossalai from source and try again? The installation instructions are as follows:
git clone https://github.com/hpcaitech/ColossalAI.git
cd ColossalAI
# install dependency
pip install -r requirements/requirements.txt
# install colossalai
pip install .
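After reinstalling from source, it is worth confirming that the interpreter actually picks up the freshly built package rather than a stale pip-installed copy. A generic sketch using the standard library (checked here against `pip` only so the snippet runs anywhere; substitute `"colossalai"` in practice):

```python
# Generic version check via importlib.metadata; the package names used
# below are illustrative, not specific to colossalai.
from importlib.metadata import version, PackageNotFoundError
from typing import Optional

def installed_version(pkg: str) -> Optional[str]:
    """Return the installed distribution's version, or None if absent."""
    try:
        return version(pkg)
    except PackageNotFoundError:
        return None

print(installed_version("pip"))  # version string, or None if not installed
```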
Yi Chen commented
> Hi @CHN-ChenYi, we have updated the API for the zero init context. Could you install colossalai from source and try again?
Thanks! The problem was solved after I switched to the latest version and rebuilt colossalai.