THUDM / SwissArmyTransformer

SwissArmyTransformer is a flexible and powerful library to develop your own Transformer variants.

Home Page: https://THUDM.github.io/SwissArmyTransformer

AssertionError: data parallel group is not initialized

victorup opened this issue

Hi,
I encountered the following error:

Traceback (most recent call last):
  File "/data33/private/xinpeng/codebase/CogView2/pretrain_coglm.py", line 244, in
    training_main(args, model_cls=BaseModel, forward_step_function=forward_step, create_dataset_function=create_dataset_function)
  File "/home/xinpeng/miniconda3/envs/cogview/lib/python3.9/site-packages/SwissArmyTransformer/training/deepspeed_training.py", line 66, in training_main
    train_data, val_data, test_data = make_loaders(args, hooks['create_dataset_function'])
  File "/home/xinpeng/miniconda3/envs/cogview/lib/python3.9/site-packages/SwissArmyTransformer/data_utils/configure_data.py", line 166, in make_loaders
    group=mpu.get_data_parallel_group())
  File "/home/xinpeng/miniconda3/envs/cogview/lib/python3.9/site-packages/SwissArmyTransformer/mpu/initialize.py", line 97, in get_data_parallel_group
    assert _DATA_PARALLEL_GROUP is not None,
AssertionError: data parallel group is not initialized
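
For readers unfamiliar with this failure mode: the last frame shows a getter that asserts a module-level variable has been set before returning it. The sketch below reproduces only that pattern in plain Python. The names `_DATA_PARALLEL_GROUP` and `get_data_parallel_group` and the error message come from the traceback; `initialize_data_parallel_group` and `try_get_group` are hypothetical helpers added for illustration, and the string standing in for the group is an assumption (the real library presumably stores a process group object there). It is not the library's actual implementation.

```python
# Stand-in for the module-level state seen in the traceback.
# Assumption: the real library stores a process group object here.
_DATA_PARALLEL_GROUP = None


def initialize_data_parallel_group(group):
    """Hypothetical setup step that records the group."""
    global _DATA_PARALLEL_GROUP
    _DATA_PARALLEL_GROUP = group


def get_data_parallel_group():
    """Return the group, failing loudly if setup never ran."""
    assert _DATA_PARALLEL_GROUP is not None, \
        "data parallel group is not initialized"
    return _DATA_PARALLEL_GROUP


def try_get_group():
    """Return (group, error_message) instead of raising."""
    try:
        return get_data_parallel_group(), None
    except AssertionError as exc:
        return None, str(exc)
```

Calling `try_get_group()` before `initialize_data_parallel_group(...)` yields the same message as in the traceback; calling it afterwards returns the stored group. In other words, the error indicates that the setup step was skipped or did not take effect, not that the getter itself is broken.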

This is strange. Have you made any changes to the file?
Please first try installing SwissArmyTransformer<0.3. If the error persists, please let me know.

When I downgraded deepspeed to version 0.6.3, it worked. Thank you!
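
To compare a local environment against the versions mentioned in this thread, a small check along these lines may help. It is a sketch using only the standard library. The distribution names `SwissArmyTransformer` and `deepspeed` are assumed to match the names the packages are installed under, and the helper functions are hypothetical. The thread does not say whether both version constraints were needed together, so the two are reported separately.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string, or None if not installed."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


def version_tuple(text):
    """Parse leading numeric components, e.g. '0.6.3' -> (0, 6, 3)."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def check_versions(sat_version, deepspeed_version):
    """Compare against the versions mentioned in the thread.

    Each argument is a version string, or None if the package is absent.
    """
    return {
        # Maintainer's suggestion: SwissArmyTransformer<0.3
        "sat_below_0_3": sat_version is not None
        and version_tuple(sat_version) < (0, 3),
        # Reporter's working setup: deepspeed 0.6.3
        "deepspeed_is_0_6_3": deepspeed_version is not None
        and version_tuple(deepspeed_version) == (0, 6, 3),
    }


if __name__ == "__main__":
    sat = installed_version("SwissArmyTransformer")
    ds = installed_version("deepspeed")
    print("SwissArmyTransformer:", sat, "| deepspeed:", ds)
    print(check_versions(sat, ds))
```

A `False` entry only means the environment differs from what was reported here; it does not by itself show that the environment is broken.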