stage3 error: IndexError: list index out of range
PhdShi opened this issue · comments
PhdShi commented
I am using the DeepSpeed framework to train a Megatron model. ZeRO stage 1 and stage 2 worked fine, but when I switched to stage 3, the following error occurred:
    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
  File "/workspace/Megatron-DeepSpeed/megatron/training.py", line 136, in pretrain
    model, optimizer, lr_scheduler = setup_model_and_optimizer(
  File "/workspace/Megatron-DeepSpeed/megatron/training.py", line 518, in setup_model_and_optimizer
    model, optimizer, _, lr_scheduler = deepspeed.initialize(
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/__init__.py", line 165, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 309, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 1184, in _configure_optimizer
    self.optimizer = self._configure_zero_optimizer(basic_optimizer)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 1474, in _configure_zero_optimizer
    optimizer = DeepSpeedZeroOptimizer_Stage3(
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/stage3.py", line 300, in __init__
    self._setup_for_real_optimizer()
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/stage3.py", line 382, in _setup_for_real_optimizer
    self.initialize_optimizer_states()
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/stage3.py", line 892, in initialize_optimizer_states
    self._optimizer_step(i)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/stage3.py", line 817, in _optimizer_step
    self.optimizer.step()
  File "/usr/local/lib/python3.8/dist-packages/torch/optim/optimizer.py", line 265, in wrapper
    out = func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/apex/optimizers/fused_adam.py", line 121, in step
    device = group['params'][0].device
IndexError: list index out of range
My stage3_config.json is:
{
  "gradient_accumulation_steps": 1,
  "train_micro_batch_size_per_gpu": 1,
  "zero_optimization": {
    "stage": 3,
    "allgather_partitions": true,
    "allgather_bucket_size": 5e8,
    "overlap_comm": false,
    "reduce_scatter": true,
    "reduce_bucket_size": 5e8,
    "contiguous_gradients": false,
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_prefetch_bucket_size": 5e8,
    "stage3_param_persistence_threshold": 1e6,
    "stage3_gather_16bit_weights_on_model_save": false,
    "sub_group_size": 1e12
  }
}
I don't understand why this error occurred. My other arguments are:
--tensor-model-parallel-size 1 \
--pipeline-model-parallel-size 1 \
--no-pipeline-parallel \
--log-interval 1 \
--save-interval 500 \
--eval-interval 100 \
--eval-iters 10 \
PhdShi commented
Bug fixed. It was an apex error, and it is fixed by this pull request: https://github.com/NVIDIA/apex/pull/1596
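For anyone hitting the same traceback: the failing line, `device = group['params'][0].device`, raises IndexError whenever a parameter group holds an empty `params` list. The sketch below reproduces that failure mode in plain Python and shows one possible guard. The helper names (`first_param_device`, `step_devices`) and the stand-in parameter objects are hypothetical, and whether the linked apex change uses exactly this guard is an assumption, so check the pull request for the actual fix.

```python
from types import SimpleNamespace


def first_param_device(group):
    """Mimic the failing line: read the device of the group's first parameter."""
    return group["params"][0].device


def step_devices(param_groups):
    """Collect one device per group, skipping groups that hold no parameters."""
    devices = []
    for group in param_groups:
        # Hypothetical guard: an empty group has no first parameter to inspect.
        if len(group["params"]) == 0:
            continue
        devices.append(first_param_device(group))
    return devices


# Stand-in parameter groups: one populated, one empty.
param_groups = [
    {"params": [SimpleNamespace(device="cuda:0")]},
    {"params": []},
]

try:
    first_param_device(param_groups[1])
except IndexError as exc:
    print("unguarded access failed:", exc)

print("guarded devices:", step_devices(param_groups))
```

The guard only skips empty groups and leaves populated groups untouched, so it does not change behavior for the stage 1 and stage 2 runs that already worked.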