bigscience-workshop / Megatron-DeepSpeed

Ongoing research training transformer language models at scale, including: BERT & GPT-2


ZeRO stage 3 error: IndexError: list index out of range

PhdShi opened this issue · comments

commented

I am using the DeepSpeed framework to train a Megatron model. ZeRO stage 1 and stage 2 ran fine, but when I switched to stage 3, the following error occurred:

    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
  File "/workspace/Megatron-DeepSpeed/megatron/training.py", line 136, in pretrain
    model, optimizer, lr_scheduler = setup_model_and_optimizer(
  File "/workspace/Megatron-DeepSpeed/megatron/training.py", line 518, in setup_model_and_optimizer
    model, optimizer, _, lr_scheduler = deepspeed.initialize(
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/__init__.py", line 165, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 309, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 1184, in _configure_optimizer
    self.optimizer = self._configure_zero_optimizer(basic_optimizer)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 1474, in _configure_zero_optimizer
    optimizer = DeepSpeedZeroOptimizer_Stage3(
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/stage3.py", line 300, in __init__
    self._setup_for_real_optimizer()
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/stage3.py", line 382, in _setup_for_real_optimizer
    self.initialize_optimizer_states()
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/stage3.py", line 892, in initialize_optimizer_states
    self._optimizer_step(i)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/stage3.py", line 817, in _optimizer_step
    self.optimizer.step()
  File "/usr/local/lib/python3.8/dist-packages/torch/optim/optimizer.py", line 265, in wrapper
    out = func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/apex/optimizers/fused_adam.py", line 121, in step
    device = group['params'][0].device
IndexError: list index out of range
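
For context, the failing line in Apex's fused_adam.py takes the device of the first parameter in each optimizer group, which breaks as soon as a group arrives with an empty params list, as can happen under ZeRO stage 3 partitioning. A minimal sketch of the pattern (hypothetical tensors standing in for real model parameters, not the actual Apex source):

    # Hypothetical repro of the failing pattern, not the actual Apex code:
    # FusedAdam.step() reads group['params'][0].device, which raises
    # IndexError whenever a parameter group is empty.
    import torch

    param_groups = [
        {"params": [torch.nn.Parameter(torch.zeros(4))]},  # normal group
        {"params": []},  # ZeRO stage 3 can hand the optimizer an empty group
    ]

    for group in param_groups:
        device = group["params"][0].device  # IndexError on the empty group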

My stage3_config.json is:

{
  "gradient_accumulation_steps": 1,
  "train_micro_batch_size_per_gpu": 1,
  "zero_optimization": {
    "stage": 3,
    "allgather_partitions": true,
    "allgather_bucket_size": 5e8,
    "overlap_comm": false,
    "reduce_scatter": true,
    "reduce_bucket_size": 5e8,
    "contiguous_gradients": false,
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_prefetch_bucket_size": 5e8,
    "stage3_param_persistence_threshold": 1e6,
    "stage3_gather_16bit_weights_on_model_save": false,
    "sub_group_size": 1e12
  }
}
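
As a quick sanity check, the file can be parsed and inspected before launching training (a minimal sketch, assuming the file name above):

    # Illustrative sanity check: confirm the config parses as JSON and
    # actually requests ZeRO stage 3.
    import json

    with open("stage3_config.json") as f:
        cfg = json.load(f)

    assert cfg["zero_optimization"]["stage"] == 3
    print("micro batch size per GPU:", cfg["train_micro_batch_size_per_gpu"])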

I don't understand why this error occurred. Other arguments:

          --tensor-model-parallel-size 1 \
          --pipeline-model-parallel-size 1 \
          --no-pipeline-parallel \
          --log-interval 1 \
          --save-interval 500 \
          --eval-interval 100 \
          --eval-iters 10 \
commented

Bug fixed. It's an Apex error, fixed by this PR: https://github.com/NVIDIA/apex/pull/1596
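
For anyone hitting the same traceback on an older Apex build, the gist of the fix is to stop assuming every parameter group is non-empty. A hedged sketch of that guard (illustrative only; see the linked PR for the actual change):

    # Illustrative guard, not the literal Apex diff: skip empty parameter
    # groups instead of indexing into them.
    import torch

    param_groups = [
        {"params": [torch.nn.Parameter(torch.zeros(4))]},
        {"params": []},  # previously fatal; now skipped
    ]

    for group in param_groups:
        if len(group["params"]) == 0:
            continue  # nothing to update in this group
        device = group["params"][0].device
        # ... proceed with the fused optimizer update on `device`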