CarperAI / trlx

A repo for distributed training of language models with Reinforcement Learning via Human Feedback (RLHF)

Training stuck generating rollouts

javirandor opened this issue · comments

๐Ÿ› Describe the bug

This is a follow-up from issue #399.

I am facing this same issue even with the updated code.

I am trying to reproduce the HH fine-tuning example on Alpaca.

trlx.train(
    prompts=prompts,
    eval_prompts=eval_prompts,
    reward_fn=reward_fn,
    config=config,
    stop_sequences=["Human:", "human:", "Assistant:", "assistant:"]
)

My code gets stuck while generating the second batch of rollouts (16/64).

I am running the code on a cluster with 8xA100s (80GB) and using a custom reward model. Providing a minimal reproducible example is a bit hard with my current setup. Do you have any pointers that could help me debug this issue?
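
For reference, my reward function roughly follows the signature that trlX passes to reward_fn, something like the sketch below (simplified; reward_model and reward_tokenizer are placeholders for my custom components, not part of trlX):

import torch

# Simplified sketch of the reward function; `reward_model` and `reward_tokenizer`
# are placeholders for my custom reward model and its tokenizer.
def reward_fn(samples, prompts, outputs, **kwargs):
    with torch.no_grad():
        inputs = reward_tokenizer(
            samples, padding=True, truncation=True, return_tensors="pt"
        ).to(reward_model.device)
        scores = reward_model(**inputs).logits.squeeze(-1)
    # Return plain floats so no reward-model tensors leak back into the trainer.
    return [score.item() for score in scores]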

My accelerate config, as taken from the repo:

compute_environment: LOCAL_MACHINE
deepspeed_config:
  deepspeed_multinode_launcher: standard
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: no
dynamo_config: {}
fsdp_config: {}
machine_rank: 0
main_training_function: main
megatron_lm_config: {}
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
use_cpu: false

Which trlX version are you using?

Installed from source @355c974

Additional system and package information

Python 3.9, transformers==4.28.1, torch==2.0.0, accelerate==0.18.0, deepspeed==0.9.1

The code gets stuck in the gather_dict function in utils/modeling.py, called from make_experience; more specifically, in the line torch.distributed.all_gather_object(objs, obj).

I could solve the problem by commenting out the calls to the gather_dict function, since they were only creating metadata that is not useful in my use case. I would be curious to see if there is a cleaner solution to this issue. It seems to be a native torch issue, but the solution suggested there did not solve the problem.
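
For context, the pattern that hangs boils down to something like the sketch below (simplified, not the actual trlX code). all_gather_object is a collective, so every rank in the process group has to reach it; if any rank is delayed or never makes the call (for example because its reward computation blocks), the remaining ranks wait until the watchdog timeout.

import torch.distributed as dist

def gather_dict_sketch(obj):
    # Collective call: blocks until *every* rank in the group calls it.
    # If one rank never reaches this line, the others hang here.
    objs = [None] * dist.get_world_size()
    dist.all_gather_object(objs, obj)
    return objs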

I am now facing a timeout in the training loop. The operation self.accelerator.backward(loss) in accelerate_base_trainer.py times out; the trace is below. I checked that all generations are non-empty and that a reward was computed for each of them.

[rollout 64 / 64]: 100%|████████████████████████████████████████| 64/64 [02:23<00:00,  2.24s/it]
[RANK 0] Starting training
[RANK 0] Evaluating model
[generation sweep 1/1 | eval batch 8/8]: 100%|████████████████████████████████| 8/8 [00:50<00:00,  6.30s/it]
[RANK 0] Computing rewards
[RANK 0] Summarizing evaluation
                                  Evaluation #0 reward/mean: -1.82
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┓
┃ prompt                                 ┃ output                                 ┃ reward ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━┩
...
└────────────────────────────────────────┴────────────────────────────────────────┴────────┘
  0%|                                                                       | 0/6000 [00:00<?, ?it/s]

[2] NCCL INFO Using network Socket
[E ProcessGroupNCCL.cpp:828] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=316, OpType=REDUCE, Timeout(ms)=1800000) ran for 1804999 milliseconds before timing out.
[0] NCCL INFO comm 0x5604f6165540 rank 1 nranks 2 cudaDev 1 busId 81000 - Abort COMPLETE
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=316, OpType=REDUCE, Timeout(ms)=1800000) ran for 1804999 milliseconds before timing out.
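
One general workaround for collectives that are slow but would eventually succeed is to raise the process-group timeout above the 30-minute default. A sketch of the mechanism with Accelerate's InitProcessGroupKwargs is below; note that trlX constructs its Accelerator internally, so this is only an illustration, not something trlx.train exposes directly.

from datetime import timedelta
from accelerate import Accelerator, InitProcessGroupKwargs

# Sketch: extend the collective timeout beyond the default 1800 seconds so a
# slow reward pass does not trip the NCCL watchdog. trlX builds its own
# Accelerator, so this would need to be wired into the trainer rather than
# passed through trlx.train().
pg_kwargs = InitProcessGroupKwargs(timeout=timedelta(seconds=7200))
accelerator = Accelerator(kwargs_handlers=[pg_kwargs])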

If I run on a single process with 2 GPUs (one for the trainable model and one for the reward model), I get the following error instead. I thought it could be useful for identifying the problem.

Traceback (most recent call last):

trlx_training.py:153 in <module>

  150 )
  151
  152 print("Launching training")
❱ 153 trlx.train(
  154     prompts=prompts,
  155     eval_prompts=eval_prompts,
  156     reward_fn=reward_fn,

/trlx/trlx/trlx.py:128 in train

  125     )
  126     trainer.add_eval_pipeline(eval_pipeline)
  127
❱ 128     trainer.learn()
  129     return trainer
  130

/trlx/trlx/trainer/accelerate_base_trainer.py:546 in learn

  543                             forward_time += time()
  544                             backward_time -= time()
  545                             print("going to backward", os.environ["RANK"])
❱ 546                             self.accelerator.backward(loss)
  547                             print("loss backwarded")
  548                             backward_time += time()
  549                             stats_accum.append(stats)

/miniconda3/envs/trlx/lib/python3.9/site-packages/accelerate/accelerator.py:1677 in backward

  1674             # deepspeed handles loss scaling by gradient_accumulation_steps in its `back
  1675             loss = loss / self.gradient_accumulation_steps
  1676         if self.distributed_type == DistributedType.DEEPSPEED:
❱ 1677             self.deepspeed_engine_wrapped.backward(loss, **kwargs)
  1678         elif self.distributed_type == DistributedType.MEGATRON_LM:
  1679             return
  1680         elif self.scaler is not None:

/miniconda3/envs/trlx/lib/python3.9/site-packages/accelerate/utils/deepspeed.py:176 in backward

  173         # - zero grad
  174         # - checking overflow
  175         # - lr_scheduler step (only if engine.lr_scheduler is not None)
❱ 176         self.engine.step()
  177         # and this plugin overrides the above calls with no-ops when Accelerate runs und
  178         # Deepspeed, but allows normal functionality for non-Deepspeed cases thus enabli
  179         # training loop that works transparently under many training regimes.

/miniconda3/envs/trlx/lib/python3.9/site-packages/deepspeed/runtime/engine.py:1988 in step

  1985                     and self.quantizer.any_precision_switch()):
  1986                 self._take_model_step(lr_kwargs, self.block_eigenvalue)
  1987             else:
❱ 1988                 self._take_model_step(lr_kwargs)
  1989
  1990             report_progress = self.global_rank == 0 if self.global_rank else True
  1991

/miniconda3/envs/trlx/lib/python3.9/site-packages/deepspeed/runtime/engine.py:1895 in _take_model_step

  1892                 # https://nvidia.github.io/apex/advanced.html#gradient-clipping
  1893                 master_params = amp.master_params(self.optimizer)
  1894                 clip_grad_norm_(parameters=master_params, max_norm=self.gradient_clippin
❱ 1895         self.optimizer.step()
  1896
  1897         if hasattr(self.optimizer, '_global_grad_norm'):
  1898             self._global_grad_norm = self.optimizer._global_grad_norm

/miniconda3/envs/trlx/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py:1702 in step

  1699                 # create a flat gradients for parameters updated by this process
  1700                 # If we are last partition, ensure we have same size grads and partition
  1701                 if partition_id == dist.get_world_size(group=self.real_dp_process_group[
❱ 1702                     single_grad_partition = self.flatten_dense_tensors_aligned(
  1703                         self.averaged_gradients[i],
  1704                         int(self.partition_size[i])).to(self.single_partition_of_fp32_gr
  1705                 else:

/miniconda3/envs/trlx/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py:824 in flatten_dense_tensors_aligned

  821
  822     # create a flat tensor aligned at the alignment boundary
  823     def flatten_dense_tensors_aligned(self, tensor_list, alignment):
❱ 824         return self.flatten(align_dense_tensors(tensor_list, alignment))
  825
  826     ############### Independent Partition Gradient ########################
  827     def reduce_independent_p_g_buckets_and_remove_grads(self, param, i):
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! (when 
checking argument for argument tensors in method wrapper_CUDA_cat)
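
To narrow down where the cuda:0/cuda:1 mix comes from, a small check like the sketch below might help (model is a placeholder for whatever handle points at the policy being optimized, not an actual trlX attribute):

# Sketch: before the optimizer step, verify that every trainable parameter and
# its gradient sit on the same device. `model` is a placeholder for the policy
# the trainer is optimizing.
devices = set()
for name, p in model.named_parameters():
    devices.add(str(p.device))
    if p.grad is not None and p.grad.device != p.device:
        print(f"mismatch: {name} param on {p.device}, grad on {p.grad.device}")
print("parameter devices:", devices)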
commented

That's a peculiar issue. Do these errors also occur when using the example scripts and/or the example reward models and base models? And if you could share your launching script, that might also be helpful.

@javirandor Any update on this? Does this also happen with example scripts?

commented

I was not able to reproduce the error, but please do retry with the most recent code if the issue is still relevant.