CarperAI / trlx

A repo for distributed training of language models with Reinforcement Learning via Human Feedback (RLHF)


multigpu support for summarization ppo example

sayan1101 opened this issue · comments

πŸ› Describe the bug

This is not a bug; I wanted to know how to run the PPO training for summarization. The file I am trying to run is trlx_gptj_text_summarization.py, which is in trlx/examples/summarize_rlhf. I tried to run it with a changed accelerate config:
'''
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
'''

I ran it with accelerate launch --config_file configs/default_accelerate_config.yaml trlx_gptj_text_summarization.py, but got a CUDA out-of-memory error.
I am using 8 x RTX 6000 GPUs, 76 vCPUs, and 400 GB RAM.

Do I need to make changes in the trlx_gptj_text_summarization.py file as well? If yes, please tell me what changes are required.
A quick resolution would be highly appreciated.

Which trlX version are you using?

No response

Additional system and package information

No response

commented

Hello @sayan1101! You can check out the following instructions / configs that were used to train this example: https://github.com/CarperAI/trlx/tree/main/examples/summarize_rlhf#training-process

In particular, this example was trained with a config for two 80GB GPUs, so in order not to run out of memory you have to reduce batch_size in trlx_gptj_text_summarization.py (there is a note at the above link that says this as well).

If you were unsuccessful even after that, or if you still want to use your config, you'd have to make the following changes:

  1. Change rw_device to 7 here:
    rw_device = torch.device("cuda:{}".format(1)) # set reward model device
  2. Change num_processes to 7 in your accelerate config

This way, the reward model will be loaded on the 8th GPU and won't occupy memory needed for training the LLM.
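Putting both changes together, here's a rough sketch (the YAML path and the TRLConfig field name below are assumptions based on the summarize_rlhf example layout, so double-check them against the actual script):

    import torch
    from trlx.data.configs import TRLConfig

    # Pin the reward model to the last GPU (index 7) so it does not share
    # memory with the seven training processes.
    rw_device = torch.device("cuda:{}".format(7))  # set reward model device

    # Shrink the per-device batch size to reduce memory pressure; the YAML
    # filename here is an assumption, check the example's configs/ directory.
    config = TRLConfig.load_yaml("configs/ppo_config_summ_gptj.yml")
    config.train.batch_size = 1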

Thanks for taking the time to reply. I tried using a 4 x A100 GPU instance from RunPod, but even after making the changes you mentioned, I failed to start the training process.
These are the changes I made:
rw_device = torch.device("cuda:{}".format(3))
so that the reward model is loaded on the 4th GPU. I also changed the accelerate config to this:
"""
command_file: null
commands: null
compute_environment: LOCAL_MACHINE
deepspeed_config:
deepspeed_config_file: configs/ds_config_trlx_gptj_summarize.json
zero3_init_flag: false
distributed_type: DEEPSPEED
downcast_bf16: 'no'
dynamo_config: {}
fsdp_config: {}
gpu_ids: null
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
megatron_lm_config: {}
num_machines: 1
num_processes: 3
rdzv_backend: static
same_network: true
tpu_name: null
tpu_zone: null
use_cpu: false
"""

I have set num_processes: 3 in default_accelerate_config.yaml, as shown above, so that training can happen on the remaining 3 GPUs.

But I am getting a runtime error every time:
[Screenshot of the runtime error, 2023-10-24 at 6:34:01 PM]

Please suggest a workaround for this.

commented

@sayan1101 If you could post the whole stack trace, including the error before the timeouts, that would be very helpful. And just to confirm, you're using A100s with 40GB of memory, is that correct?
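If you're not sure, a quick check with plain PyTorch (nothing trlX-specific) is:

    import torch

    # Print every visible GPU and its total memory to confirm whether
    # these are 40GB or 80GB A100s.
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        print(f"cuda:{i}: {props.name}, {props.total_memory / 1024**3:.1f} GiB")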