Xirider / finetune-gpt2xl

Guide: Finetune GPT2-XL (1.5 billion parameters) and GPT-NEO (2.7B) on a single GPU with Huggingface Transformers using DeepSpeed


Errors while trying to train with two GPUs

barakw2021 opened this issue

Hi,

When trying to train on two GPUs, I'm getting this error:

Traceback (most recent call last):
  File "run_clm.py", line 478, in <module>
    main()
  File "run_clm.py", line 441, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/root/miniconda3/lib/python3.8/site-packages/transformers/trainer.py", line 1083, in train
    deepspeed_engine, optimizer, lr_scheduler = deepspeed_init(
  File "/root/miniconda3/lib/python3.8/site-packages/transformers/integrations.py", line 520, in deepspeed_init
    model, optimizer, _, lr_scheduler = deepspeed.initialize(
  File "/root/miniconda3/lib/python3.8/site-packages/deepspeed/__init__.py", line 116, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/root/miniconda3/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 148, in __init__
    self._configure_with_arguments(args, mpu)
  File "/root/miniconda3/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 517, in _configure_with_arguments
    self._config = DeepSpeedConfig(config_file, mpu, param_dict=self.config_params)
  File "/root/miniconda3/lib/python3.8/site-packages/deepspeed/runtime/config.py", line 597, in __init__
    self._configure_train_batch_size()
  File "/root/miniconda3/lib/python3.8/site-packages/deepspeed/runtime/config.py", line 732, in _configure_train_batch_size
    self._set_batch_related_parameters()
  File "/root/miniconda3/lib/python3.8/site-packages/deepspeed/runtime/config.py", line 728, in _set_batch_related_parameters
    assert False,
AssertionError: Either train_batch_size or micro_batch_per_gpu needs to be provided
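For context, DeepSpeed computes train_batch_size as train_micro_batch_size_per_gpu × gradient_accumulation_steps × number of GPUs, so at least one of the batch keys has to be present in the config it receives. A minimal sketch of the relevant ds_config.json keys (values illustrative, not this repo's tuned settings):

  {
      "train_micro_batch_size_per_gpu": 4,
      "gradient_accumulation_steps": 1
  }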

So I added the flag --train_batch_size 8 and got the following error:

Traceback (most recent call last):
  File "run_clm.py", line 478, in <module>
    main()
  File "run_clm.py", line 192, in main
    model_args, data_args, training_args = parser.parse_args_into_dataclasses()
  File "/root/miniconda3/lib/python3.8/site-packages/transformers/hf_argparser.py", line 196, in parse_args_into_dataclasses
    raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}")
ValueError: Some specified arguments are not used by the HfArgumentParser: ['--train_batch_size', '8']
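For what it's worth, HfArgumentParser only accepts flags that map to fields of the Trainer's dataclasses, where the batch-size argument is named --per_device_train_batch_size. A sketch of a two-GPU launch under that assumption (model, files, and values illustrative):

  deepspeed --num_gpus=2 run_clm.py --deepspeed ds_config.json --model_name_or_path gpt2-xl --train_file train.csv --do_train --per_device_train_batch_size 2 --output_dir finetuned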

This looks to me like a mismatch between DeepSpeed and Transformers. Do you have any suggestions on how to solve it?

This is my ds_report:

DeepSpeed general environment info:
torch install path ............... ['/root/miniconda3/lib/python3.8/site-packages/torch']
torch version .................... 1.7.1
torch cuda version ............... 11.0
nvcc version ..................... 11.0
deepspeed install path ........... ['/root/miniconda3/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.3.15, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.7, cuda 11.0

Hi, thanks for the report! @barakw2021
This happened because Huggingface Transformers changed how it consumes the DeepSpeed config. I updated ds_config.json and ds_config_gptneo.json to the new format. Could you please download the new ds_config file from this repo and try again?
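In the newer format the Trainer fills batch-related values in itself; a minimal sketch of that style, assuming the current Transformers integration (the repo's actual ds_config.json carries more settings, e.g. optimizer and ZeRO options):

  {
      "train_batch_size": "auto",
      "train_micro_batch_size_per_gpu": "auto",
      "gradient_accumulation_steps": "auto"
  }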

That machine is no longer available to me. I might run similar training in the next few days; if I do, I'll let you know the results.

Thanks for the quick response

Thanks!

I tested it and it works. Thanks again!