Xirider / finetune-gpt2xl

Guide: Finetune GPT2-XL (1.5 billion parameters) and GPT-NEO (2.7B) on a single GPU with Hugging Face Transformers using DeepSpeed

TypeError: unsupported operand type(s) for -: 'float' and 'str' on AWS g4dn.12xlarge

sibeshkar opened this issue

Hi, thanks for making this repo. I'm on a g4dn.12xlarge (4 GPUs) Deep Learning AMI on AWS and trying to get this working, but I keep running into the error below. Is there anything I'm missing? TypeError: unsupported operand type(s) for -: 'float' and 'str'

Traceback (most recent call last):
  File "run_clm.py", line 478, in <module>
    main()
  File "run_clm.py", line 441, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/transformers/trainer.py", line 969, in train
    self, num_training_steps=max_steps, resume_from_checkpoint=resume_from_checkpoint
  File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/transformers/integrations.py", line 448, in init_deepspeed
    lr_scheduler=lr_scheduler,
  File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/deepspeed/__init__.py", line 125, in initialize
    config_params=config_params)
  File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 187, in __init__
    self._configure_lr_scheduler(lr_scheduler)
  File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 447, in _configure_lr_scheduler
    lr_scheduler = self._scheduler_from_config(self.optimizer)
  File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 489, in _scheduler_from_config
    instantiated_scheduler = scheduler(optimizer, **scheduler_params)
  File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/deepspeed/runtime/lr_schedules.py", line 708, in __init__
    self.delta_lrs = [big - small for big, small in zip(self.max_lrs, self.min_lrs)]
  File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/deepspeed/runtime/lr_schedules.py", line 708, in <listcomp>
    self.delta_lrs = [big - small for big, small in zip(self.max_lrs, self.min_lrs)]
TypeError: unsupported operand type(s) for -: 'float' and 'str'
Killing subprocess 35034
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/deepspeed/launcher/launch.py", line 171, in <module>
    main()
  File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/deepspeed/launcher/launch.py", line 161, in main
    sigkill_handler(signal.SIGTERM, None)  # not coming back
  File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/deepspeed/launcher/launch.py", line 139, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/ubuntu/anaconda3/envs/pytorch_latest_p37/bin/python3.7', '-u', 'run_clm.py', '--local_rank=0', '--deepspeed', 'ds_config.json', '--model_name_or_path', 'gpt2-xl', '--train_file', 'train.csv', '--validation_file', 'validation.csv', '--do_train', '--do_eval', '--fp16', '--overwrite_cache', '--evaluation_strategy=steps', '--output_dir', 'finetuned', '--eval_steps', '200', '--num_train_epochs', '1', '--gradient_accumulation_steps', '2', '--per_device_train_batch_size', '8']' returned non-zero exit status 1.
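The traceback points at DeepSpeed's lr_schedules.py computing `big - small` over `zip(self.max_lrs, self.min_lrs)`, so at least one of the scheduler's learning-rate parameters in ds_config.json is arriving as a string (for example `"auto"` or a quoted number like `"5e-6"`) instead of a float. A minimal sketch of the failure mode and a coercion workaround is below; the `cycle_min_lr`/`cycle_max_lr` keys are illustrative examples of scheduler params, not necessarily the exact keys in your config:

```python
def coerce_scheduler_params(params):
    """Convert numeric-looking string values (e.g. "5e-6") to floats.

    Non-numeric strings such as "auto" are left untouched; those must be
    replaced with real numbers in ds_config.json (or substituted by a
    Transformers version that supports "auto" values) before DeepSpeed
    sees them.
    """
    fixed = {}
    for key, value in params.items():
        if isinstance(value, str):
            try:
                fixed[key] = float(value)
            except ValueError:
                fixed[key] = value
        else:
            fixed[key] = value
    return fixed


# A params block like this triggers the TypeError inside DeepSpeed,
# because "cycle_min_lr" is a string and float - str is unsupported.
params = {"cycle_min_lr": "5e-6", "cycle_max_lr": 6e-5}

fixed = coerce_scheduler_params(params)
delta = fixed["cycle_max_lr"] - fixed["cycle_min_lr"]  # subtraction now works
```

The simpler fix is usually just editing ds_config.json so every learning-rate field under `"scheduler" -> "params"` is a bare JSON number (e.g. `5e-6`, not `"5e-6"` or `"auto"`).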

Thanks!