justinpinkney / stable-diffusion


TypeError: Gradient accumulation supports only int and dict types

jianpingliu opened this issue · comments

I followed the fine-tuning instructions and got this error:

Merged modelckpt-cfg:
{'target': 'pytorch_lightning.callbacks.ModelCheckpoint', 'params': {'dirpath': 'logs/2022-10-05T06-00-26_pokemon/checkpoints', 'filename': '{epoch:06}', 'verbose': True, 'save_last': True, 'monitor': None, 'save_top_k': -1, 'every_n_train_steps': 2000}}
/usr/local/lib/python3.7/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py:433: UserWarning: ModelCheckpoint(save_last=True, save_top_k=None, monitor=None) is a redundant configuration. You can save the last checkpoint with ModelCheckpoint(save_top_k=None, monitor=None).
"ModelCheckpoint(save_last=True, save_top_k=None, monitor=None) is a redundant configuration."
ModelCheckpoint(save_last=True, save_top_k=-1, monitor=None) will duplicate the last checkpoint saved.
Traceback (most recent call last):
File "main.py", line 812, in
trainer = Trainer.from_argparse_args(trainer_opt, **trainer_kwargs)
File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/properties.py", line 421, in from_argparse_args
return from_argparse_args(cls, args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/utilities/argparse.py", line 52, in from_argparse_args
return cls(**trainer_kwargs)
File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/connectors/env_vars_connector.py", line 40, in insert_env_defaults
return fn(self, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/trainer.py", line 446, in init
terminate_on_nan,
File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/connectors/training_trick_connector.py", line 50, in on_trainer_init
self.configure_accumulated_gradients(accumulate_grad_batches)
File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/connectors/training_trick_connector.py", line 66, in configure_accumulated_gradients
raise TypeError("Gradient accumulation supports only int and dict types")
TypeError: Gradient accumulation supports only int and dict types

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "main.py", line 909, in
if trainer.global_rank == 0:
NameError: name 'trainer' is not defined

It looks like the call to main.py might be wrong, specifically this bit:

lightning.trainer.accumulate_grad_batches=1

which should be an integer greater than 0.
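
For reference, a minimal sketch (not from this repo) of what pytorch_lightning's Trainer accepts for this argument; passing None is what triggers the TypeError above:

from pytorch_lightning import Trainer

# An int applies a fixed accumulation factor for the whole run.
trainer = Trainer(accumulate_grad_batches=1)

# A dict maps a starting epoch to the accumulation factor used from that epoch on.
trainer = Trainer(accumulate_grad_batches={0: 1, 4: 2})

# Anything else (e.g. None, as in the traceback above) raises:
# TypeError: Gradient accumulation supports only int and dict types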

Thanks! It really helped.

Rather, I think the issue is that we're passing accumulate_grad_batches=None to Trainer.from_argparse_args. I fixed the above error by adding:

diff --git a/main.py b/main.py
index b21a775..c2a6e2f 100644
--- a/main.py
+++ b/main.py
@@ -835,6 +835,7 @@ if __name__ == "__main__":
             from pytorch_lightning.trainer.connectors.checkpoint_connector import CheckpointConnector
             setattr(CheckpointConnector, "hpc_resume_path", None)

+        trainer_kwargs['accumulate_grad_batches'] = 1
         trainer = Trainer.from_argparse_args(trainer_opt, **trainer_kwargs)
         trainer.logdir = logdir  ###
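
A slightly more defensive variant (a sketch against the same trainer_kwargs dict built earlier in main.py, not a tested patch) would only override the value when the config leaves it unset:

# Only fall back to 1 when no valid value was configured; pytorch_lightning
# accepts only int or dict for accumulate_grad_batches, so None must not be passed.
if trainer_kwargs.get("accumulate_grad_batches") is None:
    trainer_kwargs["accumulate_grad_batches"] = 1
trainer = Trainer.from_argparse_args(trainer_opt, **trainer_kwargs)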