Is it possible to resume training a .pkl file on the same kimg with a new dataset of pictures?

Question

nicolai256 opened this issue 2 years ago

Describe the bug
I tried this, but it fails with an error (see below).
When I resume from the same kimg with the original dataset of images, the error does not occur.
I checked whether all the images are 1024px, and they are.
Training seems to start but fails after the first tick.

input code
python train.py --cfg=stylegan3-t --data=C:\deepdream-test\stylegan3-fun\dataset22\images\1024.zip --aug=ada --augpipe=bg --target=0.7 --gpus=1 --batch=8 --batch-gpu=8 --mbstd-group=8 --gamma=6.6 --mirror=1 --kimg=25000 --snap=1 --metrics=none --resume=C:\deepdream-test\stylegan3-fun\training-runs\network-snapshot-005832.pkl --resume-kimg=5832

error code

Setting up augmentation...
Distributing across 1 GPUs...
Setting up training phases...
Exporting sample images...
Initializing logs...
Training for 25000 kimg...

tick 0     kimg 5832.0   time 1m 34s       sec/tick 20.5    sec/kimg 2557.87 maintenance 73.5   cpumem 4.52   gpumem 16.10  reserved 19.92  augment 0.000
Traceback (most recent call last):
  File "c:\deepdream-test\stylegan3-fun\train.py", line 324, in <module>
    main()  # pylint: disable=no-value-for-parameter
  File "C:\Users\Gebruiker\anaconda3\lib\site-packages\click\core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "C:\Users\Gebruiker\anaconda3\lib\site-packages\click\core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "C:\Users\Gebruiker\anaconda3\lib\site-packages\click\core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "C:\Users\Gebruiker\anaconda3\lib\site-packages\click\core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "c:\deepdream-test\stylegan3-fun\train.py", line 317, in main
    launch_training(c=c, desc=desc, outdir=opts.outdir, dry_run=opts.dry_run)
  File "c:\deepdream-test\stylegan3-fun\train.py", line 95, in launch_training
    subprocess_fn(rank=0, c=c, temp_dir=temp_dir)
  File "c:\deepdream-test\stylegan3-fun\train.py", line 50, in subprocess_fn
    training_loop.training_loop(rank=rank, **c)
  File "c:\deepdream-test\stylegan3-fun\training\training_loop.py", line 260, in training_loop
    phase_real_img, phase_real_c = next(training_set_iterator)
  File "C:\Users\Gebruiker\anaconda3\lib\site-packages\torch\utils\data\dataloader.py", line 521, in __next__
    data = self._next_data()
  File "C:\Users\Gebruiker\anaconda3\lib\site-packages\torch\utils\data\dataloader.py", line 1203, in _next_data
    return self._process_data(data)
  File "C:\Users\Gebruiker\anaconda3\lib\site-packages\torch\utils\data\dataloader.py", line 1229, in _process_data
    data.reraise()
  File "C:\Users\Gebruiker\anaconda3\lib\site-packages\torch\_utils.py", line 425, in reraise
    raise self.exc_type(msg)
AssertionError: Caught AssertionError in DataLoader worker process 1.
Original Traceback (most recent call last):
  File "C:\Users\Gebruiker\anaconda3\lib\site-packages\torch\utils\data\_utils\worker.py", line 287, in _worker_loop
    data = fetcher.fetch(index)
  File "C:\Users\Gebruiker\anaconda3\lib\site-packages\torch\utils\data\_utils\fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "C:\Users\Gebruiker\anaconda3\lib\site-packages\torch\utils\data\_utils\fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "c:\deepdream-test\stylegan3-fun\training\dataset.py", line 99, in __getitem__
    assert list(image.shape) == self.image_shape
AssertionError

Diego Porres · Answer 1 · Wed Mar 30 2022 01:49:30 GMT+0800 (China Standard Time)

Are you trying to start from your previous model? If I recall correctly, that one was at 512x512 resolution, so you won't be able to do that yet (it can be done, but it will take a bit of time to implement). Basically, you'll need to start from a 1024 model if your dataset is 1024x1024.

nicolai256 · Answer 2 · Wed Mar 30 2022 01:51:37 GMT+0800 (China Standard Time)

I had upscaled all the images. I thought all of them were 512px, so I upscaled them to 1024px to resume training on my 1024 model. I just checked them again, and some were already 1024px and had been upscaled to 2048px; that was the cause of the error.
It seems to be running fine now. Sorry for all the bother.
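
For anyone hitting the same `assert list(image.shape) == self.image_shape` failure, here is a minimal sketch of a check for mixed resolutions inside a dataset zip before training. It is not part of the repository: the names `png_size` and `find_size_mismatches` are hypothetical, and it assumes the images in the zip are stored as PNG files (other formats are skipped). It reads only the PNG header of each entry, so it needs nothing beyond the Python standard library.

```python
import struct
import zipfile
from collections import Counter

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_size(header):
    """Return (width, height) from the first 24 bytes of a PNG file.

    Returns None if the bytes do not look like a PNG header.
    """
    if len(header) < 24 or header[:8] != PNG_SIGNATURE or header[12:16] != b"IHDR":
        return None
    # Width and height are big-endian 32-bit integers at the start of IHDR.
    return struct.unpack(">II", header[16:24])


def find_size_mismatches(zip_path, expected=(1024, 1024)):
    """Count image sizes in a dataset zip and list entries that differ from `expected`.

    Assumption: only entries ending in .png are images; everything else is ignored.
    """
    sizes = Counter()
    mismatches = []
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            if not name.lower().endswith(".png"):
                continue
            with zf.open(name) as f:
                size = png_size(f.read(24))
            sizes[size] += 1
            if size != expected:
                mismatches.append((name, size))
    return sizes, mismatches
```

Calling `find_size_mismatches` on the zip passed to `--data`, with `expected=(1024, 1024)`, returns a count per size plus the list of offending files. In the case above, that list would have contained the images that ended up at 2048px.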