Cannot train on GPU

Question

ginacode opened this issue 5 years ago

When I run ProGAN with the GPU build of PyTorch, I get:

Starting the training process ...


Currently working on Depth:  0
Current resolution: 4 x 4

Epoch: 1
Traceback (most recent call last):
  File "progan.py", line 39, in <module>
    feedback_factor=2
  File "/scratch2/virtualenv/lib/python3.7/site-packages/pro_gan_pytorch/PRO_GAN.py", line 1046, in train
    labels, current_depth, alpha)
  File "/scratch2/virtualenv/lib/python3.7/site-packages/pro_gan_pytorch/PRO_GAN.py", line 865, in optimize_discriminator
    labels, depth, alpha)
  File "/scratch2/virtualenv/lib/python3.7/site-packages/pro_gan_pytorch/Losses.py", line 345, in dis_loss
    fake_out = self.dis(fake_samps, labels, height, alpha)
  File "/scratch2/virtualenv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/scratch2/virtualenv/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 152, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/scratch2/virtualenv/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 162, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/scratch2/virtualenv/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in parallel_apply
    raise output
  File "/scratch2/virtualenv/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 59, in _worker
    output = module(*input, **kwargs)
  File "/scratch2/virtualenv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/scratch2/virtualenv/lib/python3.7/site-packages/pro_gan_pytorch/PRO_GAN.py", line 305, in forward
    out = self.final_block(y, labels)
  File "/scratch2/virtualenv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/scratch2/virtualenv/lib/python3.7/site-packages/pro_gan_pytorch/CustomLayers.py", line 445, in forward
    labels = self.label_embedder(labels)  # [B x C]
  File "/scratch2/virtualenv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/scratch2/virtualenv/lib/python3.7/site-packages/torch/nn/modules/sparse.py", line 117, in forward
    self.norm_type, self.scale_grad_by_freq, self.sparse)
  File "/scratch2/virtualenv/lib/python3.7/site-packages/torch/nn/functional.py", line 1506, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: diff_view_meta->output_nr_ == 0 ASSERT FAILED at /pytorch/torch/csrc/autograd/variable.cpp:209, please report a bug to PyTorch.

With the CPU-only build of PyTorch it runs, but very slowly. Any idea what could be causing this, and is there a way to get it working with GPU support?

ginacode · Answer 1 · Tue Sep 03 2019 09:37:40 GMT+0800 (China Standard Time)

This is the code I am using, by the way. I am trying to train on 1024x512 images.

import torch as th
import pro_gan_pytorch.PRO_GAN as pg
import matplotlib.pyplot as plt
import os
from torchvision import datasets, transforms
from PIL import Image, ImageChops


device = th.device("cuda" if th.cuda.is_available() else "cpu")

def setup_data():
    dataset = datasets.ImageFolder(
        root = 'total_intensity/',
        transform = transforms.Compose([
            transforms.Resize((512,512)),
            transforms.ToTensor(),
            transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)) ]))
    return dataset


if __name__ == '__main__':
    depth = 8
    num_epochs = [50, 50, 50, 50, 50, 50, 50, 50]
    fade_ins = [50, 50, 50, 50, 50, 50, 50, 50]
    batch_sizes = [32, 32, 32, 32, 32, 32, 32, 32]
    latent_size = 512


    dataset = setup_data()

    pro_gan = pg.ConditionalProGAN(num_classes=1, depth=depth, 
                                   latent_size=latent_size, device=device)

    pro_gan.train(
        dataset=dataset,
        epochs=num_epochs,
        fade_in_percentage=fade_ins,
        batch_sizes=batch_sizes,
        feedback_factor=2
    )

Animesh Karnewar · Answer 2 · Sun Sep 08 2019 18:31:27 GMT+0800 (China Standard Time)

@ginacode,

Unfortunately, the network architecture doesn't support non-square images such as the 1024 x 512 ones you are using. Could you try padding the second dimension to 1024, so that the images are square with a side length that is a power of 2 greater than 4?
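The padding suggested above can be worked out with a small helper. This is only a sketch: `next_power_of_two` and `square_padding` are hypothetical names, not part of pro_gan_pytorch, and it assumes that 1024 x 512 means width x height. The returned tuple is in (left, top, right, bottom) order, which is the order `torchvision.transforms.Pad` accepts for a 4-element padding.

```python
def next_power_of_two(n: int) -> int:
    """Smallest power of two that is >= n (hypothetical helper)."""
    if n < 1:
        raise ValueError("n must be positive")
    return 1 << (n - 1).bit_length()


def square_padding(width: int, height: int) -> tuple:
    """Padding (left, top, right, bottom) that makes the image square,
    with a side length that is a power of two (hypothetical helper)."""
    side = next_power_of_two(max(width, height))
    pad_w = side - width
    pad_h = side - height
    # Split the padding as evenly as possible between the two sides.
    left, top = pad_w // 2, pad_h // 2
    return (left, top, pad_w - left, pad_h - top)


# Assumption: the images are 1024 wide and 512 high.
print(square_padding(1024, 512))  # (0, 256, 0, 256)
```

If every image in the dataset has the same size, the tuple could be passed once to a padding transform placed before `ToTensor()` in the existing `transforms.Compose`.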

Please let me know if you have any other problems.

cheers 🍻!
@akanimax

ginacode · Answer 3 · Fri Sep 20 2019 20:57:30 GMT+0800 (China Standard Time)

The images should already be resized to 512 x 512 before they reach progan (see the transforms.Resize call in setup_data()).
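One way to sanity-check that the resize and the `depth` setting agree is sketched below. It assumes the resolution starts at 4 x 4 for depth index 0 (as the log above shows) and doubles at each further depth, which is the usual progressive-growing scheme but is not confirmed anywhere in this thread. `final_resolution` and `config_matches` are hypothetical names, not library functions.

```python
def final_resolution(depth: int, base: int = 4) -> int:
    """Side length at the last stage, assuming a base x base start
    and a doubling at every subsequent depth (assumption)."""
    if depth < 1:
        raise ValueError("depth must be at least 1")
    return base * 2 ** (depth - 1)


def config_matches(depth: int, image_size: tuple) -> bool:
    """True if the images are square and match the final resolution."""
    width, height = image_size
    return width == height == final_resolution(depth)


# Values taken from the script in this thread: depth = 8, Resize((512, 512)).
print(final_resolution(8))             # 512
print(config_matches(8, (512, 512)))   # True
print(config_matches(8, (1024, 512)))  # False
```

Under that assumption, depth = 8 and 512 x 512 inputs are consistent with each other, so the image shape alone may not explain the error.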