Training on Google Colab immediately stops
szymek1 opened this issue · comments
Hello,
I'm constantly facing this issue:
Whichever model I try to train, it stops immediately after starting. I set up an environment which I believe should be fine:
- pytorch 1.0.1, torchvision 0.2.1 (I also tested torchvision 0.2.2 and pytorch 1.3.0 without any success)
I set the batch size to 1, because I thought the problem might be a batch size that is too big, since I only have 1 GPU.
My guess is that the NVIDIA drivers, CUDA, and cuDNN on the Colab VM are much newer than what was used back in 2019. Nevertheless, here is my configuration as well as the outcome. Please help me out!
MODE: 1 # 1: train, 2: test, 3: eval
MODEL: 2 # 1: edge model, 2: inpaint model, 3: edge-inpaint model, 4: joint model
MASK: 3 # 1: random block, 2: half, 3: external, 4: (external, random block), 5: (external, random block, half)
EDGE: 1 # 1: canny, 2: external
NMS: 1 # 0: no non-max-suppression, 1: applies non-max-suppression on the external edges by multiplying by Canny
SEED: 10 # random seed
GPU: [0] # list of gpu ids
DEBUG: 0 # turns on debugging mode
VERBOSE: 1 # turns on verbose mode in the output console
TRAIN_FLIST: xxxx
VAL_FLIST: xxxx
TEST_FLIST: xxxx
TRAIN_EDGE_FLIST: ./datasets/places2_edges_train.flist
VAL_EDGE_FLIST: ./datasets/places2_edges_val.flist
TEST_EDGE_FLIST: ./datasets/places2_edges_test.flist
TRAIN_MASK_FLIST: xxxx
VAL_MASK_FLIST: xxxx
TEST_MASK_FLIST: xxxx
LR: 0.001 # learning rate
D2G_LR: 0.1 # discriminator/generator learning rate ratio
BETA1: 0.0 # adam optimizer beta1
BETA2: 0.9 # adam optimizer beta2
BATCH_SIZE: 1 # input batch size for training
INPUT_SIZE: 256 # input image size for training, 256 for original size
SIGMA: 2 # standard deviation of the Gaussian filter used in Canny edge detector (0: random, -1: no edge)
MAX_ITERS: 2 # maximum number of iterations to train the model
EDGE_THRESHOLD: 0.5 # edge detection threshold
L1_LOSS_WEIGHT: 1 # l1 loss weight
FM_LOSS_WEIGHT: 10 # feature-matching loss weight
STYLE_LOSS_WEIGHT: 250 # style loss weight
CONTENT_LOSS_WEIGHT: 0.1 # perceptual loss weight
INPAINT_ADV_LOSS_WEIGHT: 0.1 # adversarial loss weight
GAN_LOSS: nsgan # nsgan | lsgan | hinge
GAN_POOL_SIZE: 0 # fake images pool size
SAVE_INTERVAL: 2 # how many iterations to wait before saving model (0: never)
SAMPLE_INTERVAL: 2 # how many iterations to wait before sampling (0: never)
SAMPLE_SIZE: 24 # number of images to sample
EVAL_INTERVAL: 2 # how many iterations to wait before model evaluation (0: never)
LOG_INTERVAL: 1 # how many iterations to wait before logging training status (0: never)
start training...
Training epoch: 1
End training....
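One thing worth double-checking in the config above is `MAX_ITERS: 2` (along with the other `*_INTERVAL` values set to 2). In training loops that are bounded by a maximum iteration count, a tiny `MAX_ITERS` makes training exit after a couple of optimizer steps, which looks exactly like it "immediately stops". A minimal sketch (my own simplified loop, not the repo's actual code) of that behavior:

```python
# Simplified sketch of an iteration-bounded training loop.
# With max_iters=2 the loop exits after just two batches,
# printing "start" and "end" almost back to back.
def train(loader, max_iters):
    iteration = 0
    keep_training = True
    while keep_training:
        for batch in loader:          # one optimizer step per batch
            iteration += 1
            if iteration >= max_iters:
                keep_training = False  # hit MAX_ITERS: stop training
                break
    return iteration

train(range(100), max_iters=2)  # returns 2: stops after two steps
```

If you reduced `MAX_ITERS` while debugging, try restoring it to a large value (the shipped configs use values in the millions) and see whether training keeps going.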
Have you solved this problem? I think issue #54 describes the same problem as yours; you can check it out.
BTW, I'm trying to train on my own dataset as well, and I'm confused about the edge .flist files (i.e. the ones you used in your config): I'm not sure which data I should use in each training stage. Could you please share some tips?
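As far as I can tell, a `.flist` file is just a plain text file with one image path per line; the `*_FLIST` entries in the config point at such files. A hypothetical helper (the function name and parameters are mine, not from the repo) for building one from a directory of images:

```python
import os

def build_flist(image_dir, out_path, exts=('.png', '.jpg', '.jpeg')):
    """Write a .flist file: one image path per line, sorted.

    image_dir -- directory to scan recursively for images
    out_path  -- where to write the resulting .flist text file
    """
    paths = []
    for root, _, files in os.walk(image_dir):
        for name in sorted(files):
            if name.lower().endswith(exts):  # keep only image files
                paths.append(os.path.join(root, name))
    paths.sort()
    with open(out_path, 'w') as f:
        f.write('\n'.join(paths))
    return paths
```

You would run it once per split (train/val/test), pointing `image_dir` at the corresponding image, edge-map, or mask folder and `out_path` at the file named in the config.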