Output heatmap is zeros

Xonxt opened this issue

Hi, I'm trying to train a network similar to OpenPose, but with only two additional stages, no PAF branch, a depth image (from a RealSense camera) as input, and only 3 keypoints (I only need both hands and the face for my application, so the output is a (batch, h/8, w/8, 3) stack). The training is done with Keras. I only have about 5000 training images, but I apply extensive augmentation. The ground truth consists of appropriately scaled heatmaps with Gaussian peaks at the keypoint locations.
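For readers unfamiliar with this kind of target, here is a minimal sketch of how such ground-truth heatmaps could be built with NumPy. It is not the poster's code: the function name `make_heatmaps`, the 640x480 frame size, the keypoint coordinates, and `sigma=2.0` are all assumptions for illustration; only the stride of 8 and the three channels come from the description above.

```python
import numpy as np

def make_heatmaps(keypoints, height, width, stride=8, sigma=2.0):
    """Build one Gaussian heatmap per keypoint at 1/stride resolution.

    keypoints: list of (x, y) pixel coordinates in the input image,
               or None for a keypoint that is not visible.
    Returns an array of shape (height // stride, width // stride, len(keypoints)).
    """
    out_h, out_w = height // stride, width // stride
    # Row (y) and column (x) coordinate grids of the output map
    ys, xs = np.mgrid[0:out_h, 0:out_w]
    heatmaps = np.zeros((out_h, out_w, len(keypoints)), dtype=np.float32)
    for channel, point in enumerate(keypoints):
        if point is None:
            continue  # leave the channel at zero for a missing keypoint
        # Scale the keypoint from input pixels to output-map coordinates
        cx, cy = point[0] / stride, point[1] / stride
        dist_sq = (xs - cx) ** 2 + (ys - cy) ** 2
        # sigma is measured in output-map pixels (assumed value)
        heatmaps[:, :, channel] = np.exp(-dist_sq / (2.0 * sigma ** 2))
    return heatmaps

# Hypothetical example: two hands and a face in a 640x480 depth frame
target = make_heatmaps([(160, 240), (480, 240), (320, 80)], height=480, width=640)
print(target.shape)
```

Each channel peaks at 1.0 on its keypoint and falls off with distance; `sigma` controls how wide the peak is.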

I notice that the output heatmap is basically just a matrix of zeros after the first 2-3 iterations, with nothing on it. So far I've only had the patience to let it run for about 1500 iterations (more than a day) on two GPUs; the loss stays around the same value the whole time and the output remains all zeros.

Do you think I might be doing something fundamentally wrong, or do I just have to have enough patience to wait for 300'000+ iterations just like in your original implementation?

By the way, when you talk about iterations, do you mean epochs or epochs*steps_per_epoch?

Hi, I am facing a similar issue with almost the same setting as yours. Did you solve the problem after that?

Yeah, I switched to the SGD optimizer (with an initial learning rate of 2e-5 and a ReduceLROnPlateau callback), made the Gaussian peaks for the keypoints a bit larger, added many more training samples, and let the training run longer. I was already seeing acceptable results after a thousand epochs.
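One plausible reading of why larger peaks help (an assumption, not something verified in this thread) is that with very narrow peaks an all-zero prediction already scores a low loss, so there is little pressure to move away from it. The sketch below applies the same half-sum-of-squares formula used in the settings, in NumPy, to an all-zero output for several peak widths. The map size, peak position, and `sigma` values are hypothetical.

```python
import numpy as np

def heat_loss(y_true, y_pred):
    # Half the summed squared error, the same formula as the thread's loss
    return np.sum(np.square(y_true - y_pred)) / 2

def gaussian_map(height, width, cx, cy, sigma):
    # Single-channel heatmap with one Gaussian peak at (cx, cy)
    ys, xs = np.mgrid[0:height, 0:width]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))

zero_losses = {}
for sigma in (1.0, 2.0, 4.0):  # hypothetical peak widths, in output pixels
    target = gaussian_map(60, 80, cx=40, cy=30, sigma=sigma)
    # Loss the network would get by predicting zeros everywhere
    zero_losses[sigma] = heat_loss(target, np.zeros_like(target))
    active = np.mean(target > 0.1)  # share of pixels carrying visible signal
    print(f"sigma={sigma}: all-zero loss={zero_losses[sigma]:.2f}, "
          f"pixels above 0.1={active:.4f}")
```

The all-zero loss grows roughly with the square of `sigma`, so wider peaks penalise a blank output more heavily.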

My settings:

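from keras import backend as K
from keras.optimizers import SGD
from keras.callbacks import ReduceLROnPlateau

# optimizer (newer Keras versions spell `lr` as `learning_rate`):
sgd = SGD(lr=2e-05, decay=0.0, momentum=0.9, nesterov=False)

# loss function: half the summed squared error between the two heatmap stacks
def _heat_loss(y_true, y_pred):
    return K.sum(K.square(y_true - y_pred)) / 2

# compiling model (`model` is the network built beforehand):
model.compile(optimizer=sgd, loss=_heat_loss)

# callback (`gamma` is the learning-rate reduction factor; its value is not given here):
reduce_learning_rate = ReduceLROnPlateau(monitor='loss', factor=gamma, patience=50, verbose=1)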
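To show what the plateau callback does to the learning rate, here is a simplified pure-Python sketch of the reduce-on-plateau idea. It is not Keras's implementation (which also has cooldown and mode options). The function name `simulate_reduce_on_plateau`, the loss curve, `factor=0.5`, and `patience=3` are hypothetical, chosen to keep the example short; the settings above use `patience=50` and an unspecified `gamma`.

```python
def simulate_reduce_on_plateau(losses, initial_lr, factor, patience,
                               min_delta=1e-4, min_lr=0.0):
    """Return the learning rate in effect after each epoch.

    The rate is multiplied by `factor` whenever the loss has failed to
    improve by more than `min_delta` for `patience` consecutive epochs.
    """
    lr = initial_lr
    best = float("inf")
    wait = 0
    history = []
    for loss in losses:
        if loss < best - min_delta:
            # Real improvement: remember it and reset the counter
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                # Plateau detected: shrink the rate, never below min_lr
                lr = max(lr * factor, min_lr)
                wait = 0
        history.append(lr)
    return history

# Hypothetical loss curve: improves for 5 epochs, then stalls for 6
losses = [10.0, 8.0, 6.0, 5.0, 4.5] + [4.5] * 6
lrs = simulate_reduce_on_plateau(losses, initial_lr=2e-5, factor=0.5, patience=3)
for epoch, (loss, lr) in enumerate(zip(losses, lrs)):
    print(epoch, loss, lr)
```

In actual training the callback object is passed to the model's fit call through its `callbacks` argument.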

Hope that helps.

Hi @Xonxt,

Thank you so much! You saved my day.

I switched from Adam to SGD and reduced the kernel size from 7 to 3 in all convolutional layers. Everything started to work.