keras-team / keras

Deep Learning for humans

Home Page: http://keras.io/

Training capability much worse than 2.0.5

AndrasEros opened this issue · comments

Please make sure that the boxes below are checked before you submit your issue. If your issue is an implementation question, please ask your question on StackOverflow or join the Keras Slack channel and ask there instead of filing a GitHub issue.

Thank you!

  • Check that you are up-to-date with the master branch of Keras. You can update with:
    pip install git+git://github.com/fchollet/keras.git --upgrade --no-deps

  • If running on TensorFlow, check that you are up-to-date with the latest version. The installation instructions can be found here.

  • If running on Theano, check that you are up-to-date with the master branch of Theano. You can update with:
    pip install git+git://github.com/Theano/Theano.git --upgrade --no-deps

  • Provide a link to a GitHub Gist of a Python script that can reproduce your issue (or just copy the script here if it is short).

I happened to re-run a model that I trained some months ago and was surprised that I could not train it to the same level as before. I did the following investigation using different versions of Keras and TensorFlow. To better interpret the results, note that Y is normalized between -1 and +1.

Model used:

from nntools import helper
import numpy as np
import random
import win32gui
import win32con


from keras.layers import Input, LSTM, Dense, concatenate, BatchNormalization
from keras.models import Model
from keras import optimizers
from keras.callbacks import EarlyStopping, CSVLogger, ReduceLROnPlateau
from keras.utils import plot_model
import os
from phased_lstm_keras.PhasedLSTM import PhasedLSTM as PLSTM

import tensorflow as tf

os.environ["PATH"] += os.pathsep + 'C:/Program Files (x86)/Graphviz2.38/bin/'

hwnd = win32gui.GetForegroundWindow()
win32gui.ShowWindow(hwnd, win32con.SW_MAXIMIZE)

timesteps = 40
holdout_percentage = 0.05  #Not used now
pretrain_epochs = 40
#early_stopping_patiente = 50
datafile = "C:/Data/data40_new3.csv"

xEMA_,y_ = helper.rnn_csv_toXY(datafile,timesteps,["P","ATR"],"T1",False)

adadelta_EMA = optimizers.adadelta()
adam_EMA = optimizers.adam()
sgd_EMA = optimizers.SGD(lr=0.01, decay=4e-5, momentum=0.9, nesterov=False)  #LSTM
sgd_EMA_PLSTM = optimizers.SGD(lr=0.01, decay=4e-5, momentum=0.2, nesterov=False)  #PLSTM
reduce_lr_EMA = ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=3, verbose=1, cooldown=1, min_lr=0.005)  #For SGD
#im = IncreaseMomentum(step=0.2, max_momentum=0.7)
#early_stopping_EMA = EarlyStopping(monitor='val_loss', min_delta=0.0001,
#                                   patience=early_stopping_patiente, mode='auto')
with tf.device('/gpu:0'):
    ema_in = Input(name='ema_in', shape=(xEMA_.shape[1],xEMA_.shape[2]))
    ema_in_BN = BatchNormalization()(ema_in)
    ema_lstm1 = LSTM(1200, name='ema_lstm1', implementation=2, return_sequences=True)(ema_in_BN)
    ema_lstm1_BN = BatchNormalization()(ema_lstm1)
    ema_lstm2 = LSTM(1200, name='ema_lstm2', implementation=2, return_sequences=False)(ema_lstm1_BN)
    ema_lstm2_BN = BatchNormalization()(ema_lstm2)
    ema_dense1 = Dense(2400, name='ema_dense1', activation='tanh')(ema_lstm2_BN)
    ema_dense1_BN = BatchNormalization()(ema_dense1)
    ema_dense2 = Dense(1200, name='ema_dense2', activation='tanh')(ema_dense1_BN)
    ema_dense2_BN = BatchNormalization()(ema_dense2)
    ema_dense3 = Dense(600, name='ema_dense3', activation='tanh')(ema_dense2_BN)
    ema_dense3_BN = BatchNormalization()(ema_dense3)
    ema_output = Dense(1, name='ema_output', activation='tanh')(ema_dense3_BN)

    ema_model = Model(inputs=[ema_in], outputs=[ema_output])
    reduce_lr_M_adadelta = ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=3, verbose=1, cooldown=1, min_lr=0.5)  #For adadelta
    reduce_lr_M = ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=3, verbose=1, cooldown=1, min_lr=0.005)  #For SGD
    early_stopping_M = EarlyStopping(monitor='val_loss', min_delta=0.001, patience=40, mode='auto')
    csv_logger = CSVLogger('training_log3_209_LSTM_SGD.csv')

    ema_model.compile(optimizer=sgd_EMA,
              loss={'ema_output': 'mean_squared_error'}, metrics=['mae'])
    #ema_model.compile(optimizer=adadelta_EMA,
    #          loss={'ema_output': 'mean_squared_error'}, metrics=['mae'])
    print("Train EMA on GPU0...")
    ema_model.fit({'ema_in': xEMA_},
          {'ema_output': y_},
          epochs=500, batch_size=40, validation_split=0.1,
          callbacks=[csv_logger, reduce_lr_M, early_stopping_M])

Results with Keras 2.0.9 and TF 1.4.0:

epoch,loss,mean_absolute_error,val_loss,val_mean_absolute_error
0,1.05968262962,0.985275764846,1.10819295021,1.03706740068
1,0.979859521839,0.939838988512,1.0076873856,0.979688630307
2,0.946857691283,0.913535889255,0.750179696942,0.825602595291
3,0.762938497789,0.785929248137,1.08659711466,1.02602888577
4,0.501235865148,0.589149763212,0.132253268072,0.307935305328
5,0.2807936117,0.413878542905,0.114725752961,0.292098902317
6,0.139581092516,0.291962343971,0.070717339354,0.21767100456
7,0.114901034081,0.263645202189,0.0706408966877,0.217224851168
8,0.101012383458,0.245739355278,0.0468780662885,0.176116148001
9,0.0997591257731,0.244218098339,0.0869001861954,0.24957025293
10,0.0957704882136,0.238081805803,0.0793308527111,0.237788404863
11,0.0917768303432,0.232730614943,0.0355999692114,0.154031230826
12,0.0880096098064,0.227360123548,0.0304724577584,0.140374521908
13,0.0856908053515,0.223671286127,0.0345248061943,0.150562275536
14,0.0844831893862,0.22217891928,0.0360884426361,0.154798901412
15,0.0844001181336,0.221470015169,0.030373785525,0.140394812572

The loss does not decrease below 0.08 even after training for much longer.

Results with Keras 2.0.5 and TF 1.2.0:

epoch,loss,mean_absolute_error,val_loss,val_mean_absolute_error
0,0.674994680866,0.787532132172,0.210298314454,0.401246211335
1,0.50003129797,0.640251753075,0.186895873197,0.360440107238
2,0.0864018025559,0.23287392271,0.0184035980327,0.112992629679
3,0.0415469702159,0.159678566147,0.0345238949807,0.166836664362
4,0.0323593134869,0.141381580302,0.0117207404594,0.0815250731591
5,0.0255187416594,0.125329670333,0.00788067547078,0.0707015042412
6,0.0199052717129,0.110094056107,0.0146587090298,0.0996316193543
7,0.0194325417159,0.108653233154,0.022364237375,0.12307138551
8,0.0179296625274,0.103960103221,0.0104151797266,0.0821187130446
9,0.0156120462513,0.0965747672158,0.00753274433028,0.0676462595845

I have run these a few times and the results always show the same difference between the two versions. It looks clear that the older version trains faster and converges to a much lower loss: 2.0.5 achieves an MSE of 0.015 after epoch 9, while 2.0.9 is still at 0.08 after epoch 15. Version 2.0.9 also never reaches an MSE of 0.015; it gets stuck around 0.08, lagging significantly behind the earlier version. I tried to trace back when the change was introduced between the two versions, so I ran:

Results with Keras 2.0.6 and TF 1.2.0

epoch,loss,mean_absolute_error,val_loss,val_mean_absolute_error
0,1.01219834391,0.958042641546,0.323887029367,0.507626796452
1,0.987153389447,0.947846812448,0.991314166002,0.972727180805
2,0.913755040475,0.895092398818,0.960553074588,0.954757456198
3,0.887703684235,0.877553070937,0.319060129579,0.459220583437
4,0.42077836859,0.534649828892,0.427964422614,0.602694774897
5,0.166042508874,0.321681825808,0.0520719601093,0.188639527333
6,0.130638394979,0.283327733159,0.0684811605399,0.21664535411
7,0.111096850008,0.259380427971,0.0437811681681,0.170154755492
8,0.105542496341,0.25173293699,0.0437140094443,0.16746392877
9,0.0990599608695,0.243373310091,0.0838070973706,0.238413533012
10,0.0927036643884,0.234348856499,0.0374346287566,0.158595657002
11,0.0888211610787,0.228353776273,0.0463573170836,0.176301127744
12,0.0889625790833,0.228434735812,0.0604750635987,0.202052195814
13,0.0873379450592,0.226243785039,0.0420134068426,0.164519093145
14,0.0842049010805,0.221496316829,0.0536444546858,0.186482944605
15,0.0756696959965,0.20801132382,0.0432103228089,0.169859484031
16,0.0757876485275,0.208129870598,0.0326403788479,0.145170186645
17,0.075430783716,0.207450285694,0.0314801485713,0.143648955088
18,0.0749894291144,0.206845118805,0.0369722390038,0.154727127577

It appears that upgrading Keras from 2.0.5 to 2.0.6 causes the drop in training efficiency, so it is likely that the problem was introduced with 2.0.6 and has been there ever since. I ran all tests in the same environment with the same data; the only change was upgrading/downgrading Keras and TF.
Can someone check with another model whether 2.0.5 really is that much better?
Is there perhaps a known issue? I could not find one by searching.

commented

Out of curiosity, could you try with TF itself as a baseline and/or try with another backend (if installing it isn't too complicated for you)?
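
For the backend check: Keras picks the backend from the KERAS_BACKEND environment variable (or ~/.keras/keras.json), so a minimal way to try Theano instead of TensorFlow, assuming Theano is installed, would be:

import os
os.environ["KERAS_BACKEND"] = "theano"  # must be set before the first keras import
import keras  # should print "Using Theano backend." if the switch took effect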

Do you also observe a difference if you use GRU instead of LSTM?
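
For the GRU check, in the model above this should be close to a drop-in swap (a sketch, assuming the rest of the script stays unchanged):

from keras.layers import GRU
# Replace the two LSTM layers with GRU layers of the same width
ema_gru1 = GRU(1200, name='ema_gru1', implementation=2, return_sequences=True)(ema_in_BN)
ema_gru2 = GRU(1200, name='ema_gru2', implementation=2, return_sequences=False)(BatchNormalization()(ema_gru1))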

Do you also observe a difference if you use tf.keras in TF 1.4 instead of PyPI Keras?
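
For the tf.keras check: TF 1.4 bundles Keras as tensorflow.keras. A cut-down sketch of the same kind of model (not the full script; the input shape of (40, 2) is an assumption based on timesteps=40 and the two features "P" and "ATR"):

from tensorflow import keras  # tf.keras, shipped with TF 1.4

inp = keras.layers.Input(shape=(40, 2))           # assumed (timesteps, features)
x = keras.layers.LSTM(1200, return_sequences=True)(inp)
x = keras.layers.LSTM(1200)(x)
out = keras.layers.Dense(1, activation='tanh')(x)

model = keras.models.Model(inputs=inp, outputs=out)
model.compile(optimizer=keras.optimizers.SGD(lr=0.01, momentum=0.9, decay=4e-5),
              loss='mean_squared_error', metrics=['mae'])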

I faced a similar situation with 2.0.0 (TF 1.1, CUDA 8, cuDNN 5.1) vs 2.0.9 (TF 1.4, CUDA 8, cuDNN 6).
With so many differences I did not have time to investigate further.

So I am wondering whether the error averaging across the batches per epoch has changed somehow.
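
One way to check this (a sketch, not something I have run): recompute the loss on a fixed set at the end of every epoch with a callback, bypassing the running per-batch average that Keras prints during training. The class name FullSetLoss is made up; x and y are your own arrays.

from keras.callbacks import Callback

class FullSetLoss(Callback):
    """Evaluate on a fixed set after each epoch instead of trusting
    the running per-batch average shown in the progress bar / CSV log."""
    def __init__(self, x, y):
        super(FullSetLoss, self).__init__()
        self.x, self.y = x, y

    def on_epoch_end(self, epoch, logs=None):
        score = self.model.evaluate(self.x, self.y, verbose=0)
        print("epoch %d full-set [loss, mae]: %s" % (epoch, score))

If the numbers from this callback match across versions while the logged loss differs, the change is only in how the displayed loss is averaged.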

Another issue with Keras 2.1.1: when I use fit_generator with batch_size=16 and steps_per_epoch=348 (more than the actual number of samples, i.e. 174), it ends the epoch at batch 11/384 and starts a new epoch. I am not sure why it breaks the epoch; this did not happen with 2.0.9 and older. Clearly the 11th batch will have fewer than 16 samples, but I don't see why that should cause a problem.
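
For context, fit_generator expects a generator that loops over the data indefinitely and pulls steps_per_epoch batches from it per epoch; a generator that stops is one common cause of an epoch ending early. A minimal sketch of an endless generator, assuming in-memory arrays x and y:

import numpy as np

def endless_batches(x, y, batch_size=16):
    """Yield random batches forever so that steps_per_epoch alone
    decides when an epoch ends."""
    n = len(x)
    while True:
        idx = np.random.randint(0, n, size=batch_size)
        yield x[idx], y[idx]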

While further investigating this issue I started getting NaN as the loss when I tried different optimizers. To deal with that I uninstalled everything NVIDIA on my machine and did a clean install, and I also reinstalled Keras (now 2.1.2) and TF (1.4). I also removed all the BatchNorm layers and the NaNs are gone; however, the model still fails to train with Adam but works very well with Adadelta, as expected (OK, this is probably specific to my data/problem). Now I'm after the BatchNorm issue.
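
One thing still worth trying for the Adam runs (just an idea, not a confirmed fix for this issue) is a smaller learning rate combined with gradient clipping, which Keras optimizers expose via clipnorm/clipvalue:

from keras import optimizers

# Hypothetical settings to try, not a confirmed fix: lower learning rate
# plus gradient-norm clipping; pass this to model.compile(optimizer=...)
adam_safe = optimizers.Adam(lr=1e-4, clipnorm=1.0)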

Hi, I'm having the same problem. Are there any updates?

My project is music classification and uses Keras v2.0.4 + TF v1.1.0, which are quite outdated.
So I upgraded them to Keras v2.1.3 + TF v1.4.1, but the performance (ROC-AUC) got much worse.

Therefore, I investigated by decreasing the Keras version from v2.1.3 down to v2.0.5 (with the TF version fixed at 1.1.0).
It turns out that the performance is bad from v2.1.3 down to v2.0.6 and becomes fine at v2.0.5.
I guess something went wrong when the version increased to v2.0.6.

I read the release notes and compared the changes to figure out what's wrong, but I couldn't find it!

Do you have any ideas about what could possibly be the problem?
It isn't fixed in the latest version, so we should figure it out together!

Thanks!

@tae-jun
We must have something in common. It seems most users didn't notice anything; it's only a few of us. Why?
Can you please share your code that behaves differently with different versions?
Can you share your hardware setup?
Can you share your software setup?

@AndrasEros Thanks for your quick response!

https://github.com/tae-jun/sample-cnn
This is the project I'm working on! Its task is music classification. The CNN architecture is HERE.

The symptom is so obvious even after only ONE epoch. Below is the training history for each version.

Keras 2.0.5

Epoch 1/100
6631/6631 [==============================] - 1139s - loss: 0.2004 - val_loss: 0.1710
Epoch 2/100
6631/6631 [==============================] - 1133s - loss: 0.1710 - val_loss: 0.1602
Epoch 3/100
6631/6631 [==============================] - 1135s - loss: 0.1616 - val_loss: 0.1571
Epoch 4/100
6631/6631 [==============================] - 1135s - loss: 0.1566 - val_loss: 0.1535

Keras 2.0.6

Epoch 1/100
6631/6631 [==============================] - 1159s - loss: 0.2083 - val_loss: 0.1825
Epoch 2/100
6631/6631 [==============================] - 1150s - loss: 0.1824 - val_loss: 0.1736
Epoch 3/100
6631/6631 [==============================] - 1147s - loss: 0.1726 - val_loss: 0.1601
Epoch 4/100
6631/6631 [==============================] - 1149s - loss: 0.1666 - val_loss: 0.1583

Hardware Setup

  • GTX 1080Ti x2
    (Which information could be helpful?)

Software Setup

  • CentOS 7.3
  • CUDA 8 / CuDNN 6
  • Anaconda 4.3.24 (with Python 3.5)
  • TensorFlow 1.1.0

Please ask me for any other information you need! Thanks 😄

Very interesting!
Similarities that matter:

  • I have the same hardware, 2x GTX 1080Ti
  • We both use BatchNormalization, which was touched in Keras 2.0.6

Differences:

  • I'm on Windows 10
  • I have Anaconda 4.2.0 (64 bit)

Now I'm just brainstorming what we can check:

  • I'm not sure the Windows and Linux drivers are related, but we can check the GPU driver. Mine is 388.13; I upgraded it at the same time as I upgraded Keras to 2.0.6. We can search open issues or try to downgrade.
  • We both have a multi-GPU environment, and Keras has multi-GPU processing called from fit_generator that was touched in 2.0.6. We should both try hiding one of the GPUs so TF and Keras can't see it; a sketch of one way to do this follows this list.
  • Remove BatchNorm completely from our models, don't even import it, and try 2.0.5 vs. 2.0.6 again.
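
A sketch of one way to hide the second GPU (assuming the process should only see GPU 0; the variable must be set before TensorFlow initializes):

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # expose only the first GPU to this process

import tensorflow as tf  # imported after the variable is set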

Additional ideas are welcome from anyone!

  • My GPU driver version is 381.22
  • I have 2 GPUs but I use only one for training, so I guess that's not the reason 😥
  • I didn't know that BatchNorm was touched in 2.0.6! I should remove the BatchNorms and compare performance. I will let you know the result! 😄

How long do your train-eval loops take? If they are short(ish), you could use git bisect to find the problematic commit.