davidADSP / GDL_code

The official code repository for examples in the O'Reilly book 'Generative Deep Learning'

Help trying to adapt attention model to my data.

thephet opened this issue · comments

My code is almost a copy-paste of the attention model. The original code and data work fine, but when I tweak it slightly for my own data, it doesn't.

While the original code works with music notation, my data consists of very small images (5 by 5 pixels), with values already scaled between 0 and 1.

My input has shape (257000, 240, 50): each sequence is 240 steps long, and at each step I concatenate and flatten two 5x5 images to get 50 values (I know this is not the best strategy, but this is only a first try). The output has shape (257000, 25), i.e. just one of the images. The idea is to feed in sequences of image pairs and predict the next image. This setup works well, and produces nice results, with stacked LSTMs.
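For concreteness, here is a minimal sketch of how such a dataset could be assembled. The array names, the random data, and the windowing scheme are my assumptions, not code from this issue:

import numpy as np

# Hypothetical: frames_a and frames_b are two aligned image streams,
# each of shape (n_frames, 5, 5), with values already in [0, 1].
n_frames, seq_len = 1000, 240
frames_a = np.random.rand(n_frames, 5, 5)
frames_b = np.random.rand(n_frames, 5, 5)

# Concatenate and flatten each pair of 5x5 images into 50 values.
pairs = np.concatenate(
    [frames_a.reshape(n_frames, 25), frames_b.reshape(n_frames, 25)], axis=1
)  # shape (n_frames, 50)

# Slide a window of 240 steps; the target is the next frame of one stream.
X = np.stack([pairs[i:i + seq_len] for i in range(n_frames - seq_len)])
y = frames_a.reshape(n_frames, 25)[seq_len:]
print(X.shape, y.shape)  # (760, 240, 50) and (760, 25)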

My code for attention, following the link before, is as follows:

from keras.layers import (Input, Dense, LSTM, Reshape, Activation,
                          Permute, RepeatVector, Multiply, Lambda)
from keras.models import Model
from keras.optimizers import RMSprop
from keras import backend as K


def create_network(n_in, embed_size = 100, rnn_units = 256, use_attention = True):
    """ create the structure of the neural network """

    inputs = Input(shape = (n_in.shape[1], n_in.shape[2]))

    # we will use a dense layer as embedding
    x = Dense(embed_size, activation='relu')(inputs)

    x = LSTM(rnn_units, return_sequences=True)(x)

    if use_attention:

        x = LSTM(rnn_units, return_sequences=True)(x)

        # score each timestep with a tanh unit, then softmax over the sequence
        e = Dense(1, activation='tanh')(x)    # (batch, seq_len, 1)
        e = Reshape([-1])(e)                  # (batch, seq_len)
        alpha = Activation('softmax')(e)      # attention weights

        # broadcast the weights across the feature dimension
        alpha_repeated = Permute([2, 1])(RepeatVector(rnn_units)(alpha))

        # context vector: attention-weighted sum of the LSTM outputs
        c = Multiply()([x, alpha_repeated])
        c = Lambda(lambda xin: K.sum(xin, axis=1), output_shape=(rnn_units,))(c)
    
    else:
        c = LSTM(rnn_units)(x)
                                    
    bz_out = Dense(25, activation = 'relu', name = 'gen_oscs')(c)
   
    model = Model(inputs, bz_out)
    

    if use_attention:
        att_model = Model(inputs, alpha)
    else:
        att_model = None

    opti = RMSprop(lr = 0.001)
    model.compile(loss='mae', optimizer=opti)

    return model, att_model
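For reference, a hypothetical smoke test of this function with random data of the shapes described above (x_train is my stand-in name, not from the issue):

import numpy as np

# Hypothetical: random data with the (n, 240, 50) layout described above
x_train = np.random.rand(100, 240, 50).astype('float32')
model, att_model = create_network(x_train, embed_size=100,
                                  rnn_units=256, use_attention=True)
model.summary()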

And my code to train the network:


import os
import numpy as np
from keras.callbacks import ModelCheckpoint, EarlyStopping


def trainRNNGen(model, generator):

    # shuffled sample indices; drop the last entry, which we filled with 0s
    randomize = np.arange(len(generator) - 1)
    np.random.shuffle(randomize)

    # 90/10 train/validation split
    trainLimit = int(0.9 * len(randomize))
    valsteps = int(0.1 * len(randomize))

    folderpath = "/home/juanma/data/RNN_BZ/RNN_weights/"
    filepath = folderpath+"weights-{epoch:03d}-{loss:.4f}-{val_loss:.4f}.hdf5"    

    checkpoint1 = ModelCheckpoint(
        filepath, monitor='loss',
        verbose=0,
        save_best_only=True,
        period=5,
        mode='min'
    )

    checkpoint2 = ModelCheckpoint(
        os.path.join(folderpath, "weights.h5"),
        monitor='loss',
        verbose=0,
        save_best_only=True,
        mode='min'
    )

    early_stopping = EarlyStopping(
        monitor='loss'
        , restore_best_weights=True
        , patience = 100
    )

    callbacks_list = [
        checkpoint1
        , checkpoint2
        , early_stopping
     ]

    model.save_weights(os.path.join(folderpath, "weights.h5"))
    model.fit(x = customGenerator(generator, randomize[:trainLimit]), y = None,
        validation_data = customGenerator(generator, randomize[trainLimit:]),
        epochs=1000, steps_per_epoch = trainLimit, 
        validation_steps =  valsteps, 
        use_multiprocessing = False, callbacks=callbacks_list)

    return model
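customGenerator is not shown in the issue; purely as an assumption about its shape, it might look something like this (the tuple layout of generator is my guess):

import numpy as np

def customGenerator(data, indices):
    """Hypothetical stand-in for the real generator (not shown in this
    issue): loops forever, yielding one (input, target) pair per index.
    Assumes each data[i] is an (x, y) tuple of shapes (240, 50) and (25,)."""
    while True:
        for i in indices:
            x, y = data[i]
            yield x[np.newaxis], y[np.newaxis]   # add the batch dimension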

When I run both of these functions on my dataset with use_attention set to False, so the network is just stacked LSTMs, it works fine and the loss goes down. But when I set use_attention to True, it does not learn anything, and the loss does not go down, not even in the first iterations.

I think the attention model is somehow destroying the data, but at the moment I have no idea how.
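One way to check this (my suggestion, not something from the thread) is to inspect the weights returned by att_model; if they are stuck near uniform, the context vector is just an average and carries little signal:

import numpy as np

# Hypothetical check: x_batch is a small batch of inputs, shape (32, 240, 50)
x_batch = np.random.rand(32, 240, 50).astype('float32')
alpha = att_model.predict(x_batch)                 # shape (32, 240)
print(alpha.min(), alpha.max())
print(np.allclose(alpha, 1.0 / alpha.shape[1], atol=1e-3))  # near-uniform?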

I sort of fixed this by removing all the ReLUs, adding a sigmoid only at the end, and switching to binary cross-entropy.
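In terms of create_network above, a sketch of what that fixed network might look like (my reconstruction of the described changes, not the exact code from this thread):

from keras.layers import (Input, Dense, LSTM, Reshape, Activation,
                          Permute, RepeatVector, Multiply, Lambda)
from keras.models import Model
from keras.optimizers import RMSprop
from keras import backend as K

def create_network_fixed(n_in, embed_size=100, rnn_units=256):
    """Sketch of the fix described above: relu activations removed,
    sigmoid only on the output, binary cross-entropy as the loss."""
    inputs = Input(shape=(n_in.shape[1], n_in.shape[2]))
    x = Dense(embed_size)(inputs)            # embedding without relu
    x = LSTM(rnn_units, return_sequences=True)(x)
    x = LSTM(rnn_units, return_sequences=True)(x)

    # attention block, unchanged from the original code
    e = Dense(1, activation='tanh')(x)
    e = Reshape([-1])(e)
    alpha = Activation('softmax')(e)
    alpha_repeated = Permute([2, 1])(RepeatVector(rnn_units)(alpha))
    c = Multiply()([x, alpha_repeated])
    c = Lambda(lambda xin: K.sum(xin, axis=1), output_shape=(rnn_units,))(c)

    # sigmoid output, since the targets are pixel values in [0, 1]
    bz_out = Dense(25, activation='sigmoid', name='gen_oscs')(c)

    model = Model(inputs, bz_out)
    model.compile(loss='binary_crossentropy', optimizer=RMSprop(lr=0.001))
    return model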