jmschrei / avocado

Avocado is a multi-scale deep tensor factorization model that learns a latent representation of the human epigenome and enables imputation of epigenomic experiments that have not yet been performed.

Validation errors are too high

yingc123 opened this issue · comments

Hello,

I used the example training data to train the model, holding out one track as the validation set.
But the validation MSE is far too high, even though the training MSE looks quite good.

Here is my code and the resulting loss plots:

from __future__ import print_function

import os
import sys
import time
import numpy, itertools
from avocado import Avocado
import matplotlib

matplotlib.use("agg")
import matplotlib.pyplot as plt

celltypes = ['E003', 'E017', 'E065', 'E116', 'E117']
assays = ['H3K4me3', 'H3K27me3', 'H3K36me3', 'H3K9me3', 'H3K4me1']

data = {}
for celltype, assay in itertools.product(celltypes, assays):
    # hold out (E003, H3K4me3) from the training data to use as the validation track
    if celltype == 'E003' and assay == 'H3K4me3':
        continue
    filename = '/home/ey712185/data/{}.{}.pilot.arcsinh.npz'.format(celltype, assay)
    data[(celltype, assay)] = numpy.load(filename)['arr_0']

model = Avocado(celltypes, assays)

start_time = time.time()

data_validation = {}
filename_v = '/home/ey712185/data/E003.H3K4me3.pilot.arcsinh.npz'
# key the validation dictionary by the held-out pair explicitly; reusing the loop
# variables here would key it under the last training pair from the loop above
data_validation[('E003', 'H3K4me3')] = numpy.load(filename_v)['arr_0']

history = model.fit(data, data_validation, n_epochs = 600)

end_time = time.time()

running_time = (end_time - start_time) / 3600.0

print("running time {}".format(running_time))

# summarize history for loss
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.savefig("avocado_cy_E003.pdf")

model.save("avocado_cy")

The first time I used (E065, H3K4me3) as the validation track, and the second time (E003, H3K4me3).
The validation MSE is always about 0.5-0.6, while the training MSE is below 0.05.
[Loss curves with validation track E065, H3K4me3]

[Loss curves with validation track E003, H3K4me3]
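
For reference, a minimal sketch (not part of the original post) of how the validation MSE could be double-checked directly from the imputed track rather than read off the Keras history, assuming model.predict(celltype, assay) returns the imputed arcsinh-transformed signal for that track:

# compute the validation MSE against the held-out E003 H3K4me3 track
y_true = numpy.load('/home/ey712185/data/E003.H3K4me3.pilot.arcsinh.npz')['arr_0']
y_pred = model.predict('E003', 'H3K4me3')
print('validation MSE: {:.4f}'.format(((y_true - y_pred) ** 2).mean()))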

Thanks in advance!

The full-sized Avocado model has tons of parameters, so it's very likely that you're overfitting the few tracks of data that you're using here. This would present itself as very high accuracy on the training set, and low accuracy on a held out validation track, which you observe. The tracks of data I provide on the GitHub repo aren't meant for training a high quality model, but just to give an example of how to train a model.
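
As a minimal sketch of acting on this, one could either train on many more tracks or shrink the model so its parameter count is closer to the amount of available signal. The capacity arguments and values below (n_layers, n_nodes, n_celltype_factors, n_assay_factors) are assumptions for illustration; check help(Avocado) in your installed version for the exact names and defaults.

# a reduced-capacity model for small-scale experiments
# (hypothetical hyperparameter values; verify the argument names against your version)
small_model = Avocado(
    celltypes, assays,
    n_layers=1,             # fewer hidden layers
    n_nodes=256,            # fewer neurons per layer
    n_celltype_factors=16,  # smaller cell type embedding
    n_assay_factors=16,     # smaller assay embedding
)
history = small_model.fit(data, data_validation, n_epochs=200)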

OK, thank you very much! I will try with more data.
Maybe the data I chose is not enough for training. After reducing the number of neurons, the results were still not good enough.