Validation errors are too high
yingc123 opened this issue
Hello,
I trained the model on the training data, holding one track out as a validation set.
But the validation MSE is too high even though the training MSE is quite good.
Here are my code and the resulting plots:
from __future__ import print_function

import os
import sys
import time
import itertools
import numpy
import matplotlib
matplotlib.use("agg")
import matplotlib.pyplot as plt

from avocado import Avocado

celltypes = ['E003', 'E017', 'E065', 'E116', 'E117']
assays = ['H3K4me3', 'H3K27me3', 'H3K36me3', 'H3K9me3', 'H3K4me1']

# Load every track except the held-out validation track
data = {}
for celltype, assay in itertools.product(celltypes, assays):
    if celltype == 'E003' and assay == 'H3K4me3':
        continue
    filename = '/home/ey712185/data/{}.{}.pilot.arcsinh.npz'.format(celltype, assay)
    data[(celltype, assay)] = numpy.load(filename)['arr_0']

model = Avocado(celltypes, assays)
start_time = time.time()

# Key the validation track explicitly -- after the loop above, `celltype`
# and `assay` still hold the last pair iterated, not ('E003', 'H3K4me3')
data_validation = {}
filename_v = '/home/ey712185/data/E003.H3K4me3.pilot.arcsinh.npz'
data_validation[('E003', 'H3K4me3')] = numpy.load(filename_v)['arr_0']

history = model.fit(data, data_validation, n_epochs=600)

end_time = time.time()
running_time = (end_time - start_time) / 3600.0
print("running time {}".format(running_time))

# summarize history for loss
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.savefig("avocado_cy_E003.pdf")
model.save("avocado_cy")
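One pitfall to watch for when building the validation dictionary: Python's for-loop variables persist after the loop ends, so indexing a dict with `celltype` and `assay` after the loop silently uses whatever pair the loop iterated last. The held-out track should be keyed explicitly as `('E003', 'H3K4me3')`. A minimal illustration of the behavior:

```python
import itertools

celltypes = ['E003', 'E017']
assays = ['H3K4me3', 'H3K27me3']

for celltype, assay in itertools.product(celltypes, assays):
    pass  # loop variables are NOT scoped to the loop in Python

# After the loop, celltype/assay still hold the LAST pair iterated,
# not the pair you intend to validate on.
print(celltype, assay)  # -> E017 H3K27me3
```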
The first time I used (E065, H3K4me3) as the validation track; the second time I used (E003, H3K4me3).
The validation MSE is always around 0.5-0.6, while the training MSE is below 0.05.
[Plot: train/validation loss curves for the (E065, H3K4me3) run]
Thanks in advance!
The full-sized Avocado model has tons of parameters, so it's very likely that you're overfitting the few tracks of data that you're using here. This would present itself as very high accuracy on the training set and low accuracy on a held-out validation track, which is what you observe. The tracks of data I provide on the GitHub repo aren't meant for training a high-quality model, but just to give an example of how to train one.
OK, thank you very much! I'll try with more data.
Maybe the data I chose isn't enough for training. Even after reducing the number of neurons, the results still weren't good enough.