jmschrei / avocado

Avocado is a multi-scale deep tensor factorization model that learns a latent representation of the human epigenome and enables imputation of epigenomic experiments that have not yet been performed.

Validation errors are too high

yingc123 opened this issue · comments

Hello,

I used the example training data to train the model, holding out one track as the validation set.
But the validation MSE is far too high, even though the training MSE looks quite good.

Here is my code and the resulting loss plots:

from __future__ import print_function

import os
import sys
import time
import numpy, itertools
from avocado import Avocado
import matplotlib

matplotlib.use("agg")
import matplotlib.pyplot as plt

celltypes = ['E003', 'E017', 'E065', 'E116', 'E117']
assays = ['H3K4me3', 'H3K27me3', 'H3K36me3', 'H3K9me3', 'H3K4me1']

data = {}
for celltype, assay in itertools.product(celltypes, assays):
    # hold out (E003, H3K4me3) from the training data to use as the validation track
    if celltype == 'E003' and assay == 'H3K4me3':
        continue
    filename = '/home/ey712185/data/{}.{}.pilot.arcsinh.npz'.format(celltype, assay)
    data[(celltype, assay)] = numpy.load(filename)['arr_0']

model = Avocado(celltypes, assays)

start_time = time.time()

data_validation = {}
filename_v = '/home/ey712185/data/E003.H3K4me3.pilot.arcsinh.npz'
# key the validation dictionary by the held-out pair explicitly; reusing the loop
# variables here would key it under the last training pair from the loop above
data_validation[('E003', 'H3K4me3')] = numpy.load(filename_v)['arr_0']

history = model.fit(data, data_validation, n_epochs = 600)

end_time = time.time()

running_time = (end_time - start_time) / 3600.0

print("running time {}".format(running_time))

# summarize history for loss
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.savefig("avocado_cy_E003.pdf")

model.save("avocado_cy")

The first time I used (E065, H3K4me3) as the validation track, and the second time (E003, H3K4me3).
The validation MSE is always about 0.5-0.6, while the training MSE is below 0.05.
[Loss curves with validation track E065, H3K4me3]

[Loss curves with validation track E003, H3K4me3]
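
For reference, a minimal sketch (not part of the original post) of how the validation MSE could be double-checked directly from the imputed track rather than read off the Keras history, assuming model.predict(celltype, assay) returns the imputed arcsinh-transformed signal for that track:

# compute the validation MSE against the held-out E003 H3K4me3 track
y_true = numpy.load('/home/ey712185/data/E003.H3K4me3.pilot.arcsinh.npz')['arr_0']
y_pred = model.predict('E003', 'H3K4me3')
print('validation MSE: {:.4f}'.format(((y_true - y_pred) ** 2).mean()))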

Thanks in advance!

The full-sized Avocado model has tons of parameters, so it's very likely that you're overfitting the few tracks of data that you're using here. This would present itself as very high accuracy on the training set, and low accuracy on a held out validation track, which you observe. The tracks of data I provide on the GitHub repo aren't meant for training a high quality model, but just to give an example of how to train a model.
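
As a minimal sketch of acting on this, one could either train on many more tracks or shrink the model so its parameter count is closer to the amount of available signal. The capacity arguments and values below (n_layers, n_nodes, n_celltype_factors, n_assay_factors) are assumptions for illustration; check help(Avocado) in your installed version for the exact names and defaults.

# a reduced-capacity model for small-scale experiments
# (hypothetical hyperparameter values; verify the argument names against your version)
small_model = Avocado(
    celltypes, assays,
    n_layers=1,             # fewer hidden layers
    n_nodes=256,            # fewer neurons per layer
    n_celltype_factors=16,  # smaller cell type embedding
    n_assay_factors=16,     # smaller assay embedding
)
history = small_model.fit(data, data_validation, n_epochs=200)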

OK, thank you very much! I will try with more data.
Maybe the data I chose is not enough for training. After reducing the number of neurons, the results were still not good enough.