After adding new cell types, the size of the new model file is smaller than that of the corresponding ENCODE model file
zhangyongzlm opened this issue · comments
I have run into a problem. I want to add my own NPC cell types (e.g., C15, C17, C17C ... X2117) to the existing models, but the resulting new model file is smaller than the corresponding ENCODE model file.
I also tried loading the newly generated model file and found that the NPC cell types were indeed added to the model.
The following is my code for training the model.
import os, sys
os.environ["THEANO_FLAGS"] = "device=cuda0"
import matplotlib.pyplot as plt
import seaborn
seaborn.set_style("whitegrid")
import itertools
import numpy
numpy.random.seed(0)
from avocado import Avocado
import pandas as pd
import argparse
import math

parser = argparse.ArgumentParser(description="Train a new model")
parser.add_argument(
    "chrom", type=str, help="Specify the chromosome that training is performed in"
)
parser.add_argument(
    "--chromSize",
    action="store",
    dest="chromSize",
    type=str,
    default="./hg38.chrom.sizes",
    help="The file storing the chrom size information",
)
parser.add_argument(
    "--batchsize",
    action="store",
    dest="batchsize",
    type=int,
    default=40000,
    help="Batch size for neural network predictions.",
)
args = parser.parse_args()

# Load chromosome sizes, indexed by chromosome name
chrom_size = pd.read_table(args.chromSize, sep="\t", names=["chr", "size"])
chrom_size.set_index(["chr"], inplace=True)

celltypes = [
    "C15",
    "C17",
    "C17C",
    "C666-1",
    "NP460",
    "NP460_EBV",
    "NP69",
    "NP69_EBV",
    "NPC23",
    "NPC32",
    "NPC43",
    "NPC43noEBV",
    "NPC53",
    "NPC76",
    "X2117",
]
assays = [
    "ChIP-seq_H3K27ac_signal_p-value",
    "ChIP-seq_H3K4me1_signal_p-value",
    "ChIP-seq_H3K4me3_signal_p-value",
]

# Load the signal track for every (cell type, assay) pair
data = {}
for celltype, assay in itertools.product(celltypes, assays):
    filename = "./signals/{}/{}/{}.{}.pval.signal.bw.{}.npz".format(
        celltype, assay.split("_")[1], celltype, assay.split("_")[1], args.chrom
    )
    print(filename)
    data[(celltype, assay)] = numpy.load(filename)[args.chrom]

# Fit the new cell types on top of the pre-trained ENCODE model and save
model = Avocado.load("./avocado/.encode2018core-model/avocado-" + args.chrom)
size = chrom_size.loc[args.chrom]["size"]
model.fit_celltypes(data, epoch_size=math.ceil(size / args.batchsize), n_epochs=200)
model.save("./model/NPC_" + args.chrom)
That's weird, but I'm not necessarily sure it means there's a problem. Potentially, you have a higher compression level set for hdf5 files than I did. Can you still make predictions, and does everything work fine?
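To illustrate the compression point: the same data saved with and without compression can differ substantially in file size, so a smaller file does not by itself mean the model lost anything. This is only a sketch of the principle using numpy's `.npz` format (which the signal files above already use), not Avocado's actual hdf5 serialization:

```python
import os
import tempfile
import numpy

# Identical array, saved once uncompressed and once compressed.
data = numpy.zeros((1000, 1000), dtype="float32")

with tempfile.TemporaryDirectory() as tmp:
    raw_path = os.path.join(tmp, "raw.npz")
    compressed_path = os.path.join(tmp, "compressed.npz")

    numpy.savez(raw_path, data=data)             # no compression
    numpy.savez_compressed(compressed_path, data=data)  # zlib compression

    raw_size = os.path.getsize(raw_path)
    compressed_size = os.path.getsize(compressed_path)

    # The compressed file is much smaller, yet holds the same array.
    print(raw_size > compressed_size)
```

A quick way to check whether compression explains the size gap would be to load both the ENCODE model and your new model and confirm the parameters round-trip correctly, as you already did by verifying the NPC cell types are present.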