jmschrei / avocado

Avocado is a multi-scale deep tensor factorization model that learns a latent representation of the human epigenome and enables imputation of epigenomic experiments that have not yet been performed.

After adding new cell types, the new model file is smaller than the corresponding ENCODE model file

zhangyongzlm opened this issue

I have run into a problem. I added my own NPC cell types (e.g., C15, C17, C17C ... X2117) to the existing models, but the resulting model file is smaller than the corresponding ENCODE model file.
[screenshots: file sizes of the new model file and the ENCODE model file]
I also loaded the newly generated model file and confirmed that the NPC cell types were indeed added to the model.
[screenshot: loaded model showing the new cell types]

The following is my code for training the model.

import os
import argparse
import itertools
import math

# Use the GPU backend for Theano
os.environ["THEANO_FLAGS"] = "device=cuda0"

import numpy
import pandas as pd

numpy.random.seed(0)

from avocado import Avocado

parser = argparse.ArgumentParser(description="Train a new model")
parser.add_argument(
    "chrom", type=str, help="Specify the chromosome that training is performed in"
)
parser.add_argument(
    "--chromSize",
    action="store",
    dest="chromSize",
    type=str,
    default="./hg38.chrom.sizes",
    help="The file storing the chromosome size information",
)
parser.add_argument(
    "--batchsize",
    action="store",
    dest="batchsize",
    type=int,
    default=40000,
    help="Batch size for neural network predictions.",
)
args = parser.parse_args()

chrom_size = pd.read_table(args.chromSize, sep="\t", names=["chr", "size"])
chrom_size.set_index(["chr"], inplace=True)

celltypes = [
    "C15",
    "C17",
    "C17C",
    "C666-1",
    "NP460",
    "NP460_EBV",
    "NP69",
    "NP69_EBV",
    "NPC23",
    "NPC32",
    "NPC43",
    "NPC43noEBV",
    "NPC53",
    "NPC76",
    "X2117",
]
assays = [
    "ChIP-seq_H3K27ac_signal_p-value",
    "ChIP-seq_H3K4me1_signal_p-value",
    "ChIP-seq_H3K4me3_signal_p-value",
]

data = {}
for celltype, assay in itertools.product(celltypes, assays):
    # e.g. "ChIP-seq_H3K27ac_signal_p-value" -> "H3K27ac"
    assay_name = assay.split("_")[1]
    filename = "./signals/{}/{}/{}.{}.pval.signal.bw.{}.npz".format(
        celltype, assay_name, celltype, assay_name, args.chrom
    )
    print(filename)
    data[(celltype, assay)] = numpy.load(filename)[args.chrom]

model = Avocado.load("./avocado/.encode2018core-model/avocado-" + args.chrom)
size = chrom_size.loc[args.chrom]["size"]
model.fit_celltypes(data, epoch_size=math.ceil(size / args.batchsize), n_epochs=200)

model.save("./model/NPC_" + args.chrom)
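For context, the epoch_size passed to fit_celltypes above is just the number of batches needed to cover the chromosome once at the given batch size. A quick check of that arithmetic, mirroring math.ceil(size / args.batchsize) from the script, using hg38 chr1 (248,956,422 bp) as an example:

```python
import math

chrom_size = 248956422   # hg38 chr1 length in bp
batch_size = 40000       # the script's default --batchsize

# Number of batches per epoch, as computed in the script above
epoch_size = math.ceil(chrom_size / batch_size)
print(epoch_size)  # 6224
```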

That's weird, but I'm not sure it necessarily means there's a problem. Potentially, you have a higher compression level set for HDF5 files than I did. Can you still make predictions, and does everything work fine?
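One way to check the compression hypothesis is to walk the datasets in each HDF5 model file and compare their storage settings. A minimal sketch with h5py (the file paths in the comment are illustrative, not confirmed from the issue):

```python
import h5py

def summarize_hdf5(path):
    """Print each dataset's shape, dtype, and compression settings."""
    rows = []

    def visit(name, obj):
        # visititems calls this for every group and dataset in the file
        if isinstance(obj, h5py.Dataset):
            rows.append(
                (name, obj.shape, obj.dtype, obj.compression, obj.compression_opts)
            )

    with h5py.File(path, "r") as f:
        f.visititems(visit)
    for name, shape, dtype, comp, opts in rows:
        print(f"{name}: shape={shape} dtype={dtype} compression={comp} level={opts}")
    return rows

# Compare the two model files, e.g.:
# summarize_hdf5("./model/NPC_chr1")
# summarize_hdf5("./avocado/.encode2018core-model/avocado-chr1")
```

If the new file's datasets report gzip with a higher level (or the ENCODE file's report no compression), the size difference is explained without any loss of model weights.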