jmschrei / avocado

Avocado is a multi-scale deep tensor factorization model that learns a latent representation of the human epigenome and enables imputation of epigenomic experiments that have not yet been performed.

After adding new cell types, the new model file is smaller than the corresponding ENCODE model file

zhangyongzlm opened this issue

I have run into a problem. I added my own NPC cell types (e.g., C15, C17, C17C ... X2117) to the existing models, but the resulting model file is smaller than the corresponding ENCODE model file.
[screenshots: file sizes of the new model file and the ENCODE model file]
I also loaded the newly generated model file and confirmed that the NPC cell types were indeed added to the model.
[screenshot: loaded model showing the new cell types]

The following is my code for training the model.

import os
import argparse
import itertools
import math

# Use the GPU backend for Theano
os.environ["THEANO_FLAGS"] = "device=cuda0"

import numpy
import pandas as pd

numpy.random.seed(0)

from avocado import Avocado

parser = argparse.ArgumentParser(description="Train a new model")
parser.add_argument(
    "chrom", type=str, help="Specify the chromosome that training is performed in"
)
parser.add_argument(
    "--chromSize",
    action="store",
    dest="chromSize",
    type=str,
    default="./hg38.chrom.sizes",
    help="The file storing the chromosome size information",
)
parser.add_argument(
    "--batchsize",
    action="store",
    dest="batchsize",
    type=int,
    default=40000,
    help="Batch size for neural network predictions.",
)
args = parser.parse_args()

chrom_size = pd.read_table(args.chromSize, sep="\t", names=["chr", "size"])
chrom_size.set_index(["chr"], inplace=True)

celltypes = [
    "C15",
    "C17",
    "C17C",
    "C666-1",
    "NP460",
    "NP460_EBV",
    "NP69",
    "NP69_EBV",
    "NPC23",
    "NPC32",
    "NPC43",
    "NPC43noEBV",
    "NPC53",
    "NPC76",
    "X2117",
]
assays = [
    "ChIP-seq_H3K27ac_signal_p-value",
    "ChIP-seq_H3K4me1_signal_p-value",
    "ChIP-seq_H3K4me3_signal_p-value",
]

data = {}
for celltype, assay in itertools.product(celltypes, assays):
    # e.g. "ChIP-seq_H3K27ac_signal_p-value" -> "H3K27ac"
    assay_name = assay.split("_")[1]
    filename = "./signals/{}/{}/{}.{}.pval.signal.bw.{}.npz".format(
        celltype, assay_name, celltype, assay_name, args.chrom
    )
    print(filename)
    data[(celltype, assay)] = numpy.load(filename)[args.chrom]

model = Avocado.load("./avocado/.encode2018core-model/avocado-" + args.chrom)
size = chrom_size.loc[args.chrom]["size"]
model.fit_celltypes(data, epoch_size=math.ceil(size / args.batchsize), n_epochs=200)

model.save("./model/NPC_" + args.chrom)
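For context, the epoch_size passed to fit_celltypes above is just the number of batches needed to cover the chromosome once at the given batch size. A quick check of that arithmetic, mirroring math.ceil(size / args.batchsize) from the script, using hg38 chr1 (248,956,422 bp) as an example:

```python
import math

chrom_size = 248956422   # hg38 chr1 length in bp
batch_size = 40000       # the script's default --batchsize

# Number of batches per epoch, as computed in the script above
epoch_size = math.ceil(chrom_size / batch_size)
print(epoch_size)  # 6224
```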

That's weird, but I'm not sure it necessarily means there's a problem. Potentially, you have a higher compression level set for HDF5 files than I did. Can you still make predictions, and does everything work fine?
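One way to check the compression hypothesis is to walk the datasets in each HDF5 model file and compare their storage settings. A minimal sketch with h5py (the file paths in the comment are illustrative, not confirmed from the issue):

```python
import h5py

def summarize_hdf5(path):
    """Print each dataset's shape, dtype, and compression settings."""
    rows = []

    def visit(name, obj):
        # visititems calls this for every group and dataset in the file
        if isinstance(obj, h5py.Dataset):
            rows.append(
                (name, obj.shape, obj.dtype, obj.compression, obj.compression_opts)
            )

    with h5py.File(path, "r") as f:
        f.visititems(visit)
    for name, shape, dtype, comp, opts in rows:
        print(f"{name}: shape={shape} dtype={dtype} compression={comp} level={opts}")
    return rows

# Compare the two model files, e.g.:
# summarize_hdf5("./model/NPC_chr1")
# summarize_hdf5("./avocado/.encode2018core-model/avocado-chr1")
```

If the new file's datasets report gzip with a higher level (or the ENCODE file's report no compression), the size difference is explained without any loss of model weights.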