jmschrei / avocado

Avocado is a multi-scale deep tensor factorization model that learns a latent representation of the human epigenome and enables imputation of epigenomic experiments that have not yet been performed.

Num of parameters

xxmen opened this issue · comments

If I set n_genomic_positions to the length of the chromosome and leave n_25bp_factors at its default of 25, then the number of parameters in this layer will be 25 * len(chr), which is really large. Should I train these parameters on the pilot region only? If so, how do I adapt them to the whole chromosome, given that the parameter counts differ (25 * len(chr) vs. 25 * len(pilot))? What is the right way to do this?

Thanks.

Good question. There are two ways that you can approach this.

The first way, which we use in the paper, is to train a model on the pilot regions, freeze the neural network, assay, and cell type factors, and then re-train only the genome factors for each chromosome. The genome factors are re-trained separately for each chromosome, so the differing parameter counts are not a problem, and because the frozen components are shared, all the genome factors end up in a common space across chromosomes.
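To illustrate the freeze-then-refit idea, here is a toy sketch using a plain linear factorization in NumPy. It is not Avocado's code: the real model is a neural network trained by gradient descent, and every name and size below (`track_factors`, `chrom_factors`, the array shapes) is made up for illustration. The point is only that the shared factors are learned once on a small region, and a new set of position factors of any length can then be fit against them.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tracks, n_factors = 12, 3  # toy sizes; a "track" stands in for a (cell type, assay) pair

# Synthetic data: signal = position factors @ shared track factors.
track_true = rng.normal(size=(n_factors, n_tracks))
pilot = rng.normal(size=(200, n_factors)) @ track_true   # small "pilot" region
chrom = rng.normal(size=(1000, n_factors)) @ track_true  # larger "chromosome"

# Step 1: learn the shared factors on the pilot region (truncated SVD here).
_, _, Vt = np.linalg.svd(pilot, full_matrices=False)
track_factors = Vt[:n_factors]  # frozen from this point on

# Step 2: keep track_factors fixed and solve only for the new position
# factors, i.e. least squares for chrom ≈ chrom_factors @ track_factors.
chrom_factors = np.linalg.lstsq(track_factors.T, chrom.T, rcond=None)[0].T

err = np.abs(chrom_factors @ track_factors - chrom).max()
print(chrom_factors.shape, err < 1e-8)
```

The pilot and chromosome position factors have different lengths (200 vs. 1000 rows), but both are expressed against the same frozen `track_factors`, which is what makes them comparable.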

The second approach is simply to train one model per chromosome, and not be concerned that the resulting genomic latent factors are not comparable across chromosomes.

If your goal is simply to produce the best imputations, the second approach is likely your best option. If your goal is to learn a consistent latent representation across the entire genome, you'll need the first approach.

Remember also that n_genomic_positions should not be the length of the region in base pairs, but that length divided by 25, because each position is a 25 bp bin.
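As a quick check of the arithmetic, the sketch below counts the parameters in the 25 bp genome factor layer. The helper names and the 50 Mbp chromosome length are hypothetical, and rounding the last partial bin up is an assumption rather than something stated in this thread.

```python
def n_positions(region_length_bp, resolution_bp=25):
    """Number of bins needed to tile a region (last partial bin rounded up)."""
    return -(-region_length_bp // resolution_bp)  # integer ceiling division


def genome_factor_params(region_length_bp, n_25bp_factors=25):
    """Parameters in the 25 bp layer: one factor vector per 25 bp bin."""
    return n_positions(region_length_bp) * n_25bp_factors


chr_len = 50_000_000  # hypothetical chromosome length in bp
print(n_positions(chr_len))           # value to pass as n_genomic_positions
print(genome_factor_params(chr_len))  # parameters in the 25 bp layer
```

With 25 factors per 25 bp bin, the layer holds roughly one parameter per base pair, not 25 per base pair as the question assumed.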