jmschrei / avocado

Avocado is a multi-scale deep tensor factorization model that learns a latent representation of the human epigenome and enables imputation of epigenomic experiments that have not yet been performed.

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Confirmation of n_genomic_positions

Al-Murphy opened this issue · comments

Hey Jacob,

Just looking for a quick confirmation on the pilot region size used to train Avocado. I can see the number of genomic positions (n_genomic_positions) is 1126469 but is each position a 25 bp average? So then the total region passed to the model in one go is 28,161,725 bps? I think this makes sense since you divide by 10 and 200 for the 250bp and 5kbp embeddings.

Cheers,
Alan.

And sorry a bit of a follow up to this but I see in the build_model() function that the assay embedding layers look like:

genome_25bp_embedding = Embedding(n_genomic_positions, n_25bp_factors, 
		input_length=1, name="genome_25bp_embedding")

genome_250bp_embedding = Embedding(int(n_genomic_positions / 10) + 1,
		n_250bp_factors, input_length=1, name="genome_250bp_embedding")

genome_5kbp_embedding = Embedding(int(n_genomic_positions / 200) + 1, 
		n_5kbp_factors, input_length=1, name="genome_5kbp_embedding")

I note the input_dim changes but are these all passed the same 25bp average value? Or is assay averaged at the different levels first and then passed to each? Wondering in case I missed that in your code.

Question was answered via email:

As to your first post, yes, Avocado models the data at 25bp resolution where the signal values are the average within the bin, and are then arcsinh transformed.

You are correct that there are 25bp, 250bp, and 5kbp factors. These don't try to predict the data at different resolutions. Rather, all predictions are made at 25bp but use the unique 25bp factors, the 250bp factors which are shared with neighbors, and also the 5kbp factors that are shared with even more neighbors. The idea was that some factors were needed to model bin-specific signal but that most underlying phenomena were larger than 25bp, e.g. "this is a promoter" or "this is a gene body."