Information/documentation on data used to train model and required inputs
akmorrow13 opened this issue · comments
Hello,
I have 2 questions:
- I was wondering if there is documentation on how you created the data used to train the available models (i.e., what the .npy file format contains, what input files were used, and whether the code to perform the transformation is available)? This would be especially useful for applying Avocado to a new dataset.
- Additionally, I was wondering if the transformed data used to train the full models is available somewhere. The data available in `/data` wouldn't be sufficient to train anything more than a toy model.
Thanks in advance.
Howdy
Thanks for the questions.
- I'm not sure if I made it available here, but I basically just extracted a chromosome of signal from a bigWig file, truncated it to a length divisible by 25, and took the average over each 25 bp bin. Here is code, off the top of my head (so it may have small typos), that does that:
```python
import numpy
import pyBigWig

chrom = 'chr18'
bw = pyBigWig.open(filename, "r")

# Extract the full signal for the chromosome as a numpy array
signal = bw.values(chrom, 0, bw.chroms(chrom), numpy=True)
signal = numpy.nan_to_num(signal)  # pyBigWig stores 0 counts as NaN

# Truncate to a multiple of 25, then average over each 25 bp bin
n = signal.shape[0] // 25 * 25
signal = signal[:n].reshape(n // 25, 25).mean(axis=1)

# arcsinh-transform to compress the dynamic range
signal = numpy.arcsinh(signal)
```
You would then create a dictionary where the keys are (cell type, assay) tuples and the values are numpy arrays like `signal`. The .npy file format is just numpy's built-in binary format for storing a single array (written with `numpy.save`, read with `numpy.load`).
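Putting those pieces together, the training input is just a plain Python dict. A minimal sketch; the cell type and assay names here are illustrative Roadmap-style identifiers, not ones taken from the thread:

```python
import numpy

# Stand-ins for two binned, arcsinh-transformed tracks; in practice each
# would come from the bigWig extraction above, or from numpy.load on a
# previously saved .npy file
signal_a = numpy.arcsinh(numpy.random.rand(1000))
signal_b = numpy.arcsinh(numpy.random.rand(1000))

# The training dictionary: (cell type, assay) tuples map to signal arrays
data = {
    ('E003', 'H3K4me3'): signal_a,
    ('E003', 'H3K36me3'): signal_b,
}
```

All of the arrays should cover the same chromosome at the same 25 bp resolution, so they end up the same length.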
Depending on the model, the data sets were derived either from the Roadmap compendium (https://egg2.wustl.edu/roadmap/data/byFileType/signal/consolidated/macs2signal/pval/) or the ENCODE compendium. The second data set is massive (>3k bigwigs) and difficult to share because it lives on the ENCODE portal. I've added a metadata spreadsheet (https://github.com/jmschrei/avocado/blob/master/data/ENCODE2018Core.tsv) with download URLs (concatenate those to https://www.encodeproject.org/).
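Building the full download URLs from the spreadsheet is a one-liner per row. A sketch of that concatenation; the column name `url`, the sample row, and the `ENCFF000XYZ` accession are assumptions for illustration, so check the actual header of ENCODE2018Core.tsv:

```python
import csv
import io

# Relative paths from the metadata TSV get concatenated onto the portal base
base = 'https://www.encodeproject.org/'
sample = ('url\tcelltype\tassay\n'
          '/files/ENCFF000XYZ/@@download/ENCFF000XYZ.bigWig\tK562\tDNase\n')

reader = csv.DictReader(io.StringIO(sample), delimiter='\t')
urls = [base + row['url'].lstrip('/') for row in reader]
```

With the real file you would pass `open('ENCODE2018Core.tsv')` to `csv.DictReader` instead of the in-memory sample.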
Sorry for not providing much documentation on it. By the time everything was wrapped up I was pretty burnt out and wasn't sure whether people would use that functionality rather than just the imputed tracks and learned representations.
- The link above and the metadata spreadsheet should contain all the bigWigs. This is a lot of data, though; managing it all was definitely a difficult part of the project overall.
Let me know if you have any other questions!
This is great, thank you so much for the details! I will close this.