Michael Love, Rob Patro
tximeta performs numerous annotation and metadata gathering tasks on
behalf of users during the import of transcript quantifications from
Salmon or Sailfish into R/Bioconductor. The goal is to provide
something similar to the experience of GEOquery, which downloaded
microarray expression data from NCBI GEO and simultaneously brought
along associated pieces of metadata. Doing this automatically helps to
prevent costly bioinformatic errors. To use tximeta, all one needs
is the quant directory output from Salmon (version >= 0.8.1) or
Sailfish.
The key idea within tximeta is to store a signature of
the transcriptome sequence itself using a hash function, computed and
stored by the index and quant functions of Salmon and
Sailfish. This signature acts as the identifying information for later
building out rich annotations and metadata in the background, on
behalf of the user. This should greatly facilitate genomic workflows,
where the user can immediately begin overlapping their transcriptomic
data with other genomic datasets, e.g. epigenetic tracks such as ChIP
or methylation, as the data has been embedded within an organism and
genome context, including the proper genome version. We seek to
reduce the time bioinformatic analysts waste, prevent costly
bioinformatic mistakes, and promote computational reproducibility by
avoiding situations of annotation and metadata ambiguity, when files
are shared publicly or among collaborators but critical details go
missing.
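To make the signature idea concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the procedure Salmon, Sailfish, or tximeta actually use: the choice of SHA-256, the upper-casing of bases, the newline separator between transcripts, and the function names `read_fasta_sequences`, `transcriptome_signature`, and `lookup_metadata` are all hypothetical. The point it shows is that the digest depends only on the sequences and their order, not on FASTA headers or line wrapping, so it can serve as a key into a table of known transcriptomes.

```python
import hashlib


def read_fasta_sequences(lines):
    """Yield each transcript sequence from FASTA-formatted lines, in file order.

    Header lines (starting with '>') are dropped, and wrapped sequence
    lines belonging to one record are joined into a single string.
    """
    chunks = []
    seen_header = False
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seen_header:
                yield "".join(chunks)
            chunks = []
            seen_header = True
        elif seen_header:
            chunks.append(line)
    if seen_header:
        yield "".join(chunks)


def transcriptome_signature(lines):
    """Return a hex digest computed from the transcript sequences only.

    Assumptions of this sketch: SHA-256 as the hash function, bases
    upper-cased, and a newline appended after each transcript so that
    moving a boundary between transcripts changes the digest.
    """
    digest = hashlib.sha256()
    for seq in read_fasta_sequences(lines):
        digest.update(seq.upper().encode("ascii"))
        digest.update(b"\n")
    return digest.hexdigest()


def lookup_metadata(signature, table):
    """Return the metadata registered for a signature, or None if unknown."""
    return table.get(signature)


if __name__ == "__main__":
    # Toy FASTA content, made up for illustration only.
    fasta = [">tx1 some description", "ACGT", "ACGT", ">tx2", "GGCC"]
    sig = transcriptome_signature(fasta)

    # Hypothetical lookup table mapping signatures to annotation metadata.
    known = {sig: {"source": "Gencode", "organism": "Homo sapiens", "release": "26"}}
    print(sig)
    print(lookup_metadata(sig, known))
```

With a table like `known` in hand, an importer could attach the matching metadata to the quantifications, and report an unknown signature rather than guess when there is no match.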
Expect that this package will change a lot in the coming months. This is a prototype for how automatic generation of transcriptome metadata from a transcriptome sequence signature might work. Note that, as it is just a prototype, it only works for a single transcriptome (Gencode human v26), although the long-term goal is to automate signature generation for as many transcriptomes as possible, including different versions, sources, organisms, etc.
In addition, we are very interested in solving problem cases for this
approach, such as derived transcriptomes (e.g. filtered, or edited
after downloading from source) and de novo transcriptomes, such as
those generated by StringTie, Trinity, Scripture, Oases, etc.
We hope that for both of these cases tximeta might assist in
computational reproducibility of quantification, by encapsulating the
steps needed to generate the transcriptome and providing a signature for
checking equality.