Overview

The past two decades have seen an explosion of high-throughput genomic technologies that have revolutionized our ability to characterize the biological heterogeneity of samples. Several major consortia have been funded by the NIH and non-profit organizations to create cellular atlases of healthy and diseased tissues. These consortia use a variety of genomic and imaging assays to characterize molecular features of cells within complex tissue samples. Many molecular datasets are being generated across cancer types that contain multi-modal data collected from longitudinally and spatially related biological specimens. A major roadblock to sharing and integrating these datasets is that the data are stored in a wide variety of file formats or programming language-specific libraries, classes, or data structures.

Although a wide range of experimental protocols and platforms are available, an important commonality across these technologies is that they often produce a matrix of features that are measured in a set of observations. These feature and observation matrices (FOMs) are foundational for storing raw data from molecular assays (e.g. raw counts) and derived data from downstream analytical tools (e.g. a normalized matrix). A variety of file formats are used to store FOMs on file systems in different representations. For example, Tab Separated Value (.tsv/.txt) files can be used to store raw data in dense matrices, while Matrix Market (.mtx) files can be used to efficiently store raw data in sparse matrices. Although platform-independent, these formats do not readily capture relationships between matrices, do not inherently contain structures for feature and observation annotations, and do not easily provide random access to subsets of the data.

Several libraries and classes exist that can capture relationships between matrices and annotations, including AnnData in Python, the Seurat object in R, and the SingleCellExperiment package in R/Bioconductor. In contrast to file formats, these objects can capture more complex relationships between some types of FOMs as well as annotation data. However, they are tied to a specific programming platform, and conversion between objects is required to run tools from different platforms. To facilitate data sharing across groups, technologies, and assays, and to promote interoperability between downstream analysis tools, a detailed data schema describing the characteristics of FOMs needs to be developed to serve as a standard for the community.
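To make the idea of an FOM concrete, the following is a minimal sketch of a matrix kept together with its feature and observation annotations. It is an illustration only, not the schema proposed here: the class name `FeatureObservationMatrix`, its fields, the features-by-observations orientation, and the example gene and cell IDs are all hypothetical. It shows the two things plain matrix files do not provide on their own (annotations attached to the matrix, and subsetting by observation) alongside a sparse coordinate view similar in spirit to what a Matrix Market file stores.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class FeatureObservationMatrix:
    """Hypothetical container: a matrix plus feature and observation IDs."""

    values: np.ndarray       # assumed shape: (n_features, n_observations)
    feature_ids: list        # one ID per row, e.g. gene symbols
    observation_ids: list    # one ID per column, e.g. cell barcodes

    def __post_init__(self):
        # Annotations must line up with the matrix dimensions.
        expected = (len(self.feature_ids), len(self.observation_ids))
        if self.values.shape != expected:
            raise ValueError(
                f"matrix shape {self.values.shape} does not match annotations {expected}"
            )

    def subset_observations(self, keep):
        """Return a new FOM restricted to the named observations, in the order given."""
        idx = [self.observation_ids.index(obs) for obs in keep]
        return FeatureObservationMatrix(
            self.values[:, idx], list(self.feature_ids), list(keep)
        )

    def nonzero_triples(self):
        """Sparse coordinate view: 1-based (row, column, value) for nonzero entries."""
        rows, cols = np.nonzero(self.values)
        return [
            (int(r) + 1, int(c) + 1, self.values[r, c].item())
            for r, c in zip(rows, cols)
        ]


# Toy raw-count matrix: 2 features (rows) by 3 observations (columns).
raw = FeatureObservationMatrix(
    values=np.array([[0, 3, 0], [5, 0, 0]]),
    feature_ids=["geneA", "geneB"],
    observation_ids=["cell1", "cell2", "cell3"],
)
subset = raw.subset_observations(["cell2", "cell3"])
```

Objects such as AnnData, Seurat, and SingleCellExperiment offer far richer versions of this pairing; the sketch only illustrates why a shared, platform-independent description of such structures would be useful.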

pmb59 / mams
