This repository provides a collaborative space to specify requirements, examples, and tests for on-disk file formats used to store expression matrices arising from single-cell RNA sequencing analyses. There are currently a wide variety of formats used for these data, including generic formats (e.g. CSV) and those designed specifically for this domain (e.g. Loom).
Please open an issue or make a PR if you'd like to add ideas or start a discussion!
Associated documents that inform this repo can be found in this Google Drive and are linked throughout the repository.
We meet from time to time; here are our meeting notes. Feel free to reach out and join!
- Catalog the requirements that existing or future formats may or may not satisfy
- Provide example datasets and cross-language loading scripts for them
- Provide test suites that evaluate performance of formats against requirements
In many discussions, requirements fall broadly into two categories: archival (long-term storage) and analysis (daily use with analytical software in, e.g., R or Python). As written, some of these are explicit requirements (e.g. self-describing), whereas others are dimensions along which different formats vary (e.g. size and speed).
- Long-term ability to read and parse the file (does it depend on APIs or language features that are likely to change?)
- Self-describing (are the semantics of the file contained within it?)
- Size (especially after compression)
- Partial IO (can portions of the file be read without loading the whole thing? See the sketch after this list.)
- Loading subsets of genes (e.g. fitting a regression model to each gene in parallel)
- Loading subsets of cells (e.g. for differential expression)
- Making arbitrary byte-range queries (e.g. for a web service)
- Speed of reading and writing data
- Storing additional metadata or features alongside the primary table (e.g. derived features or auxiliary measurements)
- Optimized for sparsity (affects both speed and size)
- Ability to handle large numbers of cells (e.g. out-of-memory, memory mapping, etc.)
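To make the partial-IO requirements concrete, here is a minimal sketch using `h5py`, assuming a hypothetical `matrix.h5` file holding a dense cells × genes dataset named `expression` (the file and dataset names are illustrative):

```python
import h5py
import numpy as np

# Open without reading into memory; HDF5 supports partial IO natively.
with h5py.File("matrix.h5", "r") as f:       # hypothetical file name
    dset = f["expression"]                   # hypothetical dataset: cells x genes

    # Subset of genes (columns), e.g. to fit a regression model per gene.
    gene_block = dset[:, 100:200]

    # Subset of cells (rows), e.g. for differential expression.
    cell_idx = np.array([0, 5, 42])          # h5py requires increasing indices
    cell_block = dset[cell_idx, :]
```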
Here we list formats that have been used or proposed thus far in the community (please add!):
- `.csv` (also includes TSV)
- `.mat` (MATLAB)
- `.mtx` (Matrix Market)
- `.h5` (HDF5)
- `.h5ad` (a wrapper of HDF5 used by `scanpy`)
- `.loom` (a wrapper of `.h5`)
- `.npy` (serialized `numpy` matrices)
- `.arrow` (not currently used but potentially promising)
- `.Robj` (serialized R objects, e.g. from Seurat)
- `.zarr` (chunked, compressed, N-dimensional arrays, Python)
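For a flavor of the cross-language loading scripts mentioned in the goals, here is a minimal Python sketch showing how several of these formats are commonly read (file names are placeholders):

```python
import numpy as np
import scipy.io
import anndata
import zarr

mtx = scipy.io.mmread("matrix.mtx")      # Matrix Market -> scipy sparse COO matrix
dense = np.load("matrix.npy")            # serialized NumPy array
adata = anndata.read_h5ad("data.h5ad")   # scanpy's AnnData container
z = zarr.open("matrix.zarr", mode="r")   # chunked array, read lazily
```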
We benchmark data formats and APIs on standard matrices. These matrices are available in public S3 and Google Cloud Storage buckets: `s3://matrix-format-test-data` and `gs://matrix-format-test-data`.
Each bucket has two relevant folders, `matrices` and `sources`. The `sources` folder contains raw data from a number of different data sources, in the form in which they were originally delivered. These are meant to span a range of matrix sizes and properties so we can develop tests for diverse use cases:
| Source Name | Description |
|---|---|
| tenx_mouse_neuron_1M | The 1.3 million mouse neuron dataset released by 10x. |
| tenx_mouse_neuron_20k | A sample of 20k cells from the 1.3 million cell dataset, also provided by 10x. |
| immune_cell_census_cord_blood | Cord blood cells from the Immune Cell Atlas dataset released for the HCA preview. 384,000 cells sequenced with 10x. |
| immune_cell_census_bone_marrow | Bone marrow cells from the Immune Cell Atlas dataset released for the HCA preview. 378,000 cells sequenced with 10x. |
| GSE84465_whole | 3589 cells sequenced with SmartSeq2 from GEO Series GSE84465. |
| GSE84465_split | 3589 cells sequenced with SmartSeq2 from GEO Series GSE84465, split so there is one matrix file per cell. |
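Because the buckets are public, their contents can be listed anonymously. A minimal sketch with `boto3` using an unsigned client:

```python
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous client: the test-data bucket is public, so no credentials are needed.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
resp = s3.list_objects_v2(Bucket="matrix-format-test-data", Prefix="sources/")
for obj in resp.get("Contents", []):
    print(obj["Key"])
```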
The `matrices` folder contains two levels of subfolders. The first level is the matrix format; these are the formats being tested for performance on a set of different tasks.
| Format Name | Description |
|---|---|
| anndata | The HDF5-based format used by scanpy |
| feather | Feather format |
| hdf5_10000_10000_chunks_3_compression | Dense matrix in an HDF5 dataset with (10000, 10000) chunks and gzip=3 compression |
| hdf5_1000_1000_chunks_3_compression | Dense matrix in an HDF5 dataset with (1000, 1000) chunks and gzip=3 compression |
| hdf5_25000_25000_chunks_3_compression | Dense matrix in an HDF5 dataset with (25000, 25000) chunks and gzip=3 compression |
| matrix_market | Matrix Market format |
| numpy | NumPy .npy format |
| loom | Loom format |
| parquet | Parquet format |
| sparse_hdf5_csc | Compressed Sparse Column matrix in HDF5, similar to what is produced by Cell Ranger |
| sparse_hdf5_csr | Compressed Sparse Row matrix in HDF5 |
| zarr_10000_10000 | Zarr directory store with (10000, 10000) chunks |
| zarr_1000_1000 | Zarr directory store with (1000, 1000) chunks |
| zarr_25000_25000 | Zarr directory store with (25000, 25000) chunks |
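To illustrate how the chunked HDF5 and Zarr variants above might be produced, here is a sketch using `h5py` and `zarr` (the matrix here is toy data, not the benchmark matrices):

```python
import h5py
import numpy as np
import zarr

matrix = np.random.poisson(0.1, size=(2000, 2000)).astype(np.int32)  # toy data

# hdf5_1000_1000_chunks_3_compression: (1000, 1000) chunks, gzip level 3.
with h5py.File("dense.h5", "w") as f:
    f.create_dataset("matrix", data=matrix,
                     chunks=(1000, 1000),
                     compression="gzip", compression_opts=3)

# zarr_1000_1000: a Zarr directory store with the same chunking.
z = zarr.open("matrix.zarr", mode="w", shape=matrix.shape,
              chunks=(1000, 1000), dtype=matrix.dtype)
z[:] = matrix
```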
The next level of subfolders is the source, as detailed above. So, for example, `s3://matrix-format-test-data/matrices/anndata/immune_cell_census_bone_marrow/anndata_immune_cell_census_bone_marrow.h5ad` is the Immune Cell Atlas bone marrow data converted into the scanpy AnnData format.
The S3 paths are stored in a more machine-readable form in `data.yaml`.
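To consume those paths programmatically, something like the following should work; this assumes `data.yaml` maps source names to bucket paths, which may differ from the actual schema:

```python
import yaml

# Load the machine-readable path listing; the flat name -> path schema
# shown here is an assumption about data.yaml's layout.
with open("data.yaml") as f:
    sources = yaml.safe_load(f)

for name, path in sources.items():
    print(name, path)
```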
The tests are split into "tasks", where a task is some operation on an expression matrix. Within each task, there are one or more "tests", which accomplish the task using one of the matrix formats.
Each test has a `test.yaml` file that indicates how the test should be run. The `test.yaml` has the following keys:
- `file_location`: Either "local" or "remote". If "local", the benchmarking code first copies the input files from S3 to the local drive before running the test.
- `sources`: A list of the data sources on which the test should be run. Refers to keys in the `data.yaml` file.
- `formats`: A list of the matrix formats on which the test should be run. The test is run on each combination of source and format.
- `expected_output`: A description of some simple properties that the output matrix should have: shape, sum of all entries, and count of non-zero entries. These are used to verify that the test produced the correct output.
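A hypothetical `test.yaml` putting these keys together (the source and format names come from the tables above; the `expected_output` field names and values are illustrative):

```yaml
file_location: local            # copy inputs from S3 before running
sources:
  - tenx_mouse_neuron_20k       # keys from data.yaml
formats:
  - loom
  - zarr_1000_1000
expected_output:
  shape: [20000, 27998]         # illustrative values only
  sum: 123456789.0
  non_zero_count: 9876543
```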
The test also needs to specify a Docker container with an entrypoint that accepts the following command-line arguments:
- `--input_paths`: paths to either the local or remote input matrix files
- `--output_path`: path where the output matrix will be written
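A minimal sketch of such an entrypoint in Python with `argparse`; the actual parsing and task logic are up to each test's container:

```python
import argparse

parser = argparse.ArgumentParser(description="Benchmark test entrypoint")
parser.add_argument("--input_paths", nargs="+", required=True,
                    help="local or remote paths to the input matrix files")
parser.add_argument("--output_path", required=True,
                    help="path where the output matrix will be written")
args = parser.parse_args()

# ... read matrices from args.input_paths, perform the task,
# then write the result to args.output_path ...
```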
The details of how the inputs are converted to the outputs live in the test's Docker container and whatever scripts or other code are added to it. The benchmarking code is responsible for staging inputs, running the container, and measuring performance.
Feel free to contribute!