Single-cell RNA-seq datasets used to compare different methods of differential expression analysis.
The scRNA-seq data set MDA-MB-231 ("count_MDA.Rdata") includes 160 single cells from a triple-negative breast cancer cell line, half of which were treated with metformin. The cells were captured using the Fluidigm C1 system, and the 80 control and 80 treated cells were sequenced separately on Illumina HiSeq 2500 machines. We then used Cufflinks to estimate isoform expression. This data set contains a total of 26,775 isoforms across 160 single cells. The average number of reads per cell is $\sim$649,000.
The mESCs data set is taken from a public scRNA-seq data set (GSE60749-GPL13112) in the Conquer repository (Soneson, C. & Robinson, M.D., 2018), which provides isoform-level expression estimates. The comparison is between 94 individual v6.5 mouse embryonic stem cells (mESCs) cultured in 2i+LIF (group 1) and 174 v6.5 mESCs cultured in serum+LIF (group 2). The libraries were prepared on the C1 System using the SMARTer Ultra Low RNA kit for Illumina Sequencing (Clontech) and protocols provided by Fluidigm. More details of the data can be found in the original paper (PMID: 25471879). The Conquer pipeline estimates isoform abundances using Salmon. This data set contains 112,593 isoforms across the 268 single cells of the two groups. The average number of reads per cell is $\sim$1.7M. The dataset can be found at http://imlspenticton.uzh.ch:3838/conquer.
The NPCs data set is a subset of the GSE102934 data from the NCBI Gene Expression Omnibus (PMID: 29724792). It comprises 720 neuronal progenitor cells (NPCs) derived from induced pluripotent stem (iPS) cells, half from a patient with Williams-Beuren syndrome and half from a healthy donor. Single-cell libraries were constructed with massively parallel single-cell RNA sequencing (MARS-Seq) and sequenced on the Illumina HiSeq 2500 platform. This data set contains a total of 41,020 isoforms from 720 single cells, and the average number of reads per cell is 18,600. Thus, this data set has a relatively large number of cells with low sequencing coverage. The dataset can be found at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE102934.
SRP073808 is from the Conquer repository (Soneson, C. & Robinson, M.D., 2018). The compared cells in SRP073808 are 77 in vitro cultured H7 embryonic stem cells (WiCell) and 87 cells from H7-derived downstream early mesoderm progenitors. The protocol type for this data set is SMARTer C1. More details of the data can be found in the original paper (PMID: 31524596). This data set contains a total of 158,784 isoforms from 164 single cells. Library size varies from 9,023 to 3.8M reads. The average number of reads per cell is $\sim$1.5M. The dataset can be found at http://imlspenticton.uzh.ch:3838/conquer.
GSE62270 is also from the Conquer repository (Soneson, C. & Robinson, M.D., 2018). It includes 1,344 cells from mouse intestinal organoids and 683 Reg4-positive intestinal cells. This data set was generated with the CEL-Seq protocol. More details of the data can be found in the original paper (PMID: 26287467). This data set contains a total of 96,798 isoforms from 2,027 single cells. Library size varies from 1 to 0.6M reads. The average number of reads per cell is $\sim$9,689. The dataset can be found at http://imlspenticton.uzh.ch:3838/conquer.
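The library-size figures quoted for the data sets above (minimum, maximum, and average reads per cell) can be computed from an isoform-by-cell count matrix. The following is a minimal Python sketch, assuming the counts are held as a list of rows (one per isoform) with one column per cell. The function name `library_size_summary` and the toy matrix are hypothetical and are not part of the distributed files.

```python
from statistics import mean


def library_size_summary(counts):
    """Summarise per-cell library sizes.

    counts: list of rows, one row per isoform, one column per cell
    (hypothetical in-memory layout, assumed for illustration).
    """
    n_cells = len(counts[0])
    # Library size of a cell = total reads summed over all isoforms in that cell.
    sizes = [sum(row[j] for row in counts) for j in range(n_cells)]
    return {"min": min(sizes), "max": max(sizes), "mean": mean(sizes)}


# Toy matrix: 3 isoforms (rows) x 3 cells (columns).
toy_counts = [
    [0, 5, 2],
    [3, 0, 1],
    [1, 1, 0],
]
summary = library_size_summary(toy_counts)
print(summary)
```

A very small minimum relative to the mean, as reported for GSE62270, indicates that some cells have almost no reads, which may matter when comparing differential expression methods.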
The simulated data sets ("BPsimData_0.05DE_lfc1.Rdata" & "BPsimData_0.05DE_lfc4.Rdata") for isoform expression of single cells are generated by the beta-Poisson model (Vu, T.N. et al., 2016). In particular, we generate the counts for each isoform from a beta-Poisson distribution with four parameters estimated from a real data set (mESCs). The four-parameter beta-Poisson model is as follows
The mean and variance of the model can be written as
and
where
Beta-Poisson models fitted to the real mESCs data set are used as baseline distributions for the simulation. For each isoform, expression values across samples in the control and treated groups are generated from the same beta-Poisson model. To mimic biological variation, 5% of isoforms are selected to be differentially expressed between the two groups (true DE isoforms). Specifically, the parameter
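The simulation scheme can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the code used to produce the distributed files. It assumes that a four-parameter beta-Poisson draw can be written as lam2 * Poisson(lam1 * B) with B ~ Beta(alpha, beta), and that differential expression is introduced by scaling lam1 by 2^lfc in the treated group. The parameter values, the group size of 80, and all function names are hypothetical.

```python
import math
import random


def poisson_draw(rng, lam):
    """Draw one Poisson(lam) variate with Knuth's multiplication method.

    Adequate for modest lam; a library sampler would be preferable for large lam.
    """
    if lam <= 0:
        return 0
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def simulate_isoform(rng, alpha, beta, lam1, lam2, n_cells):
    """ASSUMED model form: lam2 * Poisson(lam1 * Beta(alpha, beta)) per cell."""
    return [
        lam2 * poisson_draw(rng, lam1 * rng.betavariate(alpha, beta))
        for _ in range(n_cells)
    ]


def simulate_two_groups(rng, params, n_per_group, is_de, lfc):
    """Simulate control and treated values for one isoform."""
    alpha, beta, lam1, lam2 = params
    control = simulate_isoform(rng, alpha, beta, lam1, lam2, n_per_group)
    # ASSUMPTION: DE isoforms get lam1 scaled by 2**lfc in the treated group;
    # non-DE isoforms use the same model in both groups.
    lam1_treated = lam1 * 2.0 ** lfc if is_de else lam1
    treated = simulate_isoform(rng, alpha, beta, lam1_treated, lam2, n_per_group)
    return control, treated


def pick_de_isoforms(rng, n_isoforms, de_fraction=0.05):
    """Randomly mark a fraction of isoforms as true DE."""
    n_de = round(n_isoforms * de_fraction)
    return set(rng.sample(range(n_isoforms), n_de))


rng = random.Random(2016)
# Hypothetical baseline parameters (alpha, beta, lam1, lam2), one tuple per isoform;
# in practice these would come from models fitted to real data.
baseline = [(2.0, 3.0, 10.0, 1.0)] * 200
de_set = pick_de_isoforms(rng, len(baseline))
simulated = [
    simulate_two_groups(rng, params, 80, i in de_set, lfc=1)
    for i, params in enumerate(baseline)
]
```

Under the assumed form, the expected value of a draw is lam1 * lam2 * alpha / (alpha + beta), so the sample means of the simulated groups offer a quick sanity check on the generator.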
Related publications:
Mou, T. et al. Reproducibility of methods to detect differentially expressed genes from single-cell RNA sequencing.
Vu, T.N. et al. (2016) Beta-Poisson model for single-cell RNA-seq data analyses. Bioinformatics, btw202. http://bioinformatics.oxfordjournals.org/content/early/2016/04/18/bioinformatics.btw202
Soneson, C. & Robinson, M.D. (2018) Bias, robustness and scalability in single-cell differential expression analysis. Nature Methods 15(4):255-261. https://www.nature.com/articles/nmeth.4612