Improving memory efficiency for CITE-seq data analysis

LTLA opened this issue 5 years ago · comments

I would like this to work better than it currently does:

## ------------------------------------------------------------------------
# Caching it locally with BiocFileCache to avoid repeating it.
library(BiocFileCache)
bfc <- BiocFileCache(ask=FALSE)
stuff <- bfcrpath(bfc, file.path("http://cf.10xgenomics.com",
    "samples/cell-exp/3.0.0/pbmc_10k_protein_v3",
    "pbmc_10k_protein_v3_filtered_feature_bc_matrix.tar.gz"))
untar(stuff, exdir=tempdir())

# Loading it in as a SingleCellExperiment object.
library(DropletUtils)
sce <- read10xCounts(file.path(tempdir(), "filtered_feature_bc_matrix"))

## ------------------------------------------------------------------------
sce <- splitAltExps(sce, rowData(sce)$Type)
altExpNames(sce)
altExp(sce) # Can be used like any other SingleCellExperiment. 

# Densify the antibody counts; this is the smaller of the two matrices.
counts(altExp(sce)) <- as.matrix(counts(altExp(sce)))
counts(altExp(sce))[,1:10] # sneak peek

## ------------------------------------------------------------------------
library(scater)
# Flag cells with unusually high mitochondrial content.
mito <- grep("^MT-", rowData(sce)$Symbol)
df <- perCellQCMetrics(sce, subsets=list(Mito=mito))
mito.discard <- isOutlier(df$subsets_Mito_percent, type="higher")

# Flag cells that detect fewer than half the median number of antibodies.
ab.detected <- df$`altexps_Antibody Capture_detected`
med.detected <- median(ab.detected)
threshold <- med.detected/2
ab.discard <- ab.detected < threshold

# Remove cells that fail either filter.
discard <- ab.discard | mito.discard
sce <- sce[,!discard]

## ------------------------------------------------------------------------
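library(DelayedMatrixStats)
# TODO: move into the DropletUtils package.
# Use the average antibody profile across cells as the ambient reference,
# then take each cell's median ratio to that reference as its size factor.
ambient <- rowMeans(counts(altExp(sce)))
sf.amb <- colMedians(counts(altExp(sce))/ambient)
# Center the size factors at unity.
sf.amb <- sf.amb/mean(sf.amb)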

## ------------------------------------------------------------------------
sizeFactors(altExp(sce)) <- sf.amb
sce <- logNormCounts(sce, use_altexps=TRUE)

## ------------------------------------------------------------------------
library(MOFA)
mobj <- createMOFAobject(list(genes=logcounts(sce),
    tags=logcounts(altExp(sce))))
mobj <- prepareMOFA(mobj)
mobj <- runMOFA(mobj)

This is hitting swap on my 16 GB laptop, and there are only about 8000 cells involved. I'm a bit bemused about why this is occurring: even densifying the larger matrix should only cost ~2 GB.
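For reference, the ~2 GB figure can be checked with a quick back-of-the-envelope calculation. This is only a sketch: the gene count below is an assumed value for illustration (it is not stated in this post), the cell count is the approximate number quoted above, and `dense.gib` is a hypothetical helper, not part of any package.

```r
# Rough size of a dense double-precision matrix, in GiB.
# 'dense.gib' is a hypothetical helper written for this illustration.
dense.gib <- function(nrow, ncol, bytes.per.value = 8) {
    nrow * ncol * bytes.per.value / 2^30
}

n.genes <- 33538 # assumption: a typical 10x human reference gene count
n.cells <- 8000  # approximate number of cells quoted in the post
est <- dense.gib(n.genes, n.cells)
round(est, 2)
```

On a real object, `print(object.size(x), units = "auto")` reports the actual footprint, which is a useful comparison against the estimate before and after each step.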

Session info

R version 3.6.0 Patched (2019-05-02 r76458)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.3 LTS

Matrix products: default
BLAS:   /home/luna/Software/R/R-3-6-branch-dev/lib/libRblas.so
LAPACK: /home/luna/Software/R/R-3-6-branch-dev/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] MOFA_1.1.1                   DelayedMatrixStats_1.7.2    
 [3] scater_1.13.27               ggplot2_3.2.1               
 [5] DropletUtils_1.5.12          SingleCellExperiment_1.7.11 
 [7] SummarizedExperiment_1.15.10 DelayedArray_0.11.8         
 [9] BiocParallel_1.19.5          matrixStats_0.55.0          
[11] Biobase_2.45.1               GenomicRanges_1.37.17       
[13] GenomeInfoDb_1.21.2          IRanges_2.19.18             
[15] S4Vectors_0.23.25            BiocGenerics_0.31.6         
[17] BiocFileCache_1.9.1          dbplyr_1.4.2                

loaded via a namespace (and not attached):
 [1] viridis_0.5.1               httr_1.4.1                 
 [3] edgeR_3.27.14               BiocSingular_1.1.7         
 [5] jsonlite_1.6                foreach_1.4.7              
 [7] bit64_0.9-7                 viridisLite_0.3.0          
 [9] R.utils_2.9.0               assertthat_0.2.1           
[11] dqrng_0.2.1                 blob_1.2.0                 
[13] GenomeInfoDbData_1.2.2      vipor_0.4.5                
[15] ggrepel_0.8.1               corrplot_0.84              
[17] pillar_1.4.2                RSQLite_2.1.2              
[19] backports_1.1.5             lattice_0.20-38            
[21] reticulate_1.13             glue_1.3.1                 
[23] limma_3.41.18               digest_0.6.22              
[25] RColorBrewer_1.1-2          XVector_0.25.0             
[27] colorspace_1.4-1            cowplot_1.0.0              
[29] plyr_1.8.4                  Matrix_1.2-17              
[31] R.oo_1.22.0                 pkgconfig_2.0.3            
[33] pheatmap_1.0.12             zlibbioc_1.31.0            
[35] purrr_0.3.3                 scales_1.0.0               
[37] HDF5Array_1.13.11           MultiAssayExperiment_1.11.9
[39] tibble_2.1.3                withr_2.1.2                
[41] lazyeval_0.2.2              magrittr_1.5               
[43] crayon_1.3.4                memoise_1.1.0              
[45] R.methodsS3_1.7.1           doParallel_1.0.15          
[47] beeswarm_0.2.3              tools_3.6.0                
[49] stringr_1.4.0               Rhdf5lib_1.7.6             
[51] munsell_0.5.0               locfit_1.5-9.1             
[53] irlba_2.3.3                 compiler_3.6.0             
[55] rsvd_1.0.2                  rlang_0.4.0                
[57] rhdf5_2.29.6                grid_3.6.0                 
[59] RCurl_1.95-4.12             iterators_1.0.12           
[61] BiocNeighbors_1.3.5         rappdirs_0.3.1             
[63] bitops_1.0-6                codetools_0.2-16           
[65] gtable_0.3.0                DBI_1.0.0                  
[67] curl_4.2                    reshape2_1.4.3             
[69] R6_2.4.0                    gridExtra_2.3              
[71] dplyr_0.8.3                 bit_1.1-14                 
[73] zeallot_0.1.0               stringi_1.4.3              
[75] ggbeeswarm_0.6.0            Rcpp_1.0.2                 
[77] vctrs_0.2.0                 tidyselect_0.2.5

Ricard Argelaguet · Answer 1 · Thu Oct 24 2019 17:57:06 GMT+0800 (China Standard Time)

Hi Aaron,
MOFA v1 was not designed for more than ~1000 samples. You will run into speed issues as well as memory issues.
We are about to release the new version, which is tailored to single-cell data and should significantly improve this. I'll invite you to the repository.
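
As a stopgap that is not suggested anywhere in this thread, one could shrink the input before passing it to MOFA v1 by keeping only the most variable features and a random subset of cells. The sketch below uses a simulated matrix as a stand-in for the log-expression values; the matrix dimensions, the cutoff of 500 features and the 100 cells are all arbitrary assumptions for illustration.

```r
set.seed(1)

# Simulated stand-in for a log-expression matrix (features x cells).
mat <- matrix(rnorm(2000 * 300), nrow = 2000,
    dimnames = list(paste0("gene", 1:2000), paste0("cell", 1:300)))

# Keep the 500 features with the highest variance across cells.
vars <- apply(mat, 1, var)
top <- order(vars, decreasing = TRUE)[1:500]

# Keep a random subset of 100 cells.
keep.cells <- sample(ncol(mat), 100)

sub <- mat[top, keep.cells]
dim(sub)
```

The same cell indices would need to be applied to every view (genes and tags) so that the columns stay matched across matrices.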

Aaron Lun · Answer 2 · Thu Oct 24 2019 23:55:28 GMT+0800 (China Standard Time)

I'll continue this discussion off-line.

wyattmcdonnell · Answer 3 · Wed Nov 06 2019 08:40:02 GMT+0800 (China Standard Time)

@rargelaguet could you invite me to that repo as well? thanks!

Ricard Argelaguet · Answer 4 · Wed Nov 06 2019 16:03:02 GMT+0800 (China Standard Time)

I am still preparing the vignettes/documentation and will release it in the next couple of days. Can it wait until Monday next week?

wyattmcdonnell · Answer 5 · Fri Nov 08 2019 04:53:49 GMT+0800 (China Standard Time)

how cool! yes, I'd be happy to wait until Monday of next week. thanks again @rargelaguet!