bioFAM / MOFA

Multi-Omics Factor Analysis

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Improving memory efficiency for CITE-seq data analysis

LTLA opened this issue · comments

I would like this to work better than it currently does:

## ------------------------------------------------------------------------
# Caching it locally with BiocFileCache to avoid repeating it.
library(BiocFileCache)
bfc <- BiocFileCache(ask=FALSE)
stuff <- bfcrpath(bfc, file.path("http://cf.10xgenomics.com",
    "samples/cell-exp/3.0.0/pbmc_10k_protein_v3",
    "pbmc_10k_protein_v3_filtered_feature_bc_matrix.tar.gz"))
untar(stuff, exdir=tempdir())

# Loading it in as a SingleCellExperiment object.
library(DropletUtils)
sce <- read10xCounts(file.path(tempdir(), "filtered_feature_bc_matrix"))

## ------------------------------------------------------------------------
sce <- splitAltExps(sce, rowData(sce)$Type)
altExpNames(sce)
altExp(sce) # Can be used like any other SingleCellExperiment. 

counts(altExp(sce)) <- as.matrix(counts(altExp(sce)))
counts(altExp(sce))[,1:10] # sneak peek

## ------------------------------------------------------------------------
library(scater)
mito <- grep("^MT-", rowData(sce)$Symbol)
df <- perCellQCMetrics(sce, subsets=list(Mito=mito))
mito.discard <- isOutlier(df$subsets_Mito_percent, type="higher")

ab.detected <- df$`altexps_Antibody Capture_detected`
med.detected <- median(ab.detected)
threshold <- med.detected/2
ab.discard <- ab.detected < threshold

discard <- ab.discard | mito.discard
sce <- sce[,!discard]

## ------------------------------------------------------------------------
library(DelayedMatrixStats)
# TODO: move into the DropletUtils package.
ambient <- rowMeans(counts(altExp(sce)))
sf.amb <- colMedians(counts(altExp(sce))/ambient)
sf.amb <- sf.amb/mean(sf.amb)

## ------------------------------------------------------------------------
sizeFactors(altExp(sce)) <- sf.amb
sce <- logNormCounts(sce, use_altexps=TRUE)

## ------------------------------------------------------------------------
library(MOFA)
mobj <- createMOFAobject(list(genes=logcounts(sce),
    tags=logcounts(altExp(sce))))
mobj <- prepareMOFA(mobj)
mobj <- runMOFA(mobj)

This is hitting swap on my 16 GB laptop, and there's only 8000 cells involved. I'm a bit bemused about why this is occurring - even densifying the larger matrix should only cost ~2GB.

Session info
R version 3.6.0 Patched (2019-05-02 r76458)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.3 LTS

Matrix products: default
BLAS:   /home/luna/Software/R/R-3-6-branch-dev/lib/libRblas.so
LAPACK: /home/luna/Software/R/R-3-6-branch-dev/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] MOFA_1.1.1                   DelayedMatrixStats_1.7.2    
 [3] scater_1.13.27               ggplot2_3.2.1               
 [5] DropletUtils_1.5.12          SingleCellExperiment_1.7.11 
 [7] SummarizedExperiment_1.15.10 DelayedArray_0.11.8         
 [9] BiocParallel_1.19.5          matrixStats_0.55.0          
[11] Biobase_2.45.1               GenomicRanges_1.37.17       
[13] GenomeInfoDb_1.21.2          IRanges_2.19.18             
[15] S4Vectors_0.23.25            BiocGenerics_0.31.6         
[17] BiocFileCache_1.9.1          dbplyr_1.4.2                

loaded via a namespace (and not attached):
 [1] viridis_0.5.1               httr_1.4.1                 
 [3] edgeR_3.27.14               BiocSingular_1.1.7         
 [5] jsonlite_1.6                foreach_1.4.7              
 [7] bit64_0.9-7                 viridisLite_0.3.0          
 [9] R.utils_2.9.0               assertthat_0.2.1           
[11] dqrng_0.2.1                 blob_1.2.0                 
[13] GenomeInfoDbData_1.2.2      vipor_0.4.5                
[15] ggrepel_0.8.1               corrplot_0.84              
[17] pillar_1.4.2                RSQLite_2.1.2              
[19] backports_1.1.5             lattice_0.20-38            
[21] reticulate_1.13             glue_1.3.1                 
[23] limma_3.41.18               digest_0.6.22              
[25] RColorBrewer_1.1-2          XVector_0.25.0             
[27] colorspace_1.4-1            cowplot_1.0.0              
[29] plyr_1.8.4                  Matrix_1.2-17              
[31] R.oo_1.22.0                 pkgconfig_2.0.3            
[33] pheatmap_1.0.12             zlibbioc_1.31.0            
[35] purrr_0.3.3                 scales_1.0.0               
[37] HDF5Array_1.13.11           MultiAssayExperiment_1.11.9
[39] tibble_2.1.3                withr_2.1.2                
[41] lazyeval_0.2.2              magrittr_1.5               
[43] crayon_1.3.4                memoise_1.1.0              
[45] R.methodsS3_1.7.1           doParallel_1.0.15          
[47] beeswarm_0.2.3              tools_3.6.0                
[49] stringr_1.4.0               Rhdf5lib_1.7.6             
[51] munsell_0.5.0               locfit_1.5-9.1             
[53] irlba_2.3.3                 compiler_3.6.0             
[55] rsvd_1.0.2                  rlang_0.4.0                
[57] rhdf5_2.29.6                grid_3.6.0                 
[59] RCurl_1.95-4.12             iterators_1.0.12           
[61] BiocNeighbors_1.3.5         rappdirs_0.3.1             
[63] bitops_1.0-6                codetools_0.2-16           
[65] gtable_0.3.0                DBI_1.0.0                  
[67] curl_4.2                    reshape2_1.4.3             
[69] R6_2.4.0                    gridExtra_2.3              
[71] dplyr_0.8.3                 bit_1.1-14                 
[73] zeallot_0.1.0               stringi_1.4.3              
[75] ggbeeswarm_0.6.0            Rcpp_1.0.2                 
[77] vctrs_0.2.0                 tidyselect_0.2.5           

Hi Aaron,
MOFA v1 was not aimed at >1000 samples. You will find not just memory but also speed issues.
We are about to release the new version that is tailored to single-cell data and which should significantly improve this. I'll invite you to the repository

I'll continue this discussion off-line.

@rargelaguet could you invite me to that repo as well? thanks!

I am still preparing the vignettes/documentation, I will release it in the next couple of days. Can it wait until Monday next week?

how cool! yes, I'd be happy to wait until Monday of next week. thanks again @rargelaguet!