satijalab / seurat

R toolkit for single cell genomics

Home Page: http://www.satijalab.org/seurat

Repository from Github https://github.com/satijalab/seurat

Read10X() unnecessarily slow as it requires Matrix::readMM()

freeseek opened this issue

Motivation

The Read10X() function in R/preprocessing.R reads the matrix.mtx file using Matrix::readMM(), even though it is designed to read the smaller features.tsv and barcodes.tsv files using data.table::fread() when the data.table namespace is available. However, Matrix::readMM() is really slow as it is not multi-threaded. If I replace the following line:

data <- readMM(file = matrix.loc)

with the following two lines:

dt <- data.table::fread(matrix.loc, sep = '\t', header = TRUE, skip = 1, colClasses = c('integer', 'integer', 'numeric'), data.table = FALSE)
data <- new('dgTMatrix', i = dt[,1] - 1L, j = dt[,2] - 1L, x = dt[,3], Dim = as.integer(c(names(dt)[1], names(dt)[2])))

then reading the matrix.mtx file becomes multi-threaded and much faster (>4x faster on my 6-core laptop).
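To make the header trick in those two lines explicit: with header = TRUE, the size line of the MatrixMarket file (rows, columns, number of non-zero entries) ends up as the column names, and the 1-based indices in the file are shifted to the 0-based i and j slots of a dgTMatrix. A minimal sketch, using a hand-built data frame with made-up values as a stand-in for the fread() result:

```r
library(Matrix)

# Stand-in for what fread(..., header = TRUE) would return for a tiny file
# whose size line is "5 4 3" (5 rows, 4 columns, 3 non-zero entries).
# All values are made up for illustration.
dt <- data.frame(V1 = c(1L, 2L, 5L), V2 = c(1L, 3L, 4L), V3 = c(1, 5, 2))
names(dt) <- c("5", "4", "3")

# The names of the first two columns carry the matrix dimensions
dims <- as.integer(c(names(dt)[1], names(dt)[2]))

# MatrixMarket indices are 1-based; the i and j slots of dgTMatrix are 0-based
data <- new("dgTMatrix", i = dt[, 1] - 1L, j = dt[, 2] - 1L, x = dt[, 3], Dim = dims)

# The same matrix built through the 1-based sparseMatrix() constructor
expected <- sparseMatrix(i = c(1, 2, 5), j = c(1, 3, 4), x = c(1, 5, 2), dims = c(5, 4))
```

If the two constructions agree, as.matrix(data) and as.matrix(expected) hold the same values.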

Feature Description

The Read10X() function from R/preprocessing.R should use data.table::fread() rather than Matrix::readMM() to read the matrix.mtx file when the data.table namespace is available, to achieve a significant speed improvement.

Alternatives

No response

Hi @freeseek,

Not a member of the dev team, but just wanted to chime in.

I'm having two issues replicating your code.
Running the first line of your code as written gives the following error:

colClasses= is an unnamed vector of types, length 3, but there are 1 columns in the input.

If I change colClasses so that it's just numeric, then it works, albeit with this warning:

Warning message:
In data.table::fread("/PATH/matrix.mtx.gz",  :
  Attempt to override column 1 <<%metadata_json: {"software_version": "cellranger-5.0.0", "format_version": 2}>> of inherent type 'string' down to 'float64' ignored. Only overrides to a higher type are currently supported. If this was intended, please coerce to the lower type afterwards.

Then when trying to run your second line of code I get the following error:

Error in dt[, 1] - 1L : non-numeric argument to binary operator

Best,
Sam

Maybe the format of matrix.mtx.gz is not as standard as I thought ... I am quite new to all of this. This is what I have, for reference:

$ zcat matrix.mtx.gz | head
%%MatrixMarket matrix coordinate integer general
33999	184268	96378984
1	4	1
1	6	3
1	14	1
1	26	1
1	56	1
1	72	1
1	104	1
1	141	1

And the following commands run without errors:

$ R
> library(data.table)
data.table 1.15.4 using 6 threads (see ?getDTthreads).  Latest news: r-datatable.com
> library(Matrix)
> matrix.loc <- 'matrix.mtx.gz'
> dt <- data.table::fread(matrix.loc, sep = '\t', header = TRUE, skip = 1, colClasses = c('integer', 'integer', 'numeric'), data.table = FALSE)
> data <- new('dgTMatrix', i = dt[,1] - 1L, j = dt[,2] - 1L, x = dt[,3], Dim = as.integer(c(names(dt)[1], names(dt)[2])))

My guess is that you have a different header in your matrix which is affecting fread. Maybe you could try using this code instead:

dt <- data.table::fread(cmd = paste('zgrep -v ^%', matrix.loc), sep = '\t', header = TRUE, colClasses = c('integer', 'integer', 'numeric'), data.table = FALSE)
data <- new('dgTMatrix', i = dt[,1] - 1L, j = dt[,2] - 1L, x = dt[,3], Dim = as.integer(c(names(dt)[1], names(dt)[2])))

This would be less portable, though. Unfortunately, data.table::fread() does not have an option like comment.char = "#", as utils::read.table() does.
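For comparison, the utils::read.table() route mentioned above can be sketched as follows. It is single-threaded, so it is not a speed fix, but it suggests that comment.char = "%" copes with a varying number of comment lines and with tab or space delimiters. The helper name read_mtx_portable is hypothetical and the two tiny files are made up for illustration.

```r
library(Matrix)

# Hypothetical helper (not part of Seurat): read a coordinate-format
# MatrixMarket file with base R only. Lines starting with "%" are dropped as
# comments, so the size line becomes the header no matter how many comment
# lines precede it.
read_mtx_portable <- function(path) {
  dt <- utils::read.table(path, header = TRUE, comment.char = "%", quote = "",
                          check.names = FALSE,
                          colClasses = c("integer", "integer", "numeric"))
  new("dgTMatrix", i = dt[, 1] - 1L, j = dt[, 2] - 1L, x = dt[, 3],
      Dim = as.integer(names(dt)[1:2]))
}

# Made-up file 1: tab-delimited, a single comment line
f_tab <- tempfile(fileext = ".mtx")
writeLines(c("%%MatrixMarket matrix coordinate integer general",
             "5\t4\t3", "1\t1\t1", "2\t3\t5", "5\t4\t2"), f_tab)

# Made-up file 2: space-delimited, with an extra metadata comment line
f_space <- tempfile(fileext = ".mtx")
writeLines(c("%%MatrixMarket matrix coordinate integer general",
             '%metadata_json: {"format_version": 2}',
             "5 4 3", "1 1 1", "2 3 5", "5 4 2"), f_space)

m_tab <- read_mtx_portable(f_tab)
m_space <- read_mtx_portable(f_space)
```

Both results can be compared against Matrix::readMM() on the same files to check that the two readers agree.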

Hi @freeseek,

So that code still errors because fread only sees 1 column in the input matrix file. Is the matrix file you are using to test this from Cell Ranger output? I tested on my personal files and, for a better reprex, also on this public dataset from 10X (https://www.10xgenomics.com/datasets/5k-hgmm-5p-nextgem), and got the same issue.

Best,
Sam

I am unable to download data from the link. Can you show what the first lines of your matrix.mtx.gz file look like?

user@comp_name sample_filtered_feature_bc_matrix % gunzip -c matrix.mtx.gz | head
%%MatrixMarket matrix coordinate integer general
%metadata_json: {"software_version": "cellranger-8.0.0", "format_version": 2}
72302 6082 30063008
78 1 1
184 1 1
231 1 1
552 1 1
631 1 1
765 1 1
953 1 1

And then when using modified code you posted:

> dt <- data.table::fread(cmd = paste('zgrep -v ^%', "~/Downloads/sample_filtered_feature_bc_matrix/matrix.mtx.gz"), sep = '\t', header = TRUE, colClasses = c('integer', 'integer', 'numeric'), data.table = FALSE)
Error in data.table::fread(cmd = paste("zgrep -v ^%", "~/Downloads/sample_filtered_feature_bc_matrix/matrix.mtx.gz"),  : 
  colClasses= is an unnamed vector of types, length 3, but there are 1 columns in the input. To specify types for a subset of columns, you can use a named vector, list format, or specify types using select= instead of colClasses=. Please see examples in ?fread.

Also including sessionInfo here just for reference, in case it's a package version issue, but I don't think so.

> sessionInfo()
R version 4.4.0 (2024-04-24)
Platform: x86_64-apple-darwin20
Running under: macOS Monterey 12.7.6

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] Seurat_5.2.1       SeuratObject_5.0.2 sp_2.2-0           lubridate_1.9.4    forcats_1.0.0      stringr_1.5.1      dplyr_1.1.4        purrr_1.0.4        readr_2.1.5        tidyr_1.3.1       
[11] tibble_3.2.1       ggplot2_3.5.1      tidyverse_2.0.0    tictoc_1.2.1       Matrix_1.7-2      

loaded via a namespace (and not attached):
  [1] deldir_2.0-4            pbapply_1.7-2           gridExtra_2.3           rlang_1.1.5             magrittr_2.0.3          RcppAnnoy_0.0.22        spatstat.geom_3.3-5    
  [8] matrixStats_1.5.0       ggridges_0.5.6          compiler_4.4.0          png_0.1-8               vctrs_0.6.5             reshape2_1.4.4          pkgconfig_2.0.3        
 [15] fastmap_1.2.0           utf8_1.2.4              promises_1.3.2          tzdb_0.4.0              jsonlite_1.9.0          goftest_1.2-3           later_1.4.1            
 [22] spatstat.utils_3.1-2    irlba_2.3.5.1           parallel_4.4.0          cluster_2.1.8           R6_2.6.1                ica_1.0-3               spatstat.data_3.1-4    
 [29] stringi_1.8.4           RColorBrewer_1.1-3      reticulate_1.41.0       spatstat.univar_3.1-1   parallelly_1.42.0       lmtest_0.9-40           scattermore_1.2        
 [36] Rcpp_1.0.14             tensor_1.5              future.apply_1.11.3     zoo_1.8-13              R.utils_2.13.0          sctransform_0.4.1.9001  httpuv_1.6.15          
 [43] splines_4.4.0           igraph_2.1.4            timechange_0.3.0        tidyselect_1.2.1        abind_1.4-8             rstudioapi_0.17.1       spatstat.random_3.3-2  
 [50] codetools_0.2-20        miniUI_0.1.1.1          spatstat.explore_3.3-4  listenv_0.9.1           lattice_0.22-6          plyr_1.8.9              shiny_1.10.0           
 [57] withr_3.0.2             ROCR_1.0-11             Rtsne_0.17              future_1.34.0           fastDummies_1.7.5       survival_3.8-3          polyclip_1.10-7        
 [64] fitdistrplus_1.2-2      pbmc3k.SeuratData_3.1.4 pillar_1.10.1           KernSmooth_2.23-26      plotly_4.10.4           generics_0.1.3          RcppHNSW_0.6.0         
 [71] hms_1.1.3               munsell_0.5.1           scales_1.3.0            globals_0.16.3          xtable_1.8-4            glue_1.8.0              lazyeval_0.2.2         
 [78] tools_4.4.0             data.table_1.17.0       RSpectra_0.16-2         RANN_2.6.2              dotCall64_1.2           cowplot_1.1.3           grid_4.4.0             
 [85] colorspace_2.1-1        nlme_3.1-167            patchwork_1.3.0         cli_3.6.4               spatstat.sparse_3.1-0   spam_2.11-1             viridisLite_0.4.2      
 [92] uwot_0.2.3              gtable_0.3.6            R.methodsS3_1.8.2       digest_0.6.37           progressr_0.15.1        ggrepel_0.9.6           htmlwidgets_1.6.4      
 [99] farver_2.1.2            htmltools_0.5.8.1       R.oo_1.27.0             lifecycle_1.0.4         httr_1.4.7              mime_0.12               MASS_7.3-64            
> 

Okay, I believe the main issue here is that my team rewrote the way the matrix.mtx.gz file is generated and did not quite nail exactly how Cell Ranger creates it. There also does not seem to be an explanation for how Cell Ranger writes that file, though here there is an example that looks just like yours. Try the following code instead:

dt <- data.table::fread(matrix.loc, sep = ' ', header = TRUE, skip = 2, colClasses = c('integer', 'integer', 'numeric'), data.table = FALSE)
data <- new('dgTMatrix', i = dt[,1] - 1L, j = dt[,2] - 1L, x = dt[,3], Dim = as.integer(c(names(dt)[1], names(dt)[2])))

This assumes that there are exactly two comment lines that start with % and that a space is the delimiter between columns. That layout is not mandatory in the MatrixMarket format, though, so I do not know for sure whether that's what Cell Ranger does all the time.

Hi @freeseek,

Gotcha, thanks for checking this! Yes, that code now works with Cell Ranger files. It does appear, at least for now (though I can't validate how far back in Cell Ranger versions this goes), that the header is set to that (https://github.com/10XGenomics/cellranger/blob/6ebad209b8354353b4a9ee3eed1cb248d102af88/lib/rust/cr_lib/src/stages/write_matrix_market.rs).

However, for safety/portability, maybe keeping readMM in the Read10X function is the safer choice in the long run, even if it is slightly slower, while Read10X_h5 provides speed?

Best,
Sam

The implementation I gave might need to be amended for full portability. This could work:

if (has_dt) {
  skip_lines <- peek_lines <- 2
  while (skip_lines == peek_lines) {
    peek_lines <- peek_lines * 2
    peek <- data.table::fread(matrix.loc, nrows = peek_lines, header = FALSE, fill = TRUE, sep = "\n", quote = "")
    skip_lines <- sum(grepl("^%", peek[[1]]))
  }
  dt <- data.table::fread(matrix.loc, header = TRUE, skip = skip_lines, colClasses = c('integer', 'integer', 'numeric'), data.table = FALSE)
  data <- new('dgTMatrix', i = dt[,1] - 1L, j = dt[,2] - 1L, x = dt[,3], Dim = as.integer(c(names(dt)[1], names(dt)[2])))
  remove(dt)
} else {
  data <- readMM(file = matrix.loc)
}

This will peek into the MatrixMarket file until it understands how many comment lines need to be skipped. For a regular Cell Ranger file it will only peek once but if there are more comment lines it will peek more. This is also flexible with column delimiters, so if the table uses tab delimiters instead of spaces it will still be okay. This implementation is approximately three times faster on my laptop than using Matrix::readMM(). It makes me wonder ... if properly implemented, how much slower than Seurat::Read10X_h5() would Seurat::Read10X() really be?
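The peeking step can also be tried on its own. The sketch below is a base-R variant, not the code above: it uses readLines() on a gzfile() connection instead of fread() (an assumption made so that it runs without data.table) and counts only the leading comment lines. It doubles the number of lines inspected until a non-comment line is visible. count_comment_lines is a hypothetical name and the test files are made up.

```r
# Hypothetical helper: count the leading "%" comment lines of a MatrixMarket
# file. gzfile() reads both gzip-compressed and plain-text files.
count_comment_lines <- function(path, start = 2L) {
  n <- start
  repeat {
    con <- gzfile(path, "rt")
    head_lines <- readLines(con, n = n)
    close(con)
    is_comment <- startsWith(head_lines, "%")
    # Stop once a non-comment line is visible or the whole file has been read
    if (!all(is_comment) || length(head_lines) < n) {
      break
    }
    n <- n * 2L
  }
  # Position of the first non-comment line, minus one
  first_data <- which(!is_comment)[1]
  if (is.na(first_data)) length(head_lines) else first_data - 1L
}

# Made-up plain-text file with one comment line
f1 <- tempfile(fileext = ".mtx")
writeLines(c("%%MatrixMarket matrix coordinate integer general",
             "5 4 1", "1 1 1"), f1)

# Made-up gzip-compressed file with five comment lines
f5 <- tempfile(fileext = ".mtx.gz")
con <- gzfile(f5, "wt")
writeLines(c("%%MatrixMarket matrix coordinate integer general",
             rep("% extra comment", 4), "5 4 1", "1 1 1"), con)
close(con)

n1 <- count_comment_lines(f1)
n5 <- count_comment_lines(f5)
```

The returned count is what would be passed as skip = to fread() with header = TRUE, so that the size line is read as the header.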

Hi @freeseek,

Yeah, that looks like it would work. I would suggest putting it together in a PR, and the Seurat team can decide whether to implement it.

Best,
Sam

Actually, I have realized the implementation I gave was too slow. This implementation is simpler and should be twice as fast:

if (has_dt) {
  dt <- data.table::fread(file, header = TRUE, skip = 1, colClasses = c("integer", "integer", "numeric"), quote = "", data.table = FALSE)
  data <- new("dgTMatrix", i = dt[,1] - 1L, j = dt[,2] - 1L, x = dt[,3], Dim = as.integer(c(names(dt)[1], names(dt)[2])))
  remove(dt)
} else {
  data <- Matrix::readMM(file)
}

It has only the minor drawback that some headers could throw data.table::fread off, but as long as the headers were generated using Cell Ranger's code it will be fine. I will make a PR for this.

Also, as an FYI: while it wouldn't help with the read speed of individual files, the following does help with overall speed when reading lots of files at once (and if the PR is submitted/accepted, it would get even faster).

I have a package, scCustomize, with customized wrappers around both Read10X and Read10X_h5 that can parallelize the reading of files from either a single directory or multiple sub-directories, along with other enhancements (see https://samuel-marsh.github.io/scCustomize/articles/Read_and_Write_Functions.html). It will append sample names, and it works even with files that have prefixes (i.e. those from GEO or elsewhere), which Read10X doesn't handle. The functions simply use mclapply under the hood, which can dramatically speed up overall read time when you have lots of files. They can either return a list of matrices or pass all of the files to a merge function in the package that will append sample names to the barcodes and return a single matrix.
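The general pattern, reading each sample in a separate worker and then tagging the columns with the sample name before merging, can be sketched with parallel::mclapply(). This is not scCustomize's implementation: the helper names, the "_" separator, and the use of Matrix::readMM() on made-up toy files are all assumptions for illustration.

```r
library(Matrix)
library(parallel)

# Hypothetical helper: read one matrix per named path, one worker per file.
# mclapply() relies on forking, so mc.cores must be 1 on Windows.
read_many <- function(paths, reader = Matrix::readMM, cores = 1L) {
  mats <- mclapply(unname(paths), reader, mc.cores = cores)
  names(mats) <- names(paths)
  mats
}

# Hypothetical helper: prefix column names with the sample name and bind the
# matrices column-wise. Assumes all samples share the same rows (features).
merge_samples <- function(mats) {
  tagged <- Map(function(m, sample) {
    m <- as(m, "CsparseMatrix")
    # Real data would carry barcodes here; these toy files have none, so the
    # column index stands in for the barcode.
    colnames(m) <- paste(sample, seq_len(ncol(m)), sep = "_")
    m
  }, mats, names(mats))
  do.call(cbind, unname(tagged))
}

# Two made-up samples written as tiny MatrixMarket files
write_toy <- function(lines) {
  path <- tempfile(fileext = ".mtx")
  writeLines(c("%%MatrixMarket matrix coordinate integer general", lines), path)
  path
}
paths <- c(A = write_toy(c("5 4 2", "1 1 1", "2 3 5")),
           B = write_toy(c("5 3 1", "4 2 7")))

cores <- if (.Platform$OS.type == "windows") 1L else 2L
merged <- merge_samples(read_many(paths, cores = cores))
```

Because the reader is passed in as an argument, a faster single-file reader would speed up every worker without changing the parallel wrapper.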

Best,
Sam

I'll go ahead and close this issue for now since you are submitting a PR.