Read10X() is unnecessarily slow because it relies on Matrix::readMM()
freeseek opened this issue · comments
Motivation
The function Read10X() in R/preprocessing.R reads the matrix.mtx file with Matrix::readMM(), even though it already reads the smaller features.tsv and barcodes.tsv files with data.table::fread() when the data.table namespace is available. Matrix::readMM() is slow because it is not multi-threaded. If I replace the following line:
data <- readMM(file = matrix.loc)
with the following two lines:
dt <- data.table::fread(matrix.loc, sep = '\t', header = TRUE, skip = 1, colClasses = c('integer', 'integer', 'numeric'), data.table = FALSE)
data <- new('dgTMatrix', i = dt[,1] - 1L, j = dt[,2] - 1L, x = dt[,3], Dim = as.integer(c(names(dt)[1], names(dt)[2])))
then reading the matrix.mtx file becomes multi-threaded and much faster (more than 4x faster on my 6-core laptop).
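For anyone who wants to check the speed-up on their own machine, here is a minimal sketch of a comparison on a synthetic matrix. Assumptions that are not part of the report above: the test matrix is generated with Matrix::rsparsematrix() and written with Matrix::writeMM(), which should produce a single header line and space-separated entries, so sep is left to fread's auto-detection instead of being forced to '\t'. Timings on such a small file are only indicative.

```r
library(Matrix)
library(data.table)

# Build a small random sparse matrix and write it in MatrixMarket format.
set.seed(1)
m <- rsparsematrix(nrow = 2000, ncol = 500, density = 0.05,
                   rand.x = function(n) as.numeric(sample.int(10L, n, replace = TRUE)))
matrix.loc <- tempfile(fileext = ".mtx")
invisible(writeMM(m, file = matrix.loc))

# Baseline: the reader currently used by Read10X().
t_mm <- system.time(data_mm <- readMM(file = matrix.loc))

# Proposed: fread() plus manual construction of the triplet matrix.
# The size line (rows, columns, non-zeros) is consumed as the header,
# so the dimensions are recovered from the column names.
t_dt <- system.time({
  dt <- data.table::fread(matrix.loc, header = TRUE, skip = 1,
                          colClasses = c("integer", "integer", "numeric"),
                          data.table = FALSE)
  data_dt <- new("dgTMatrix", i = dt[, 1] - 1L, j = dt[, 2] - 1L, x = dt[, 3],
                 Dim = as.integer(names(dt)[1:2]))
})

# Both readers should give the same matrix.
same <- identical(dim(data_mm), dim(data_dt)) && sum(abs(data_mm - data_dt)) == 0
print(same)
print(rbind(readMM = t_mm, fread = t_dt)[, "elapsed"])
```

A real Cell Ranger matrix.mtx.gz would be a better benchmark than this toy file; reading .gz files directly with fread() also needs the R.utils package.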
Feature Description
The Read10X() function from R/preprocessing.R should use data.table::fread() rather than Matrix::readMM() to read the matrix.mtx file when the data.table namespace is available, to achieve a significant speed improvement.
Alternatives
No response
Hi @freeseek,
I'm not a member of the dev team, but I wanted to chime in.
I'm having two issues replicating your issue/code.
Running the first line of your code as written gives the following error:
colClasses= is an unnamed vector of types, length 3, but there are 1 columns in the input.
If I change colClasses so that it's just 'numeric', then it runs, albeit with this warning:
Warning message:
In data.table::fread("/PATH/matrix.mtx.gz", :
Attempt to override column 1 <<%metadata_json: {"software_version": "cellranger-5.0.0", "format_version": 2}>> of inherent type 'string' down to 'float64' ignored. Only overrides to a higher type are currently supported. If this was intended, please coerce to the lower type afterwards.
Then when trying to run your second line of code I get the following error:
Error in dt[, 1] - 1L : non-numeric argument to binary operator
Best,
Sam
Maybe the format of matrix.mtx.gz is not as standard as I thought... I am quite new to all of this. This is what I have, for reference:
$ zcat matrix.mtx.gz | head
%%MatrixMarket matrix coordinate integer general
33999 184268 96378984
1 4 1
1 6 3
1 14 1
1 26 1
1 56 1
1 72 1
1 104 1
1 141 1
And the following commands run without errors:
$ R
> library(data.table)
data.table 1.15.4 using 6 threads (see ?getDTthreads). Latest news: r-datatable.com
> library(Matrix)
> matrix.loc <- 'matrix.mtx.gz'
> dt <- data.table::fread(matrix.loc, sep = '\t', header = TRUE, skip = 1, colClasses = c('integer', 'integer', 'numeric'), data.table = FALSE)
> data <- new('dgTMatrix', i = dt[,1] - 1L, j = dt[,2] - 1L, x = dt[,3], Dim = as.integer(c(names(dt)[1], names(dt)[2])))
My guess is that you have a different header in your matrix which is affecting fread. Maybe you could try using this code instead:
dt <- data.table::fread(cmd = paste('zgrep -v ^%', matrix.loc), sep = '\t', header = TRUE, colClasses = c('integer', 'integer', 'numeric'), data.table = FALSE)
data <- new('dgTMatrix', i = dt[,1] - 1L, j = dt[,2] - 1L, x = dt[,3], Dim = as.integer(c(names(dt)[1], names(dt)[2])))
This would be less portable, though. Unfortunately, data.table::fread() does not have an option such as comment.char = "#", as utils::read.table() does.
Hi @freeseek,
That code still errors because fread sees only 1 column in the input matrix file. Is the matrix file you are testing with from Cell Ranger output? I tested on my personal files and, for a better reprex, on this public dataset from 10X (https://www.10xgenomics.com/datasets/5k-hgmm-5p-nextgem), and got the same issue with both.
Best,
Sam
I am unable to download data from the link. Can you show what the first lines of your matrix.mtx.gz file look like?
user@comp_name sample_filtered_feature_bc_matrix % gunzip -c matrix.mtx.gz | head
%%MatrixMarket matrix coordinate integer general
%metadata_json: {"software_version": "cellranger-8.0.0", "format_version": 2}
72302 6082 30063008
78 1 1
184 1 1
231 1 1
552 1 1
631 1 1
765 1 1
953 1 1
And then when using modified code you posted:
> dt <- data.table::fread(cmd = paste('zgrep -v ^%', "~/Downloads/sample_filtered_feature_bc_matrix/matrix.mtx.gz"), sep = '\t', header = TRUE, colClasses = c('integer', 'integer', 'numeric'), data.table = FALSE)
Error in data.table::fread(cmd = paste("zgrep -v ^%", "~/Downloads/sample_filtered_feature_bc_matrix/matrix.mtx.gz"), :
colClasses= is an unnamed vector of types, length 3, but there are 1 columns in the input. To specify types for a subset of columns, you can use a named vector, list format, or specify types using select= instead of colClasses=. Please see examples in ?fread.
I'm also including sessionInfo() here for reference, in case it's a package version issue, though I don't think so.
> sessionInfo()
R version 4.4.0 (2024-04-24)
Platform: x86_64-apple-darwin20
Running under: macOS Monterey 12.7.6
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: America/New_York
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] Seurat_5.2.1 SeuratObject_5.0.2 sp_2.2-0 lubridate_1.9.4 forcats_1.0.0 stringr_1.5.1 dplyr_1.1.4 purrr_1.0.4 readr_2.1.5 tidyr_1.3.1
[11] tibble_3.2.1 ggplot2_3.5.1 tidyverse_2.0.0 tictoc_1.2.1 Matrix_1.7-2
loaded via a namespace (and not attached):
[1] deldir_2.0-4 pbapply_1.7-2 gridExtra_2.3 rlang_1.1.5 magrittr_2.0.3 RcppAnnoy_0.0.22 spatstat.geom_3.3-5
[8] matrixStats_1.5.0 ggridges_0.5.6 compiler_4.4.0 png_0.1-8 vctrs_0.6.5 reshape2_1.4.4 pkgconfig_2.0.3
[15] fastmap_1.2.0 utf8_1.2.4 promises_1.3.2 tzdb_0.4.0 jsonlite_1.9.0 goftest_1.2-3 later_1.4.1
[22] spatstat.utils_3.1-2 irlba_2.3.5.1 parallel_4.4.0 cluster_2.1.8 R6_2.6.1 ica_1.0-3 spatstat.data_3.1-4
[29] stringi_1.8.4 RColorBrewer_1.1-3 reticulate_1.41.0 spatstat.univar_3.1-1 parallelly_1.42.0 lmtest_0.9-40 scattermore_1.2
[36] Rcpp_1.0.14 tensor_1.5 future.apply_1.11.3 zoo_1.8-13 R.utils_2.13.0 sctransform_0.4.1.9001 httpuv_1.6.15
[43] splines_4.4.0 igraph_2.1.4 timechange_0.3.0 tidyselect_1.2.1 abind_1.4-8 rstudioapi_0.17.1 spatstat.random_3.3-2
[50] codetools_0.2-20 miniUI_0.1.1.1 spatstat.explore_3.3-4 listenv_0.9.1 lattice_0.22-6 plyr_1.8.9 shiny_1.10.0
[57] withr_3.0.2 ROCR_1.0-11 Rtsne_0.17 future_1.34.0 fastDummies_1.7.5 survival_3.8-3 polyclip_1.10-7
[64] fitdistrplus_1.2-2 pbmc3k.SeuratData_3.1.4 pillar_1.10.1 KernSmooth_2.23-26 plotly_4.10.4 generics_0.1.3 RcppHNSW_0.6.0
[71] hms_1.1.3 munsell_0.5.1 scales_1.3.0 globals_0.16.3 xtable_1.8-4 glue_1.8.0 lazyeval_0.2.2
[78] tools_4.4.0 data.table_1.17.0 RSpectra_0.16-2 RANN_2.6.2 dotCall64_1.2 cowplot_1.1.3 grid_4.4.0
[85] colorspace_2.1-1 nlme_3.1-167 patchwork_1.3.0 cli_3.6.4 spatstat.sparse_3.1-0 spam_2.11-1 viridisLite_0.4.2
[92] uwot_0.2.3 gtable_0.3.6 R.methodsS3_1.8.2 digest_0.6.37 progressr_0.15.1 ggrepel_0.9.6 htmlwidgets_1.6.4
[99] farver_2.1.2 htmltools_0.5.8.1 R.oo_1.27.0 lifecycle_1.0.4 httr_1.4.7 mime_0.12 MASS_7.3-64
>
Okay, I believe the main issue here is that my team rewrote the way the matrix.mtx.gz file is generated and did not quite nail exactly how Cell Ranger creates it. There also does not seem to be an explanation for how Cell Ranger writes that file, though here there is an example that looks just like yours. Try the following code instead:
dt <- data.table::fread(matrix.loc, sep = ' ', header = TRUE, skip = 2, colClasses = c('integer', 'integer', 'numeric'), data.table = FALSE)
data <- new('dgTMatrix', i = dt[,1] - 1L, j = dt[,2] - 1L, x = dt[,3], Dim = as.integer(c(names(dt)[1], names(dt)[2])))
This assumes that there are exactly two comment lines starting with % and that a space is the delimiter between columns. The MatrixMarket format does not mandate a space, so I cannot know for sure whether that's what Cell Ranger does all the time.
Hi @freeseek,
Gotcha, thanks for checking this! Yes, that code now works with Cell Ranger files. It does appear, at least for now (though I can't validate how far back in Cell Ranger versions this goes), that the header is fixed to that format (https://github.com/10XGenomics/cellranger/blob/6ebad209b8354353b4a9ee3eed1cb248d102af88/lib/rust/cr_lib/src/stages/write_matrix_market.rs).
However, for safety/portability, maybe keeping readMM in the Read10X function, while slightly slower, is safer in the long run, with Read10X_h5 providing the speed?
Best,
Sam
The implementation I gave might need to be amended for full portability. This could work:
if (has_dt) {
  skip_lines <- peek_lines <- 2
  while (skip_lines == peek_lines) {
    peek_lines <- peek_lines * 2
    peek <- data.table::fread(matrix.loc, nrows = peek_lines, header = FALSE, fill = TRUE, sep = "\n", quote = "")
    skip_lines <- sum(grepl("^%", peek[[1]]))
  }
  dt <- data.table::fread(matrix.loc, header = TRUE, skip = skip_lines, colClasses = c('integer', 'integer', 'numeric'), data.table = FALSE)
  data <- new('dgTMatrix', i = dt[,1] - 1L, j = dt[,2] - 1L, x = dt[,3], Dim = as.integer(c(names(dt)[1], names(dt)[2])))
  remove(dt)
} else {
  data <- readMM(file = matrix.loc)
}
This will peek into the MatrixMarket file until it understands how many comment lines need to be skipped. For a regular Cell Ranger file it will only peek once but if there are more comment lines it will peek more. This is also flexible with column delimiters, so if the table uses tab delimiters instead of spaces it will still be okay. This implementation is approximately three times faster on my laptop than using Matrix::readMM(). It makes me wonder ... if properly implemented, how much slower than Seurat::Read10X_h5() would Seurat::Read10X() really be?
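As a point of comparison, the number of leading comment lines can also be counted with base R only. The following is a rough sketch, not the code proposed for Seurat: count_comment_lines() and its chunk argument are hypothetical names, and it relies on gzfile() being able to read both gzip-compressed and plain text files.

```r
# Count the comment lines (starting with "%") at the top of a MatrixMarket
# file, reading a few lines at a time so large files are never fully loaded.
count_comment_lines <- function(path, chunk = 16L) {
  con <- gzfile(path, open = "rt")  # also reads uncompressed files
  on.exit(close(con))
  n <- 0L
  repeat {
    lines <- readLines(con, n = chunk)
    if (length(lines) == 0L) break          # reached end of file
    first_data <- which(!startsWith(lines, "%"))[1]
    if (!is.na(first_data)) {
      return(n + first_data - 1L)           # comments seen before first data line
    }
    n <- n + length(lines)                  # whole chunk was comments, keep going
  }
  n
}

# Example with a Cell Ranger style header (two comment lines).
example_file <- tempfile(fileext = ".mtx")
writeLines(c("%%MatrixMarket matrix coordinate integer general",
             "%metadata_json: {}",
             "3 2 1",
             "1 1 5"), example_file)
skip_lines <- count_comment_lines(example_file)
```

The result could then be passed as skip = skip_lines to data.table::fread(). Unlike the peek loop above, this only counts comment lines that precede the first data line, at the cost of a second pass over the start of the file.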
Hi @freeseek,
Yes, that looks like it would work. I would suggest putting it together in a PR so the Seurat team can decide whether to implement it.
Best,
Sam
Actually, I have realized the implementation I gave was too slow. This implementation is simpler and should be twice as fast:
if (has_dt) {
  dt <- data.table::fread(file, header = TRUE, skip = 1, colClasses = c("integer", "integer", "numeric"), quote = "", data.table = FALSE)
  data <- new("dgTMatrix", i = dt[,1] - 1L, j = dt[,2] - 1L, x = dt[,3], Dim = as.integer(c(names(dt)[1], names(dt)[2])))
  remove(dt)
} else {
  data <- Matrix::readMM(file)
}
Its only minor drawback is that some headers could throw data.table::fread off, but as long as the headers were generated using Cell Ranger's code it will be fine. I will make a PR for this.
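One way to hedge against unusual headers would be to validate what fread returns and fall back to Matrix::readMM() when anything looks off. This is only a sketch of that idea, not part of the planned PR: read_mtx_guarded() is a hypothetical name, and treating every fread warning as a reason to fall back is an assumption that may be too strict in practice.

```r
library(Matrix)

# Try the fast fread() path first; fall back to readMM() on any error,
# warning, or inconsistency between the size line and the parsed table.
read_mtx_guarded <- function(path) {
  fallback <- function(cond) Matrix::readMM(path)
  tryCatch({
    dt <- data.table::fread(path, header = TRUE, skip = 1,
                            colClasses = c("integer", "integer", "numeric"),
                            quote = "", data.table = FALSE)
    # The size line "rows cols nnz" was consumed as the header.
    hdr <- suppressWarnings(as.numeric(names(dt)))
    stopifnot(ncol(dt) == 3L,       # coordinate format: i, j, x
              !anyNA(hdr),          # header fields must all be numbers
              nrow(dt) == hdr[3])   # entry count must match nnz
    new("dgTMatrix", i = dt[, 1] - 1L, j = dt[, 2] - 1L, x = dt[, 3],
        Dim = as.integer(hdr[1:2]))
  }, warning = fallback, error = fallback)
}

# Example on a file written by Matrix::writeMM().
set.seed(42)
m <- rsparsematrix(nrow = 300, ncol = 40, density = 0.1,
                   rand.x = function(n) as.numeric(sample.int(9L, n, replace = TRUE)))
mtx_file <- tempfile(fileext = ".mtx")
invisible(writeMM(m, file = mtx_file))
guarded <- read_mtx_guarded(mtx_file)
```

The checks cost almost nothing compared with parsing the file, so the fast path stays fast while odd files still load through the slower reader.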
Also, as an FYI: while it wouldn't help with the read speed of individual files, the following does help with overall speed when reading lots of files at once (and if the PR is submitted and accepted, it would get even faster).
I have a package, scCustomize, that has customized wrappers around both Read10X and Read10X_h5 that can parallelize the reading of files either from a single directory or from multiple sub-directories, along with other enhancements (see https://samuel-marsh.github.io/scCustomize/articles/Read_and_Write_Functions.html). It will append sample names and works even with files that have prefixes (i.e. those from GEO or elsewhere), which Read10X doesn't do. The functions simply use mclapply under the hood, which can dramatically speed up overall read time when you have lots of files. They can either return a list of matrices or pass all of the files to a merge function in the package that will append sample names to the barcodes and return a single matrix.
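To illustrate the general idea (this is a simplified sketch, not the actual scCustomize code), parallel reading with mclapply can look like the following. The helper read_many() and its arguments are hypothetical; any reader function, such as Seurat::Read10X, could be passed in.

```r
library(parallel)

# Apply `reader` to each sample directory in parallel and name the
# results after the directories.
read_many <- function(dirs, reader, cores = 2L) {
  # mclapply relies on forking, which is not available on Windows.
  if (.Platform$OS.type == "windows") cores <- 1L
  out <- mclapply(dirs, reader, mc.cores = cores)
  names(out) <- basename(dirs)
  out
}

# Toy usage with placeholder directories and a stand-in reader that just
# lists the files in each directory.
sample_dirs <- file.path(tempdir(), c("sampleA", "sampleB"))
for (d in sample_dirs) dir.create(d, showWarnings = FALSE)
res <- read_many(sample_dirs, reader = list.files)

# With Seurat installed, the call would look something like:
# matrices <- read_many(sample_dirs, reader = Seurat::Read10X)
```

Each worker reads one directory, so the gain depends on the number of samples and available cores rather than on the speed of any single read.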
Best,
Sam
I'll go ahead and close this issue for now since you are submitting a PR.