LTLA / BiocSingular

Clone of the Bioconductor repository for the BiocSingular package.

Home Page: https://bioconductor.org/packages/devel/bioc/html/BiocSingular.html

DelayedArray and SVD questions

brgew opened this issue · comments

commented

Hi,

I am considering using the BiocSingular, DelayedArray, and HDF5Array packages for initial processing of large single-cell data sets where I use on-disk storage of the expression matrices.

I wonder if you might be willing to answer some of my questions, and if so, where you prefer that I post them.

As some background, I have run timing tests in which I begin processing from an sci-RNA-seq counts matrix, estimate size factors, normalize counts, calculate column means and variances, and then run singular value decomposition. It became clear that the SVD is the bottleneck so I saved an object with the elements required for the SVD and limited timing tests to the SVD.

I ran the SVD using a dgCMatrix sparse matrix passed to irlba::irlba in order to get a reference time. I followed this with tests in which I wrapped the dgCMatrix in a DelayedArray and passed it to irlba::irlba, to BiocSingular::runIrlbaSVD, and to BiocSingular::runRandomSVD. I also tried using HDF5Array::TENxMatrix as the DelayedArray seed. The run times for matrices wrapped in DelayedArray are substantially longer than the runs using sparse matrices. My biggest concern is that I may be running these tests incorrectly.
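Roughly, the comparison looked like the sketch below (illustrative only: counts stands in for the actual dgCMatrix of expression counts, and the timings are what I compared):

library(Matrix)
library(DelayedArray)
library(BiocSingular)

k <- 20

# Reference: the naked sparse matrix passed straight to irlba.
t_ref <- system.time(irlba::irlba(counts, nv = k))

# The same matrix wrapped in a DelayedArray, passed to the same functions.
da <- DelayedArray(counts)
t_irlba  <- system.time(irlba::irlba(da, nv = k))
t_bs_irl <- system.time(runIrlbaSVD(da, k = k))
t_bs_rnd <- system.time(runRandomSVD(da, k = k))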

As a warning, I have a poor understanding of DelayedArrays so at least some of my questions may be basic.

Thank you.
Brent

Great to see someone diving deep into the intricacies of our matrix representations.

It's unlikely that you're doing anything wrong. I'll focus on the wrapping of the dgCMatrix in the DelayedArray; this will be slower than a naked dgCMatrix as DelayedArray (specifically DelayedMatrix) multiplication will coerce the sparse matrix into a dense form in a block-by-block manner. This is not quite as stupid as it seems because the DelayedArray framework knows nothing about the backend; the only safe thing it can do is to fall back to a dense ordinary matrix in order to perform the multiplication, which allows it to work for any backend (file-backed, cloud, etc.).

The cost of this generality is that it is much less efficient than working with a naked sparse matrix. Having said that, there is infrastructure within DelayedArray to support dedicated operations on sparse backends. It just hasn't been joined up to the matrix multiplication, not because it's hard but because it just hasn't been a priority. The thinking has been that if you're wrapping a sparse matrix in a DelayedArray, then you already have the sparse matrix in memory, so why not just use it directly? I am hoping to put some more time into this now that we are starting to see file-backed sparse representations for which sparse-aware multiplication will be truly beneficial.
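If you want to see the dense fallback for yourself, a quick check along these lines should do it (a sketch only, best run on a small matrix; mat is assumed to be a DelayedArray-wrapped dgCMatrix, and the exact behaviour may depend on your DelayedArray version):

library(DelayedArray)
# Each block handed to the callback is realized in memory; with the
# default settings these come through as ordinary (dense) matrices
# even though the underlying seed is sparse.
block_classes <- blockApply(mat, function(block) class(block)[1])
unique(unlist(block_classes))   # expected: "matrix"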

Now, BiocSingular uses the DelayedArray wrapping in its internals for another reason, and that's to take advantage of parallelized multiplication when BPPARAM is set appropriately. In theory this should be faster than serial execution on a naked dgCMatrix; in practice, because of the sparse-to-dense conversion, it's not. There is also considerable overhead from the parallelization that may offset any speed improvements; I would say that it is only a clear win for file-backed matrices, though again, this may further improve if we can skip the dense conversion.
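For example, the two modes can be compared directly (a sketch; mat is again the wrapped dgCMatrix, the worker count is arbitrary, and the relative timings will depend on your matrix and machine):

library(BiocSingular)
library(BiocParallel)
# Serial versus parallelized multiplication inside the SVD. In theory
# the second call should win; in practice the sparse-to-dense conversion
# and the parallelization overhead often eat up the gains.
t_serial   <- system.time(runIrlbaSVD(mat, k = 20, BPPARAM = SerialParam()))
t_parallel <- system.time(runIrlbaSVD(mat, k = 20, BPPARAM = MulticoreParam(workers = 4)))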

Obviously if your matrix is file-backed, you have the extra penalty of reading stuff from disk. We are also forced to store data in dense form in HDF5 arrays, so the sparse-to-dense cost still applies here. The only point of note is to make sure that your chunks are square-ish, which ensures that the row and column accesses are reasonably efficient. (This should already be the default.)
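If you are writing the HDF5 file yourself, the chunk geometry can be set explicitly when the data are written; a sketch (the file name and chunk dimensions are only illustrative, and the default chunk geometry should already be reasonable):

library(HDF5Array)
# Write a dense HDF5 representation with roughly square chunks, so that
# row-wise and column-wise access are both reasonably efficient.
hmat <- writeHDF5Array(mat, filepath = "matrix.h5", name = "counts",
                       chunkdim = c(1000, 1000))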

commented

Hi,

I am grateful for your clear background descriptions of DelayedArray and BiocSingular, and for the additional details on wrapping a dgCMatrix in a DelayedArray: I am gaining confidence in my testing and use of BiocSingular. (As an aside, I hoped that an in-memory dgCMatrix wrapped in a DelayedArray might not suffer significant performance penalties when submitted to functions such as irlba::irlba(). It would be sweet to avoid having essentially duplicated code for the DelayedArray and non-DA cases.)

I have still some questions.

Some particularly persistent questions involve the setAutoBlockSize() and getAutoBlockSize() functions in the DelayedArray package. I don't understand what I see; I suppose that I am abusing the packages.

First, I setAutoBlockSize(6.4e10) and ran runRandomSVD() on a relatively small sparse matrix. This finished successfully. Then I ran runRandomSVD() on a relatively large matrix. This failed with an error saying that automatic block length is too big and should be no larger than .Machine$integer.max. I wonder why the first run using the smaller data set did not fail. I see that the getAutoBlockSize() function tests the current block size but setAutoBlockSize() doesn't. The commands that I used are

** Run 1

library(Matrix)
library(DelayedArray)
library(BiocParallel)
library(BiocSingular)

setAutoBlockSize(6.4e10)
fnam <- 'matrix_1.mat'  # matrix_1 has 20222 rows, 35987 columns, and 16821230 non-zero elements
mat <- DelayedArray::DelayedArray( as( readMM(fnam), "dgCMatrix" ) )
k <- 20
res <- runRandomSVD(mat,k=k,deferred=TRUE,BPPARAM=MulticoreParam(workers=4))

This ran to completion.

** Run 2

library(Matrix)
library(DelayedArray)
library(BiocParallel)
library(BiocSingular)

setAutoBlockSize(6.4e10)
fnam <- 'matrix_2.mat'  # matrix_2 has 58347 rows, 292010 columns, and 501354699 non-zero elements
mat <- DelayedArray::DelayedArray( as( readMM(fnam), "dgCMatrix" ) )
k <- 20
res <- runRandomSVD(mat,k=k,deferred=TRUE,BPPARAM=MulticoreParam(workers=4))

This run failed with the message

Error in h(simpleError(msg, call)) :
  error in evaluating the argument 'x' in selecting a method for function 't': Automatic block length is too big. Blocks of length > .Machine$integer.max are not supported yet. Please reduce the automatic block length by reducing the
  automatic block size with setAutoBlockSize().

Running getAutoBlockSize() afterwards gives

> getAutoBlockSize()
[1] 34076398960

I ran

> .Machine$integer.max
[1] 2147483647

(I am puzzled about the auto block size getting reset internally to a value > .Machine$integer.max.)

Moving along, I seem to misunderstand how memory is used by DelayedArray/BiocSingular. I imagined that setting the auto block size to ~2 GB would limit the memory used during runs; however, when I ran runIrlbaSVD() on the relatively large sparse matrix after calling setAutoBlockSize(2147483647), the process used the machine's 78 GB of RAM and the kernel terminated it (OOM).

** Run 3

library(Matrix)
library(DelayedArray)
library(BiocParallel)
library(BiocSingular)

setAutoBlockSize(2147483647)
fnam <- '/home/brent/work/sciplex/jose_sanjay/mcf7_count_matrix.mtx.gz'  # matrix_2 has 58347 rows, 292010 columns, and 501354699 non-zero elements
mat <- DelayedArray::DelayedArray( as( readMM(fnam), "dgCMatrix" ) )
k <- 20
res <- runIrlbaSVD(mat,k=k,deferred=TRUE,BPPARAM=MulticoreParam(workers=4))

This run failed. (When I rerun it using 2 workers, it looks like each of the processes can use > 30 GB of RAM.)

> sessionInfo()
R version 4.0.1 (2020-06-06)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 10 (buster)

Matrix products: default
BLAS:   /usr/local/R/R401_sse2/lib/R/lib/libRblas.so
LAPACK: /usr/local/R/R401_sse2/lib/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] BiocSingular_1.4.0  BiocParallel_1.22.0 DelayedArray_0.14.0 IRanges_2.22.1      S4Vectors_0.26.0    BiocGenerics_0.34.0 matrixStats_0.56.0  Matrix_1.2-18

loaded via a namespace (and not attached):
[1] compiler_4.0.1  rsvd_1.0.3      Rcpp_1.0.4.6    grid_4.0.1      irlba_2.3.3     lattice_0.20-41

Incidentally, I noticed a paper entitled Out-of-Core Singular Value Decomposition (arXiv:1907.06470v1), which uses a block-oriented strategy and may work with DelayedArrays. Responding to my inquiry last month, an author, Vadim Demchik, wrote that ExB SVD could be open-sourced in a few months.

(As an aside, I hoped that an in-memory dgCMatrix wrapped in a DelayedArray might not suffer significant performance penalties when submitted to functions such as irlba::irlba(). It would be sweet to avoid having essentially duplicated code for the DelayedArray and non-DA cases.)

This will soon be the case once the sparse matrix capabilities of DelayedArray are optimized.

First, I setAutoBlockSize(6.4e10)

Woah, woah. That's a block that is 64 GB in size! Are you sure you want to do that?

It only "works" for a small sparse matrix because you have enough memory to realize that sparse matrix fully as a dense ordinary matrix. Which, of course, totally defeats the purpose.

I don't know why there's a .Machine$integer.max limit on the block size, but questions would need to be asked about why you're creating blocks of that size in the first place. Past a certain point, bigger blocks are not more efficient, especially if they require more work to allocate contiguous memory.

Perhaps @hpages may be able to shed some light on the technical details.

commented

Hi,

OK. I'll hold my horses until DelayedArray preserves sparsity.

Still, I want to get a feeling for how runIrlbaSVD() and runRandomSVD() compare and perform under various conditions. I am also looking at time/memory profiling strategies for R (I am not proficient in R; I haven't used it for development until recently).

That said, yes, I did want a 64 GB block size, but no longer, because my concept of the block size setting is clearly incongruous with reality. When loaded into a dgCMatrix wrapped with DelayedArray, the relatively large matrix requires 6.02 GB as reported by pryr::object_size(). I have 78 GB of RAM, so I reasoned that allocating 64 GB to block operations might minimize the number of such operations. Maybe DelayedArray allocates more than one block concurrently? If so, I wonder how I might limit/control the memory usage. (In my experience so far, I have not found the error messages to be illuminating when a BiocSingular SVD process fails, apparently when memory requirements, judging from dmesg dumps, exceed some limit. I am concerned about this because we may use the SVD in a package used by others.)

Again, I appreciate your consideration and patience.
Thank you.

Hi,

block size != block length

block length = number of array elements in a block (prod(dim(block))).
block size = block length * size of the individual elements in memory.

For example, for an integer array, block size (in bytes) is going to be 4 x block length. For a numeric array (type == "double"), it's going to be 8 x block length.

In its current form, block processing in DelayedArray must decide the geometry of the blocks before starting the walk on the blocks. It does this based on several criteria. Two of them are:

  • the auto block size: maximum size (in bytes) of a block once loaded in memory

  • the type() of the array (e.g. integer, double, complex, etc...)

The auto block size setting and type() control the maximum length of the blocks. Other criteria control their shape. So for example if I set the auto block size to 8GB, this will cap the length of the blocks to 2e9 if my DelayedArray object is of type integer and to 1e9 if it's of type double.

Note that this simple relationship between block size and block length assumes that blocks are loaded in memory as ordinary (a.k.a. dense) matrices or arrays. With sparse blocks, all bets are off. But the max block length is always taken to be the auto block size divided by get_type_size(type()) whether the blocks are going to be loaded as dense or sparse arrays. If they are going to be loaded as sparse arrays, their memory footprint is very likely to be smaller than if they were loaded as dense arrays so this is safe (although probably not optimal).
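In code, the arithmetic looks like this (a sketch, assuming your version of DelayedArray exports getAutoBlockLength()):

library(DelayedArray)
setAutoBlockSize(8e9)           # cap the in-memory size of a block at 8 GB
getAutoBlockLength("integer")   # 8e9 / 4 bytes per element = 2e9 elements
getAutoBlockLength("double")    # 8e9 / 8 bytes per element = 1e9 elements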

It's important to keep in mind that the auto block size setting is a simple way for the user to put a cap on the memory footprint of the blocks. And that's all. In particular it doesn't control the maximum amount of memory used by the block processing algorithm. Other variables can dramatically impact memory usage, like parallelization (where more than one block is loaded in memory at any given time), what the algorithm is doing with the blocks (e.g. something like blockApply(x, identity) will actually load the entire array data in memory), what delayed operations are carried by x, etc. It would be awesome to have a way to control the maximum amount of memory used by a block processing algorithm as a whole but I don't know how to do that.
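As a rough illustration with the numbers from your runs (a back-of-the-envelope upper bound only, not an exact model of what block processing allocates):

# Each of the 4 workers can hold at least one fully dense block of the
# auto block size, on top of the input object itself and whatever copies
# the algorithm makes, so a 2 GB block size cap does not mean a 2 GB run.
n_workers     <- 4
block_size_gb <- 2147483647 / 1e9   # ~2.1 GB per fully dense block
input_gb      <- 6.02               # pryr::object_size() of the wrapped dgCMatrix
n_workers * block_size_gb + input_gb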

Finally w.r.t. the .Machine$integer.max limit on the block length. This was an early decision to avoid all kinds of complications with blocks that are longer than this limit. Last time I checked an ordinary matrix or array in R could not be longer than that (maybe this has changed) so it would be impossible to load a dense block of that length in the first place. Also this was an easy way to protect many parts of the DelayedArray code against integer overflows. This could be revisited but it's not a simple change. Anyway I'm not convinced that using crazy block sizes is the way to address performance bottlenecks.

Hope this helps.

H.

commented

Hi @hpages,

I am grateful for your valuable description of the definitions and usage of the DelayedArray memory-related parameters. (I regret confusing block length and block size; I unfortunately did not read the error message,

Automatic block length is too big. Blocks of length > .Machine$integer.max are not supported yet. Please reduce the automatic block length by reducing the automatic block size with setAutoBlockSize().

with adequate care.)

I believe that I read somewhere that R 3.0.0 introduced 'long vectors' with > 2^31-1 elements, although my recollection is hazy now. (I also seem to recall that the maximum index value for a matrix is limited to 2^31-1.)

Ahh, here it is: Long Vectors. Or perhaps I misunderstand you.

Anyway, I think that I understand better what I see, and I need certainly to think more carefully about these details. And I appreciate better some of the complexities you deal with in the DelayedArray package!
Thank you!

Correct, long vectors were introduced in R 3.0.0. However, in the early days you couldn't do much with them because very few operations in base R had been modified to support them. I believe things have improved significantly since then, though, so maybe all the base R operations that are needed to support blocks of length >= 2^31 are now capable of operating on long arrays. We would also need to make sure that the matrix summarization functions from the matrixStats package can handle that.

commented

Hi @LTLA and @hpages,
I appreciate your insights into R and your work on the DelayedArray and BiocSingular packages.
Thanks to you, I think that I have sufficient understanding and confidence to proceed with testing in the near future, so I am closing this issue.
I want to thank both of you for your patience and help.