emmanuelparadis / ape

analysis of phylogenetics and evolution

Home Page: http://ape-package.ird.fr/


Error in nj(x) : cannot allocate memory block of size 134217728 Tb

migueldiascosta opened this issue

Hello,

we're seeing a cannot allocate memory block of size 134217728 Tb error when running nj with more than n = 46341 sequences. This seems to happen because, for larger n, the intermediate product n * (n - 1) exceeds the maximum 32-bit signed integer, so what's passed to R_alloc in

ape/src/nj.c

Line 75 in a99bc96

new_dist = (double*)R_alloc(n * (n - 1) / 2, sizeof(double));
is garbage.

It's trivial to raise the limit to n = 65536, since up to that point n * (n - 1) / 2 still fits in a 32-bit signed integer, but for even larger n I suppose the code would require further changes, as the type of the indices would also need to change.
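
For the record, here's a tiny standalone demo of what goes wrong (not the actual nj.c code, just an illustration of the overflow):

#include <stdio.h>

int main(void)
{
    int n = 46342;                                 /* first n for which n * (n - 1) overflows a 32-bit int */
    int bad = n * (n - 1) / 2;                     /* signed overflow: the result is garbage (undefined behaviour) */
    long long good = (long long) n * (n - 1) / 2;  /* widen first: 1073767311, the correct number of pairs */
    printf("32-bit: %d, 64-bit: %lld\n", bad, good);
    return 0;
}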

(I see this recent commit, 4a1f232, so I suppose this doesn't come as a surprise to anyone :) but I'm also wondering if it's even feasible, in terms of computational time, to run nj with so many sequences?)

Cheers,
Miguel

Hi,
One way to fix this could be to test whether n is even or odd and perform the division by 2 on the even factor (n or n - 1) first, but this would only work for n <= 65536, as you rightly pointed out.
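
In code, the idea would be something like this (just a sketch reusing the variables from nj.c; npairs is only an illustrative name):

int npairs = (n % 2 == 0) ? (n / 2) * (n - 1) : n * ((n - 1) / 2);  /* divide the even factor first */
new_dist = (double *) R_alloc(npairs, sizeof(double));              /* npairs fits in an int up to n = 65536 */
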
Objects of class "dist" can be larger than 2.1 billion elements, here's a simple example:

R> x <- rnorm(7e4)
R> d <- dist(x)
R> str(length(d))
 num 2.45e+09
R> print(object.size(d), unit = "Gb")
18.3 Gb

On the C side, the length of d would be handled with something like:

double L = XLENGTH(d);

so there would be no need to compute n * (n - 1) / 2, and the indices running along d would be 64-bit integers (aka long). I think that's the best solution: the limit would then be whatever your machine can store in memory.
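
Something like this minimal sketch (not the actual nj.c code; it only illustrates taking the length from d and using 64-bit indices):

#include <R.h>
#include <Rinternals.h>

/* sketch: get the number of pairs directly from the length of the "dist"
   object d and index it with R_xlen_t (64-bit) instead of int */
void nj_alloc_sketch(SEXP d)
{
    R_xlen_t len = XLENGTH(d);  /* may exceed 2^31 - 1 */
    double *new_dist = (double *) R_alloc(len, sizeof(double));
    for (R_xlen_t i = 0; i < len; i++)
        new_dist[i] = REAL(d)[i];
}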

As for running times, in my experience this function scales as O(n^3) (which is, if I remember correctly, the expected complexity of the NJ algorithm).

Cheers,
Emmanuel

To increase the limit to 65536, we simply patched the code to store n * (n - 1) in an intermediate long variable before dividing by 2 and casting back to int. Since the dataset where this occurred has about 56k sequences, this was enough to solve the memory allocation error. It's currently running, so let's see how long it takes / whether it ever finishes 😅
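
The change is essentially this (a sketch of our local patch around the R_alloc call shown above; long is 64-bit on the systems we use):

long np = (long) n * (n - 1);                                  /* 64-bit intermediate, so the product no longer overflows */
new_dist = (double *) R_alloc((int) (np / 2), sizeof(double)); /* np / 2 still fits in an int for n <= 65536 */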

Regarding the generic solution, yes, I think you're quite right

Cheers,
Miguel

Done and pushed here. I've incremented the version number for the record.
Cheers,

Hi Miguel,
to speed things up for large trees, fastME can be a bit faster than NJ, especially if you do not perform tree rearrangements. You might also want to check whether you have duplicated sequences; you can remove these and add them back to the tree afterwards.
Regards,
Klaus

thanks @KlausVigo. I'm not the actual end user, but I relayed your suggestion :)

I'm thinking that it should be trivial to add some parallelization to nj - from what I see, most of the time is spent in sum_dist_to_i, so even just adding e.g.

--- src/nj.c.orig	
+++ src/nj.c	
@@ -92,6 +92,7 @@
     k = 0;
 
     while (n > 3) {
+#pragma omp parallel for default(shared)
 	for (i = 1; i <= n; i++) /* S[0] is not used */
 	    S[i] = sum_dist_to_i(n, D, i);
 
--- src/Makevars.orig	
+++ src/Makevars	
@@ -1 +1,2 @@
 PKG_LIBS = $(LAPACK_LIBS) $(BLAS_LIBS) $(FLIBS)
+PKG_CFLAGS += -fopenmp

seems to already help (there would be a lot to further optimise, e.g. the OpenMP schedule and the memory access patterns, and there are other loops there that can also be parallelised, but this seems to be the lowest-hanging fruit)

is there a downside to this?

The downside is that parallelization depends on the hardware and the operating system of the user. One wants to give the user the chance to choose the number of cores to use. CRAN packages are only allowed to use 2 cores by default, since CRAN tests several packages at once on their servers. But see https://cran.r-project.org/doc/manuals/r-devel/R-exts.html#OpenMP-support for what CRAN has to say about OpenMP.
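
For instance, the thread count could be passed in from the R side instead of taking whatever OpenMP picks by default; a rough sketch (the nthreads argument is hypothetical, and the prototype of sum_dist_to_i is assumed to match the one in nj.c):

#ifdef _OPENMP
#include <omp.h>
#endif

/* assumed to match the declaration in nj.c */
double sum_dist_to_i(int n, double *D, int i);

/* sketch: the S[] update loop from nj.c with a user-supplied thread count */
void fill_S(int n, double *D, double *S, int nthreads)
{
#ifdef _OPENMP
#pragma omp parallel for num_threads(nthreads) default(shared)
#endif
    for (int i = 1; i <= n; i++)  /* S[0] is not used */
        S[i] = sum_dist_to_i(n, D, i);
}

And, for CRAN, the manual linked above recommends putting $(SHLIB_OPENMP_CFLAGS) in both PKG_CFLAGS and PKG_LIBS in src/Makevars rather than hard-coding -fopenmp, so the flag is only used when the compiler actually supports OpenMP.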

@KlausVigo all valid concerns (and thanks for that link), especially around thread safety, but on the other hand R is often linked against threaded libraries anyway (at least the BLAS ones, e.g. OpenBLAS, MKL, etc.), so imho this wouldn't be much different...

Where I'm coming from is that in modern workloads R is often (most of the time?) using a lot more memory than the available memory per core, and in HPC settings (you can guess where I fit in all this) that can mean a lot of wasted cores.

Even when R packages are parallelised, it's often via multiprocessing, which doesn't really help with this issue (at least not without explicitly distributing memory between processes): it simply uses even more memory, if available, and still wastes cores. Multithreading would be ideal from that point of view, since threads share memory.

anyway, these are just my 2 (sing)cents :)