Error in nj(x) : cannot allocate memory block of size 134217728 Tb
migueldiascosta opened this issue
Hello,
we're seeing a cannot allocate memory block of size 134217728 Tb error when running nj with more than n = 46341 sequences. This seems to be because, for larger n, n * (n - 1) becomes larger than the maximum 32-bit signed integer, and that is what's passed to R_alloc in
Line 75 in a99bc96
It's trivial to increase the maximum to n = 65536, because up to that point n * (n - 1) / 2 is still smaller than the maximum 32-bit signed integer, but for even larger n I suppose the code would require further changes, as the type of the indices would also need to be changed.
(I see this recent commit, 4a1f232, so I suppose this doesn't come as a surprise to anyone :) but I'm also wondering if it's even feasible, in terms of computational time, to run nj with so many sequences?)
Cheers,
Miguel
Hi,
One way to fix this could be to test whether n is even or odd and perform the division by 2 first on the even value (n or n - 1), but this would only work for n <= 65536, as you rightly pointed out.
Objects of class "dist" can be larger than 2.1 billion elements, here's a simple example:
R> x <- rnorm(7e4)
R> d <- dist(x)
R> str(length(d))
num 2.45e+09
R> print(object.size(d), unit = "Gb")
18.3 Gb
On the C side, the length of d would be handled with something like:
double L = XLENGTH(d);
so there would be no need to compute n * (n - 1) / 2, and the indices running along d would be 64-bit integers (aka long). I think that's the best solution, so the limit would be what your machine can store in memory.
As for running times, in my experience this function scales as O(n^3) (which is, if I remember correctly, the expected complexity of the NJ algorithm).
Cheers,
Emmanuel
To increase the limit to 65536, we simply patched the code to store n * (n - 1) in an intermediate long variable before dividing by 2 and casting back to int. Since the dataset where this occurred has about 56k sequences, this was enough to solve the memory allocation error; it's currently running, let's see how long it takes / if it ever finishes 😅
Regarding the generic solution, yes, I think you're quite right
Cheers,
Miguel
Done and pushed here. I've incremented the version number for the record.
Cheers,
Hi Miguel,
here are some suggestions to speed things up for large trees: FastME can be a bit faster than NJ, especially if you do not perform tree rearrangements. You might also want to check whether you have duplicated sequences; you can remove these and add them back to the tree afterwards.
Regards,
Klaus
thanks @KlausVigo. I'm not the actual end user, but I relayed your suggestion :)
I'm thinking that it should be trivial to add some parallelization to nj - from what I see, most of the time is spent in sum_dist_to_i, so even just adding e.g.
--- src/nj.c.orig
+++ src/nj.c
@@ -92,6 +92,7 @@
k = 0;
while (n > 3) {
+#pragma omp parallel for default(shared)
for (i = 1; i <= n; i++) /* S[0] is not used */
S[i] = sum_dist_to_i(n, D, i);
--- src/Makevars.orig
+++ src/Makevars
@@ -1 +1,2 @@
PKG_LIBS = $(LAPACK_LIBS) $(BLAS_LIBS) $(FLIBS)
+PKG_CFLAGS += -fopenmp
seems to already help (there would be a lot more to optimise, e.g. the OpenMP schedule, memory access patterns, and there are other loops there that can also be parallelised, but this seems to be the lowest-hanging fruit). Is there a downside to this?
The downside is that parallelization depends on the hardware and the operating system of the user. One wants to give the user the chance to choose the number of cores to use. CRAN packages are only allowed to use 2 cores by default, as CRAN tests several packages at once on their servers. But see https://cran.r-project.org/doc/manuals/r-devel/R-exts.html#OpenMP-support for what CRAN has to say about OpenMP.
@KlausVigo all valid concerns (and thanks for that link), especially around thread safety, but on the other hand R is often linked with threaded libraries anyway (at least the BLAS ones, e.g. OpenBLAS, MKL, etc.), so imho this wouldn't be much different...
Where I'm coming from is that in modern workloads R is often (most of the time?) using a lot more memory than the available memory per core, and in HPC settings (you can guess where I fit in all this) that can mean a lot of wasted cores.
Even when R packages are parallelised, it's often with multiprocessing, which doesn't really help with this issue (at least not without explicitly distributing memory between processes): it will simply use even more memory, if available, and still waste cores. So multithreading would be ideal from that point of view, since threads share memory.
anyway, these are just my 2 (sing)cents :)