emmanuelparadis / ape

analysis of phylogenetics and evolution

Home Page: http://ape-package.ird.fr/

ape::cophenetic.phylo() - Tree too big - any workarounds?

alptaciroglu opened this issue

Hi,

I am trying to calculate pairwise tree distances in a large phylogenetic tree. As expected, I am getting the "tree too big" error. However, I really need this distance matrix, so I was trying to see whether I can manage this using modified functions from the "ape" package.

I would like to ask whether ape can support larger trees. If not, I would appreciate a hint about what is going wrong with my own code (reproducible example attached below). Thanks a lot in advance.

#### Original ape scripts

example_large_tree <- ape::read.tree('example_large.tree')
example_large_tree_dist <- ape::cophenetic.phylo(example_large_tree)
Error in dist.nodes(x) : tree too big

#### Modified ape scripts

dist.nodes.mod <- function (x)
{
    x <- reorder(x)
    n <- Ntip(x)
    m <- x$Nnode
    nm <- n + m

    if (nm > floor(sqrt(2^40 - 1))) ## I changed the initial value to 2^40 to support larger trees
        stop("tree too big")
    d <- .Call(dist_nodes, as.integer(n), as.integer(m), as.integer(x$edge[, ## I changed the original ".C" to ".Call" following some posts from stackoverflow
        1] - 1L), as.integer(x$edge[, 2] - 1L), as.double(x$edge.length),
        as.integer(Nedge(x)), double(nm ^ 2), NAOK = TRUE)[[7]]  ## I changed the initial double(nm * nm) to double(nm ^ 2)

    dim(d) <- c(nm, nm)
    dimnames(d) <- list(1:nm, 1:nm)
    d
}

environment(dist.nodes.mod) <- asNamespace('ape')
assignInNamespace("dist.nodes", dist.nodes.mod, ns = "ape")

example_large_tree <- read.tree('example_large.tree')
example_large_tree_dist <- ape::cophenetic.phylo(example_large_tree)

Error in .Call(dist_nodes, as.integer(n), as.integer(m), as.integer(x$edge[,  : 
  NULL value passed as symbol address
Calls: <Anonymous> -> dist.nodes -> .Call
Execution halted

example_large_tree.zip

Hi,

1/ Do you really want to have the full cophenetic matrix? Your tree has 154656 tips, so this will require at least 96 GB of memory to store the result in an object of class "dist" (because the cophenetic distance is symmetric, we can keep only the lower triangle).

And depending on what operations you want to do on this object in R, you may need much more memory (e.g., printing a "dist" object creates a temporary matrix which requires double this quantity of memory).

2/ The current code in ape first computes the full distance matrix among all nodes and tips of the tree and returns a square (symmetric) matrix of at least 8*(n+m)^2 bytes, where n is the number of tips and m the number of internal nodes. It will be complicated to change this because dist.nodes() is used in many other packages.

3/ The current C code is a bit "old", which explains the upper limit of ~2.1e9 on (n + m)^2. This could be improved, but not by several orders of magnitude. Maybe this could help to reach the machine limits (i.e., the quantity of RAM).

4/ You cannot change .C() to .Call() without adapting the C code (and recompiling of course).

5/ A (general) solution to compute cophenetic distances with very big trees could be to do this on a pairwise basis or on a subset of the tips. This can be done with functions in ape. For instance, with your tree (tr), I could get the cophenetic distance between tips #1 and #2 with:

R> nodepath(tr, 1, 2) # takes a few seconds
[1]      1 154668 154669      2
R> which(tr$edge[, 2] == 1)
[1] 12
R> which(tr$edge[, 2] == 2)
[1] 14
R> which(tr$edge[, 2] == 154669)
[1] 13
R> tr$edge[12:14, ] # <- the edges we are looking for
       [,1]   [,2]
[1,] 154668      1
[2,] 154668 154669
[3,] 154669      2
R> sum(tr$edge.length[12:14])
[1] 1.74321
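
The points above can be sketched in base R. The first part reproduces the size figures from 1/ to 3/, assuming 8 bytes per double and assuming the ~2.1e9 limit is R's maximum integer value. The second part is a hypothetical helper, `pair_dist()`, which is not part of ape: instead of calling `nodepath()`, it walks from each tip up to the root and sums the lengths of the edges that are not shared by both paths. That is a swapped-in technique, but it should give the same sum as the manual steps in 5/. The toy tree is a plain list laid out like a "phylo" object (`edge`, `edge.length`), so the example runs without ape installed.

```r
# Part 1: sizes discussed in points 1/ to 3/ (assumes 8 bytes per double)
n_tips  <- 154656
dist_gb <- n_tips * (n_tips - 1) / 2 * 8 / 1e9  # lower triangle only, about 96 GB
# Largest n + m whose square still fits in a 32-bit signed integer
# (assumption: this is where the ~2.1e9 limit comes from)
max_nm  <- floor(sqrt(.Machine$integer.max))

# Part 2: hypothetical helper, not an ape function.
# Distance between nodes i and j = sum of edge lengths on the path joining them.
pair_dist <- function(tr, i, j) {
  n_nodes <- max(tr$edge)
  parent  <- numeric(n_nodes)            # parent[k] stays 0 for the root
  len     <- numeric(n_nodes)            # len[k] = length of the edge above node k
  parent[tr$edge[, 2]] <- tr$edge[, 1]
  len[tr$edge[, 2]]    <- tr$edge.length

  # All nodes from k up to the root, k included
  path_to_root <- function(k) {
    nodes <- k
    while (parent[k] != 0) {
      k <- parent[k]
      nodes <- c(nodes, k)
    }
    nodes
  }

  path_i <- path_to_root(i)
  path_j <- path_to_root(j)
  # Nodes strictly below the most recent common ancestor, on either side
  below_mrca <- c(setdiff(path_i, path_j), setdiff(path_j, path_i))
  sum(len[below_mrca])
}

# Toy tree ((t1:1,t2:2):0.5,t3:3); tips are 1 to 3, root is 4
toy <- list(edge = matrix(c(4, 5,
                            5, 1,
                            5, 2,
                            4, 3), ncol = 2, byrow = TRUE),
            edge.length = c(0.5, 1, 2, 3))
pair_dist(toy, 1, 2)  # edges 5-1 and 5-2
pair_dist(toy, 1, 3)  # edges 5-1, 4-5 and 4-3
```

With a tree read by `read.tree()`, the same call would be `pair_dist(tr, 1, 2)`. The helper rebuilds the parent table on every call, so for many pairs it would be worth computing `parent` and `len` once and reusing them. For a subset of tips, pruning with `keep.tip()` and calling `cophenetic()` on the pruned tree is another option, provided the pruned tree stays below the size limit.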

This is actually much more than I had hoped for. Thanks a lot! (I had been working on this for quite some time.)

I was prepared to use up the RAM in my institute's farm, but I think I can work with the pairwise solution you described in "5/". May I suggest posting this on StackOverflow? I was looking for a solution there and couldn't find one.

Please feel free to close this thread.

Best regards