natverse / nat

NeuroAnatomy Toolbox: An R package for the (3D) visualisation and analysis of biological image data, especially tracings of single neurons.

Home Page: https://natverse.org/nat/

improve performance of simplify_neuron

dokato opened this issue

I profiled the simplify_neuron function on a large neuron with 76,296 vertices. The function took about 10 s to run, of which the Dijkstra step took almost 1.5 s and the apply call below took around 6 s in total (spread over a few iterations):

furthest_leaf_idx = which.max(apply(dd, 2, robust_max))

[Screenshot (2021-06-23 16:43): profvis profile of the simplify_neuron run]

The profvis object behind the profile above is attached:
simplify_neuron_profile.rda.zip
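For reference, a minimal sketch of how such a profile can be captured, assuming a large neuron object x (e.g. ~76k vertices) is already loaded:

library(nat)
library(profvis)

# wrap the call in profvis() to get an interactive profile like the one above
p <- profvis(simplify_neuron(x, n = 1))
print(p)   # opens the profile in the viewer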

Thanks! Here are some ideas for speeding up simplification of large neurons, which currently gets very slow.

  • Use a sparse matrix for the distance table
  • Use a faster row processing method
  • Use a segmentgraph rather than the full ngraph for distance calculations, with the intervening nodes stored as an edge property (see the sketch after this list)
  • Prune the shorter of each pair of terminal twigs. This may not work in every situation, so we need to think about how it could be restricted appropriately, but I suspect it would halve the number of nodes under consideration in most cases.
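A rough sketch of the segmentgraph idea, not a final implementation: Cell07PNs is just the small sample data bundled with nat, and the segmentgraph() arguments should be checked against the current signature.

library(nat)
library(igraph)

x  <- Cell07PNs[[1]]                   # small example neuron shipped with nat
sg <- segmentgraph(x, weights = TRUE)  # nodes are branch/end points only;
                                       # edge weights summarise the intervening segment
dd <- distances(sg)                    # distance table on the reduced graph,
                                       # far smaller than on the full ngraph
dim(dd)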

Ad 1) I tried a sparse distance matrix, but tbh it doesn't help much while still going through apply:

> library(Matrix)
> dd[dd == Inf] = 0
> sparse_mat <- as(dd, "sparseMatrix")
> system.time(which.max(apply(sparse_mat, 2, robust_max)))
   user  system elapsed 
  3.838   1.286   5.134 
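For the sparse route to pay off, the reduction itself needs to be sparse-aware rather than going through apply, which densifies each column. A sketch of a hypothetical helper (sparse_col_max is not an existing function, just an illustration) that takes column maxima straight from the dgCMatrix slots:

library(Matrix)

# hypothetical helper: column maxima over the stored (non-zero) entries of a
# dgCMatrix, computed from the @p/@x slots so no column is ever densified
sparse_col_max <- function(m, default = 0) {
  stopifnot(inherits(m, "dgCMatrix"))
  nnz_per_col <- diff(m@p)                          # stored entries per column
  col_of_x <- rep.int(seq_len(ncol(m)), nnz_per_col)
  out <- rep(default, ncol(m))                      # empty columns keep the default
  if (length(m@x) > 0) {
    mx <- tapply(m@x, col_of_x, max)
    out[as.integer(names(mx))] <- mx
  }
  out
}

# e.g. which.max(sparse_col_max(sparse_mat)) in place of the apply() call above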

Ad 2) Here's a benchmark of 3 approaches.

library(rbenchmark)   # for benchmark()
library(matrixStats)  # for colMins()

mat <- matrix(runif(1e6), nrow = 100)

benchmark("apply" = {
  apply(mat, 2, min, na.rm = TRUE)
}, "colMins" = {
  colMins(mat, na.rm = TRUE)
}, "pmin" = {
  do.call(pmin, lapply(1:nrow(mat), function(i) mat[i, ]))
})

mat <- matrix(runif(1e6), nrow = 10000)

benchmark("apply" = {
  apply(mat, 2, min, na.rm = TRUE)
}, "colMins" = {
  colMins(mat, na.rm = TRUE)
}, "pmin" = {
  do.call(pmin, lapply(1:nrow(mat), function(i) mat[i, ]))
})

Looks like colMins is the winner.

Thanks for the investigations @dokato. These two need to go together:

  • Use a sparse matrix for the distance table
  • Use a faster row processing method

because running apply on a sparse matrix just turns each row/column back into a dense vector.

The key cost when n is low is the apply over robust_max, as shown by your original benchmarking. The reason for robust_max is that I used Inf as a signalling value; I don't exactly recall why I went with that over NA. If we can switch to NA then regular functions should work.

I have toyed with the idea of including matrixStats as a dependency in the past, but managed to find an equally fast approach without it – I have tried quite hard to avoid adding unnecessary dependencies to the base nat package. Another idea I have had is to make a whole extension package (something like fastnat) that adds Rcpp-compiled code to handle some core operations that are hard to vectorise in base R (I don't want compiled code in nat itself). But adding matrixStats as a dependency would certainly be simpler if it covers most of the opportunities for speed-ups.
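A sketch of what the NA variant could look like, assuming robust_max() essentially takes the maximum while ignoring the Inf signalling values (it also warns about multiple trees, which this sketch drops):

# swap the signalling value from Inf to NA so regular functions apply
dd[!is.finite(dd)] <- NA_real_

# base R replacement for the current apply(dd, 2, robust_max)
furthest_leaf_idx <- which.max(apply(dd, 2, max, na.rm = TRUE))

# or, if matrixStats were accepted as a dependency
furthest_leaf_idx <- which.max(matrixStats::colMaxs(dd, na.rm = TRUE))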

Thanks, that makes sense regarding robust_max, but see my modification in #472: I only replace robust_max with igraph::distances, which returns the same results and is much faster (see my benchmark in that PR). The only thing it misses is the multiple-trees warning; I can look into adding that back.

matrixStats, however, is used to replace a regular min; PTAL: https://github.com/natverse/nat/pull/472/files#diff-5faab5de536b048b9db9bb68dc86e6acd1538f5737c36bea0d6997be5ce33ca2R947-R948