Data.Graph.bcc is not efficient

Question

meooow25 opened this issue a year ago

The collect function recursively concatenates lists over a tree, making the time complexity quadratic. This can be avoided using difference lists.
The do_label step builds some unnecessary intermediate trees, similar to the dfs we had (#882).
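
To illustrate the first point, here is a minimal sketch of the difference-list idea on a plain Data.Tree. It is not the actual collect code from Data.Graph; flattenSlow and flattenFast are hypothetical names used only for this example. The slow version re-copies each subtree's result once per ancestor, which is quadratic on a path-shaped tree; the fast version passes the "rest of the list" down as an accumulator, so each element is consed exactly once.

```haskell
import Data.Tree (Tree (..))

-- Hypothetical example, not the Data.Graph implementation.
-- Naive preorder flattening: every level appends the results of its
-- children, so an element is copied once per ancestor. On a tree that
-- is one long path this is O(n^2).
flattenSlow :: Tree a -> [a]
flattenSlow (Node x ts) = x : concatMap flattenSlow ts

-- Difference-list style: 'go t rest' produces the preorder of 't'
-- followed by 'rest'. Nothing is appended after the fact, so the
-- total work is O(n) regardless of the tree's shape.
flattenFast :: Tree a -> [a]
flattenFast t = go t []
  where
    go :: Tree a -> [a] -> [a]
    go (Node x ts) rest = x : foldr go rest ts
```

Both functions return the same list; for `Node 1 [Node 2 [Node 4 []], Node 3 []]` each gives `[1,2,4,3]`. Only the cost differs.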

I'm not aware of common use cases for bcc, so I'm not sure if this affects anyone.
But as long as we have it, we should make it efficient.

I can send a PR.

David Feuer · Answer 1 · Sat Jan 21 2023 14:21:45 GMT+0800 (China Standard Time)

Everything should be efficient if it can be, yes! Thanks.

David Feuer · Answer 2 · Sat Jan 21 2023 14:25:40 GMT+0800 (China Standard Time)

I can't make head or tail of the current algorithm. Is it explained in the paper? Please comment your version liberally; this doesn't seem likely to be obvious.

Soumik Sarkar · Answer 3 · Sat Jan 21 2023 14:43:13 GMT+0800 (China Standard Time)

Agreed that the code is not descriptive at all. It is explained in the paper though. I'll make sure to add comments to make things a bit clearer 👍

David Feuer · Answer 4 · Tue Jan 31 2023 07:08:17 GMT+0800 (China Standard Time)

The new bcc code still uses forest twice, successively (i.e., not interleaved). This strikes me as rather bad. King and Launchbury pushed for lazy dfs; Tarjan's paper does ... something else. I think the question you raise is a good one: what can we do to help manage complexity without realizing too much forest? Lazy dfs can help in some cases, but for bcc we have to make sure the lazily produced depth-first forest isn't shared with what's used to build an array.

Soumik Sarkar · Answer 5 · Tue Jan 31 2023 23:16:25 GMT+0800 (China Standard Time)

I'm not sure I follow your comment. We do not have lazy dfs, so we get the full forest when we run dff. If we traverse it twice there is no extra memory cost. But if the question is whether we can avoid the time cost of traversing it twice, then we can think about it. It's possible to traverse it once but we need arbitrary dnum lookups when collecting, which means mutable arrays to keep the same complexity. If we go one step further and combine the dfs with collecting the result, that makes it very close to Tarjan's algorithm.

David Feuer · Answer 6 · Tue Jan 31 2023 23:48:59 GMT+0800 (China Standard Time)

We can achieve semi-lazy dfs (lazy in preorder) using lazy ST.

David Feuer · Answer 7 · Thu Feb 02 2023 11:48:31 GMT+0800 (China Standard Time)

Here's another thought for a place to start: can we get rid of the dnum array? Everything's a bit tangled, but it looks to me like we're almost certainly traversing the tree in depth-first order. So instead of checking an array to figure out where we are, we can keep track as we go. Something like

bicomps :: Int -> Tree Vertex -> (Int, Forest [Vertex])
collect :: Int -> Tree Vertex -> (Int, Int, ...)

Each function takes the current preorder index and returns the new one.
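
As a rough sketch of that index-threading shape (not the bicomps or collect functions themselves, whose full types are elided above), the following labels each node of a tree with its preorder number without consulting any array. labelPreorder is a hypothetical name chosen for this example; mapAccumL from Data.Traversable carries the counter across the children.

```haskell
import Data.Traversable (mapAccumL)
import Data.Tree (Tree (..))

-- Hypothetical example of threading a preorder counter through a tree.
-- Takes the index to assign to the root and returns the next unused
-- index together with the labelled tree.
labelPreorder :: Int -> Tree a -> (Int, Tree (Int, a))
labelPreorder n (Node x ts) = (n', Node (n, x) ts')
  where
    -- The root takes index n; the children are numbered left to right
    -- starting from n + 1, each one handing its final index to the next.
    (n', ts') = mapAccumL labelPreorder (n + 1) ts
```

Calling `labelPreorder 0` on a four-node tree returns 4 as the next index, with the nodes numbered 0 to 3 in depth-first order. A real bcc would have to thread further values (such as low points) alongside the counter in the same way.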

This is obviously ... unpleasant. That brings us back to the general DFS abstraction question.