maminrayej / prepona

A graph crate with simplicity in mind

Indexing rather than hashing in DefaultIdMap

redbug312 opened this issue

I've been looking into the VF2 algorithm recently. The IdMap takes up a large share of the execution time because hashing is expensive, and the algorithm visits each node multiple times, which multiplies the cost.

Index-based lookups seem to solve the problem. I experimented with the following implementation, and main() now takes about half the time (36s down to 18s in debug mode, 2.3s down to 1.0s in release mode).

The drawback is that NodeId would have to stay a plain usize in the future, but I think the tradeoff is worth it. By the same approach, replacing IndexMap with Vec could speed things up further.

pub struct DefaultIdMap {
    real_to_virt: Vec<Option<usize>>,
    virt_to_real: Vec<NodeId>,
}

fn main() {
    let complete_graph_size = 8;
    let path_graph_size = 10;

    let path_graph: AdjMap<Undirected> = PathGraph::init(path_graph_size).generate();
    let lollipop_graph: AdjMap<Undirected> =
        LollipopGraph::init(complete_graph_size, path_graph_size).generate();

    let are_isomorph =
        VF2Isomorphism::init(&lollipop_graph, &path_graph, IsomorphismType::Subgraph).execute();
    assert!(are_isomorph)
}
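
To make the index-based lookup concrete, here is a minimal sketch of how such a struct might be built and queried. It assumes NodeId is a plain usize, as the proposal requires; the names from_real_ids, virt_of and real_of are hypothetical and are not taken from prepona.

```rust
// Assumption: NodeId is a plain usize, as the proposal requires.
type NodeId = usize;

pub struct VecIdMap {
    real_to_virt: Vec<Option<usize>>,
    virt_to_real: Vec<NodeId>,
}

impl VecIdMap {
    /// Builds the map from a graph's real node ids (hypothetical constructor).
    /// Virtual ids are assigned in iteration order: 0, 1, 2, ...
    pub fn from_real_ids(ids: impl IntoIterator<Item = NodeId>) -> Self {
        let virt_to_real: Vec<NodeId> = ids.into_iter().collect();
        // One slot for every possible real id up to the largest one seen.
        let len = virt_to_real.iter().max().map_or(0, |max| max + 1);
        let mut real_to_virt = vec![None; len];
        for (virt, &real) in virt_to_real.iter().enumerate() {
            real_to_virt[real] = Some(virt);
        }
        VecIdMap { real_to_virt, virt_to_real }
    }

    /// Real -> virtual: a bounds-checked slice access, no hashing involved.
    pub fn virt_of(&self, real: NodeId) -> Option<usize> {
        self.real_to_virt.get(real).copied().flatten()
    }

    /// Virtual -> real.
    pub fn real_of(&self, virt: usize) -> Option<NodeId> {
        self.virt_to_real.get(virt).copied()
    }
}
```

With real ids 7, 3, 5, for example, virt_of(3) yields Some(1), while virt_of(4) yields None because that slot was never filled.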

Also, I think translating between virtual and real ids is fine for memory efficiency. The only part that feels odd to me is the NodeIdMapProvider trait, since it relies on nothing but NodeProvider. Instead of a trait, a constructor like IdMap::new(graph: impl NodeProvider) seems sufficient.

Thanks for inspecting the problem with the VF2 algorithm :)

The problem I had with real_to_virt: Vec<Option<usize>> is the amount of memory it wastes if the graph grows and then shrinks. For example, if the user adds 1000 nodes to the graph and then deletes 500 of them, my understanding is that this approach keeps the slots of the deleted nodes and stores None in their place.
Unless I'm completely missing the point here, I acknowledge the performance gained by indexing, but I'm afraid the price we pay in wasted memory is too high.

This approach makes sense if there is no need to keep the NodeIds stable across deletions, or for graphs that only implement AddNodeProvider and hence only grow and never shrink. As a general solution, though, I'm not sure.

I was looking for alternative solutions and found a Reddit thread. It mentions that the Judy array, a specialized radix tree, is memory-efficient and needs no hashing. I've implemented an IdMap with the rudy crate to compare against the previous ones.

pub struct HashIdMap {
    real_to_virt: HashMap<NodeId, usize>,
    virt_to_real: Vec<NodeId>,
}

pub struct VecIdMap {
    real_to_virt: Vec<Option<usize>>,
    virt_to_real: Vec<NodeId>,
}

pub struct RudyIdMap {
    real_to_virt: RudyMap<NodeId, usize>,
    virt_to_real: Vec<NodeId>,
}

For the memory usage comparison, node ids 500..1000 and 5000..10000 are mapped. The sizes are measured with the deepsize crate, as in the gist. The Judy-array IdMap uses the least memory in both cases.

Mapped ids   HashIdMap  VecIdMap  RudyIdMap
500..1000    18.4kB     20.0kB    11.2kB
5000..10000  154.7kB    200.0kB   84.9kB
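
The VecIdMap figures can be roughly reproduced by hand, which also shows why its cost follows the largest id rather than the number of nodes. The sketch below is only an estimate: it assumes a 64-bit target, a NodeId of usize, and buffers with no spare capacity. The function name vec_idmap_heap_bytes is made up for illustration, and this is not how the gist measures sizes.

```rust
use std::mem::size_of;

/// Estimated heap bytes of a Vec-based IdMap that maps real ids `start..end`.
/// Assumes exactly-sized buffers (no spare capacity) and NodeId = usize.
pub fn vec_idmap_heap_bytes(start: usize, end: usize) -> usize {
    // real_to_virt needs one slot for every id in 0..end, used or not.
    let real_to_virt = end * size_of::<Option<usize>>();
    // virt_to_real stores one NodeId per mapped node.
    let virt_to_real = (end - start) * size_of::<usize>();
    real_to_virt + virt_to_real
}
```

On a 64-bit target Option<usize> occupies 16 bytes and usize 8 bytes, so 500..1000 gives 1000 * 16 + 500 * 8 = 20,000 bytes and 5000..10000 gives 10000 * 16 + 5000 * 8 = 200,000 bytes, consistent with the 20.0kB and 200.0kB entries. The unused slots for ids below the range account for half of the real_to_virt buffer.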

For the execution time comparison, the same VF2 main function is tested. The Judy-array IdMap is faster than the HashMap one but still slower than the Vec one.

Build mode    HashIdMap  VecIdMap  RudyIdMap
debug mode    36s        18s       30s
release mode  2.3s       1.0s      1.3s

The Judy array seems to fit sparse node ids best and could be the default candidate. I think the algorithms still need to be generic over IdMaps, so users can switch to the Vec-based IdMap when there are no deletions and execution time matters.

Thank you for your thorough investigation!

The performance gained by using RudyIdMap is okay, but its memory footprint is impressive.
I agree it's a strong candidate to be the default IdMap. And yes, algorithms should be decoupled from lower-layer implementations as much as possible.

So this should be the definition of the IdMap trait, right?

pub trait IdMap:
    Index<Self::VirtId, Output = Self::RealId> + Index<Self::RealId, Output = Self::VirtId>
{
    type VirtId;

    type RealId;

    fn idmap(g: &impl NodeProvider) -> Self;
}

And we can get rid of the NodeIdMapProvider.
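
As an illustration of algorithms being generic over IdMaps, here is a rough sketch with two interchangeable implementations. It is not the crate's API: NodeIds is a hypothetical stand-in for NodeProvider, IdList is a toy graph, and named methods replace the two Index bounds because, when both id types are usize, those two bounds would name the same trait.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for prepona's NodeProvider; only what the sketch needs.
pub trait NodeIds {
    fn node_ids(&self) -> Vec<usize>;
}

// Toy "graph" that is nothing more than a list of real node ids.
pub struct IdList(pub Vec<usize>);

impl NodeIds for IdList {
    fn node_ids(&self) -> Vec<usize> {
        self.0.clone()
    }
}

// Sketch of the trait: both id types are assumed to be usize here.
pub trait IdMap {
    fn new(graph: &impl NodeIds) -> Self;
    fn virt_of(&self, real: usize) -> Option<usize>;
    fn real_of(&self, virt: usize) -> Option<usize>;
}

pub struct HashIdMap {
    real_to_virt: HashMap<usize, usize>,
    virt_to_real: Vec<usize>,
}

impl IdMap for HashIdMap {
    fn new(graph: &impl NodeIds) -> Self {
        let virt_to_real = graph.node_ids();
        let real_to_virt: HashMap<usize, usize> =
            virt_to_real.iter().enumerate().map(|(v, &r)| (r, v)).collect();
        HashIdMap { real_to_virt, virt_to_real }
    }
    fn virt_of(&self, real: usize) -> Option<usize> {
        self.real_to_virt.get(&real).copied()
    }
    fn real_of(&self, virt: usize) -> Option<usize> {
        self.virt_to_real.get(virt).copied()
    }
}

pub struct VecIdMap {
    real_to_virt: Vec<Option<usize>>,
    virt_to_real: Vec<usize>,
}

impl IdMap for VecIdMap {
    fn new(graph: &impl NodeIds) -> Self {
        let virt_to_real = graph.node_ids();
        // One slot per possible real id up to the largest one.
        let len = virt_to_real.iter().max().map_or(0, |max| max + 1);
        let mut real_to_virt = vec![None; len];
        for (virt, &real) in virt_to_real.iter().enumerate() {
            real_to_virt[real] = Some(virt);
        }
        VecIdMap { real_to_virt, virt_to_real }
    }
    fn virt_of(&self, real: usize) -> Option<usize> {
        self.real_to_virt.get(real).copied().flatten()
    }
    fn real_of(&self, virt: usize) -> Option<usize> {
        self.virt_to_real.get(virt).copied()
    }
}

/// A routine written against the trait rather than a concrete map:
/// checks that every real id survives a real -> virtual -> real round trip.
pub fn round_trips<M: IdMap, G: NodeIds>(graph: &G) -> bool {
    let map = M::new(graph);
    graph
        .node_ids()
        .into_iter()
        .all(|real| map.virt_of(real).and_then(|v| map.real_of(v)) == Some(real))
}
```

A caller picks the map through the type parameter, for example round_trips::<VecIdMap, _>(&graph) or round_trips::<HashIdMap, _>(&graph), without the routine itself changing.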

Would you like to work on implementing it?

Thanks, I'm glad to take this on.

I agree with the trait definition, but I'd prefer to name the function new, so that the call reads DefaultIdMap::new(graph). If all goes well, I can push a PR in a couple of days.