Indexing rather than hashing in DefaultIdMap
redbug312 opened this issue · comments
I've been looking into the VF2 algorithm recently. The IdMap takes up a lot of execution time because hashing is expensive, and the algorithm has to visit nodes multiple times, which compounds the penalty.
Index-based lookups seem to solve the problem. I experimented with the following implementation, and main() now takes only half the time (36s to 18s in debug mode, 2.3s to 1.0s in release mode).
The drawback is that NodeId must remain a (usize) in the future, but I think the tradeoff is worth it. By the same approach, replacing IndexMap with Vec could speed it up further.
    pub struct DefaultIdMap {
        real_to_virt: Vec<Option<usize>>,
        virt_to_real: Vec<NodeId>,
    }

    fn main() {
        let complete_graph_size = 8;
        let path_graph_size = 10;
        let path_graph: AdjMap<Undirected> = PathGraph::init(path_graph_size).generate();
        let lollipop_graph: AdjMap<Undirected> =
            LollipopGraph::init(complete_graph_size, path_graph_size).generate();
        let are_isomorph =
            VF2Isomorphism::init(&lollipop_graph, &path_graph, IsomorphismType::Subgraph).execute();
        assert!(are_isomorph)
    }
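A minimal sketch of the index-based lookup described above, assuming NodeId is a plain usize. The constructor from_real_ids and the two accessor names are hypothetical and only illustrate the idea; they are not the library's API.

```rust
/// Assumption: `NodeId` is a plain `usize`, as the proposal requires.
type NodeId = usize;

pub struct VecIdMap {
    real_to_virt: Vec<Option<usize>>,
    virt_to_real: Vec<NodeId>,
}

impl VecIdMap {
    /// Hypothetical constructor: assigns virtual ids 0, 1, 2, ... in iteration order.
    pub fn from_real_ids(real_ids: impl IntoIterator<Item = NodeId>) -> Self {
        let virt_to_real: Vec<NodeId> = real_ids.into_iter().collect();
        // One slot per possible real id, up to the largest one seen.
        let slots = virt_to_real.iter().max().map_or(0, |&max| max + 1);
        let mut real_to_virt: Vec<Option<usize>> = vec![None; slots];
        for (virt, &real) in virt_to_real.iter().enumerate() {
            real_to_virt[real] = Some(virt);
        }
        VecIdMap { real_to_virt, virt_to_real }
    }

    /// Real -> virtual: a bounds check and one array read, no hashing.
    pub fn virt_id_of(&self, real: NodeId) -> Option<usize> {
        self.real_to_virt.get(real).copied().flatten()
    }

    /// Virtual -> real: plain indexing into the dense vector.
    pub fn real_id_of(&self, virt: usize) -> Option<NodeId> {
        self.virt_to_real.get(virt).copied()
    }
}
```

With real ids 7, 3 and 5, the virtual ids would be 0, 1 and 2, and any real id that was never inserted maps to None.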
Also, I think it is fine to translate between virtual and real ids for memory efficiency. The only thing that feels odd to me is the trait NodeIdMapProvider, because it just relies on NodeProvider. Instead of a trait, IdMap::new(graph: impl NodeProvider) seems sufficient for this usage.
Thanks for inspecting the problem with the VF2 algorithm :)
The problem I had with real_to_virt: Vec<Option<usize>> is the amount of wasted memory it holds on to if the graph is grown and then shrunk. For example, if the user adds 1000 nodes to the graph and then deletes 500 of them, my understanding is that this approach keeps the slots of the deleted nodes and sets None in their place.
Or maybe I'm completely missing the point here. If not, I acknowledge the performance gained by indexing, but I'm afraid the price we pay in wasted memory is too high.
This approach makes sense if there is no need to keep the NodeIds stable across deletions, or for graphs that only implement AddNodeProvider and hence only grow and never shrink, but as a general solution I'm not sure.
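The grow-then-shrink concern can be made concrete with a small sketch. It assumes that real ids are plain usize values, that they are never reused, and (purely for illustration) that the first 500 of the 1000 nodes are the deleted ones. The function names are hypothetical.

```rust
/// Builds the real -> virtual table for the surviving real ids and returns
/// (total slots, slots left as `None`).
pub fn slot_usage(live_real_ids: &[usize]) -> (usize, usize) {
    // The table must reach up to the largest surviving real id.
    let slots = live_real_ids.iter().max().map_or(0, |&max| max + 1);
    let mut real_to_virt: Vec<Option<usize>> = vec![None; slots];
    for (virt, &real) in live_real_ids.iter().enumerate() {
        real_to_virt[real] = Some(virt);
    }
    let empty = real_to_virt.iter().filter(|slot| slot.is_none()).count();
    (slots, empty)
}

/// The scenario from the comment: 1000 nodes added, then 500 deleted.
/// Assumption: ids 0..500 are the deleted ones, so ids 500..1000 survive.
pub fn grow_then_shrink() -> (usize, usize) {
    let survivors: Vec<usize> = (500..1000).collect();
    slot_usage(&survivors)
}
```

Under these assumptions the table keeps 1000 slots for 500 live nodes, so half of it holds None. A graph whose ids stay dense has no empty slots at all.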
I was looking for alternative solutions and found the reddit thread. It mentions that the Judy array, a specialized radix tree, is memory-efficient and free from hashing. I've implemented one with the rudy crate to compare with the previous ones.
    pub struct HashIdMap {
        real_to_virt: HashMap<NodeId, usize>,
        virt_to_real: Vec<NodeId>,
    }

    pub struct VecIdMap {
        real_to_virt: Vec<Option<usize>>,
        virt_to_real: Vec<NodeId>,
    }

    pub struct RudyIdMap {
        real_to_virt: RudyMap<NodeId, usize>,
        virt_to_real: Vec<NodeId>,
    }
For the memory usage comparison, node ids 500..1000 and 5000..10000 are mapped. The sizes are inspected using the deepsize crate, as in the gist. The Judy-array IdMap outperforms the others.
| Mapped ids | HashIdMap | VecIdMap | RudyIdMap |
|---|---|---|---|
| 500..1000 | 18.4kB | 20.0kB | 11.2kB |
| 5000..10000 | 154.7kB | 200.0kB | 84.9kB |
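The VecIdMap column appears consistent with simple arithmetic on the two buffers. The sketch below assumes a 64-bit target, NodeId being usize, vectors with no spare capacity, and a measurement dominated by the heap buffers rather than the small fixed-size struct. The function name is hypothetical.

```rust
use std::mem::size_of;
use std::ops::Range;

/// Estimated heap payload of a Vec-based id map covering `ids`:
/// one `Option<usize>` slot per real id below `ids.end`, plus one
/// `usize` per mapped node.
pub fn vec_id_map_payload_bytes(ids: Range<usize>) -> usize {
    let real_to_virt = ids.end * size_of::<Option<usize>>();
    let virt_to_real = ids.len() * size_of::<usize>();
    real_to_virt + virt_to_real
}
```

On a 64-bit target an Option<usize> takes 16 bytes, so 500..1000 gives 1000 × 16 + 500 × 8 = 20,000 bytes and 5000..10000 gives 10000 × 16 + 5000 × 8 = 200,000 bytes. Note that the slots below the first mapped id are paid for even though they only hold None.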
For the execution time comparison, the same VF2 main function is tested. The Judy-array IdMap is faster than the HashMap one but still falls behind the Vec one.
| Build mode | HashIdMap | VecIdMap | RudyIdMap |
|---|---|---|---|
| debug mode | 36s | 18s | 30s |
| release mode | 2.3s | 1.0s | 1.3s |
The Judy array seems to fit best for sparse node ids and could be used as the default candidate. I think the algorithms still need to be generic over IdMaps, so users can switch to the Vec-based IdMap if there are no deletions and execution time matters.
Thank you for your thorough investigation!
The performance gained by using RudyIdMap is okay, but its memory footprint is impressive.
I agree it's a strong candidate to be the default IdMap. And yes, algorithms should be decoupled from lower-layer implementations as much as possible.
So this should be the definition of the IdMap trait, right?
    pub trait IdMap:
        Index<Self::VirtId, Output = Self::RealId> + Index<Self::RealId, Output = Self::VirtId>
    {
        type VirtId;
        type RealId;
        fn idmap(g: &impl NodeProvider) -> Self;
    }
And we can get rid of the NodeIdMapProvider.
Would you like to work on implementing it?
Thanks, I'm glad to take on the work.
I agree with the trait definition, but I'd prefer to name the function new, so that it reads DefaultIdMap::new(graph). If all goes well, I can push a PR in a couple of days.
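As a rough sketch of where this could lead, the block below puts a new constructor on an IdMap trait and writes a check that is generic over the map. It is not the trait proposed above: it uses plain lookup methods instead of the two Index bounds, fixes both id types to usize, and relies on a hypothetical stand-in for NodeProvider with a made-up node_ids method. ToyGraph and round_trips are also invented for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the library's `NodeProvider`.
pub trait NodeProvider {
    fn node_ids(&self) -> Vec<usize>;
}

/// Simplified variant of the discussed trait: methods instead of `Index`.
pub trait IdMap {
    fn new(graph: &impl NodeProvider) -> Self;
    fn virt_id_of(&self, real: usize) -> Option<usize>;
    fn real_id_of(&self, virt: usize) -> Option<usize>;
}

pub struct HashIdMap {
    real_to_virt: HashMap<usize, usize>,
    virt_to_real: Vec<usize>,
}

impl IdMap for HashIdMap {
    fn new(graph: &impl NodeProvider) -> Self {
        let virt_to_real = graph.node_ids();
        let real_to_virt = virt_to_real
            .iter()
            .enumerate()
            .map(|(virt, &real)| (real, virt))
            .collect();
        HashIdMap { real_to_virt, virt_to_real }
    }
    fn virt_id_of(&self, real: usize) -> Option<usize> {
        self.real_to_virt.get(&real).copied()
    }
    fn real_id_of(&self, virt: usize) -> Option<usize> {
        self.virt_to_real.get(virt).copied()
    }
}

pub struct VecIdMap {
    real_to_virt: Vec<Option<usize>>,
    virt_to_real: Vec<usize>,
}

impl IdMap for VecIdMap {
    fn new(graph: &impl NodeProvider) -> Self {
        let virt_to_real = graph.node_ids();
        let slots = virt_to_real.iter().max().map_or(0, |&max| max + 1);
        let mut real_to_virt: Vec<Option<usize>> = vec![None; slots];
        for (virt, &real) in virt_to_real.iter().enumerate() {
            real_to_virt[real] = Some(virt);
        }
        VecIdMap { real_to_virt, virt_to_real }
    }
    fn virt_id_of(&self, real: usize) -> Option<usize> {
        self.real_to_virt.get(real).copied().flatten()
    }
    fn real_id_of(&self, virt: usize) -> Option<usize> {
        self.virt_to_real.get(virt).copied()
    }
}

/// Toy graph used only to exercise the sketch.
pub struct ToyGraph(pub Vec<usize>);

impl NodeProvider for ToyGraph {
    fn node_ids(&self) -> Vec<usize> {
        self.0.clone()
    }
}

/// Code written against the trait: the caller picks the map type.
pub fn round_trips<M: IdMap, G: NodeProvider>(graph: &G) -> bool {
    let map = M::new(graph);
    let ids = graph.node_ids();
    ids.iter()
        .all(|&real| map.virt_id_of(real).and_then(|v| map.real_id_of(v)) == Some(real))
}
```

A caller would then choose the trade-off at the call site, for example round_trips::<VecIdMap, _>(&graph) when speed matters and the ids are dense, or round_trips::<HashIdMap, _>(&graph) otherwise.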