greg7mdp / parallel-hashmap

A family of header-only, very fast and memory-friendly hashmap and btree containers.

Home Page:https://greg7mdp.github.io/parallel-hashmap/

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Thread safe way to check existence of items

easypickings opened this issue · comments

I'm not quite familiar with thread safety issues, but suppose I have a hash map that has been initialized with some key-value pairs. Is it okay to store the data in a flat_hash_map and use STL-like ops to check if a key exists in the map with multi-threading (the data is assured not to be altered during the checking)?

Specifically, what I want to do is something like the code below:

phmap::flat_hash_map<int, int> map;

// fill map with some kv pairs

#pragma omp parallel for
for (int i = beg; i < end; ++i) {
  auto it = map.find(i);
  if (it != map.end()) {
    std::cout << it->second << std::endl;
  } else {
    std::cout << "not found\n";
  }
}

Yes, if the map is not modified, it is perfectly safe to check whether it contains keys from multiple threads.

Thanks Greg! Another question please: if I have a lot of (tens of millions maybe) simple <int, int> pairs to insert into a hash map, is it a good idea to use a parallel_flat_hash_map and insert using multiple threads? My concern is that the number of data to insert is huge, and a single insert operation is lightweight, which may make the mutex lock a overhead.

Sure, it still would be much faster to use a parallel_flat_hash_map and multiple threads, just make the N template parameter larger than the default 4, maybe 10 or something, and there will be very little mutex contention.

Another possibility. Where are all those pairs you want to insert coming from? If you can iterate very quickly over them, you can actually insert in a parallel_flat_hash_map without locking at all.

Another possibility. Where are all those pairs you want to insert coming from? If you can iterate very quickly over them, you can actually insert in a parallel_flat_hash_map without locking at all.

Hmm... I have an array storing m offsets of a file. What I want to do is iterate over the array, read some bytes starting at the offset into memory, then store the pair <offset, index> in the map. So I don't think it can be done quickly. But how can iterating quickly make the insertion lock-free anyway?

Actually that would work well. The parallel-flat-hash internally has an array of submaps (When N=4 you have 16 submaps).
What you would do is start 16 threads, and each thread would populate its own submap (thread # 0 populate submap 0, etc.....
So each thread would:

  • iterate over all the offsets in the file.
  • for each offset, check which submap it would go to (using the submap function).
    If not its target submap, it would do nothing.
    If it is, it would read some bytes starting at the offset into memory, then store the pair <offset, index> in the map.

that's it, no locking necessary.

This is what the bench does here. I really should write a better example.

That's cool! By the way, when using a lock-free parallel hash map, is there a difference among insertion methods, like operator[]/insert/emplace?

If you use the method I indicated above, use emplace_with_hash so you pass the hashval and it is not recomputed. Otherwise it doesn't make much difference when you insert a pair of integers.

Good to know! And thanks again for this awesome work and all your help!