Thread safe way to check existence of items

Question

easypickings opened this issue 9 months ago · comments

I'm not very familiar with thread-safety issues, but suppose I have a hash map that has been initialized with some key-value pairs. Is it okay to store the data in a flat_hash_map and use STL-like operations to check whether a key exists in the map from multiple threads (the data is guaranteed not to be modified during the lookups)?

Specifically, what I want to do is something like the code below:

phmap::flat_hash_map<int, int> map;

// fill map with some kv pairs

#pragma omp parallel for
for (int i = beg; i < end; ++i) {
  auto it = map.find(i);
  if (it != map.end()) {
    std::cout << it->second << std::endl;
  } else {
    std::cout << "not found\n";
  }
}

Gregory Popovitch · Answer 1 · Thu Dec 28 2023 23:35:09 GMT+0800 (China Standard Time)

Yes, if the map is not modified, it is perfectly safe to check whether it contains keys from multiple threads.
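A minimal sketch of this read-only pattern follows. It makes two substitutions so that it compiles without extra dependencies: std::unordered_map stands in for phmap::flat_hash_map, and std::thread stands in for OpenMP. The names parallel_lookup and LookupResult are hypothetical. Results are collected into per-position slots rather than printed from each thread, because concurrent writes to std::cout may interleave.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <unordered_map>
#include <vector>

struct LookupResult {
  bool found = false;
  int value = 0;
};

// Looks up every key in [beg, end) from `num_threads` threads.
// The map is only read, never modified, while the threads run.
// Each thread writes to its own slots of `out`, so no locking is needed.
std::vector<LookupResult> parallel_lookup(
    const std::unordered_map<int, int>& map, int beg, int end,
    unsigned num_threads) {
  assert(beg <= end && num_threads > 0);
  const std::size_t n = static_cast<std::size_t>(end - beg);
  std::vector<LookupResult> out(n);

  std::vector<std::thread> workers;
  for (unsigned t = 0; t < num_threads; ++t) {
    workers.emplace_back([&, t] {
      // Strided partition: thread t handles positions t, t + num_threads, ...
      for (std::size_t i = t; i < n; i += num_threads) {
        auto it = map.find(beg + static_cast<int>(i));
        if (it != map.end()) {
          out[i].found = true;
          out[i].value = it->second;
        }
      }
    });
  }
  for (auto& w : workers) w.join();  // all lookups are done after this point
  return out;
}
```

Calling parallel_lookup(map, beg, end, 4) returns one slot per key in the range, which the caller can then print from a single thread.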

Can Su · Answer 2 · Fri Dec 29 2023 11:03:15 GMT+0800 (China Standard Time)

Thanks Greg! Another question, please: if I have a lot of (maybe tens of millions) simple <int, int> pairs to insert into a hash map, is it a good idea to use a parallel_flat_hash_map and insert using multiple threads? My concern is that the amount of data to insert is huge and a single insert operation is lightweight, which may make the mutex locking a significant overhead.

Gregory Popovitch · Answer 3 · Fri Dec 29 2023 12:34:05 GMT+0800 (China Standard Time)

Sure, it would still be much faster to use a parallel_flat_hash_map and multiple threads. Just make the N template parameter larger than the default of 4, maybe 10 or so, and there will be very little mutex contention.
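To illustrate why more submaps mean less contention, here is a toy sharded map with 2^N independently locked shards. It is an illustration only, not phmap's implementation: ShardedMap and fill_concurrently are hypothetical names, and std::unordered_map is used for the shards. With more shards, two threads inserting at the same moment are less likely to need the same mutex. Note that in phmap the mutex type is itself a template parameter and defaults to a no-op mutex, so internal locking applies only when a real mutex type such as std::mutex is supplied.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <thread>
#include <unordered_map>
#include <vector>

// Toy sharded map: 2^N independent submaps, each guarded by its own mutex.
template <std::size_t N>
class ShardedMap {
 public:
  static constexpr std::size_t kShards = std::size_t{1} << N;

  void insert(int key, int value) {
    // The low bits of the hash select the shard.
    Shard& s = shards_[std::hash<int>{}(key) & (kShards - 1)];
    std::lock_guard<std::mutex> lock(s.mtx);  // locks this shard only
    s.map.emplace(key, value);
  }

  std::size_t size() {
    std::size_t total = 0;
    for (Shard& s : shards_) {
      std::lock_guard<std::mutex> lock(s.mtx);
      total += s.map.size();
    }
    return total;
  }

 private:
  struct Shard {
    std::mutex mtx;
    std::unordered_map<int, int> map;
  };
  Shard shards_[kShards];
};

// Inserts keys 0..count-1 from `num_threads` threads; returns the final size.
template <std::size_t N>
std::size_t fill_concurrently(ShardedMap<N>& m, int count, int num_threads) {
  assert(num_threads > 0);
  std::vector<std::thread> workers;
  for (int t = 0; t < num_threads; ++t) {
    workers.emplace_back([&m, count, num_threads, t] {
      for (int k = t; k < count; k += num_threads) m.insert(k, k * 2);
    });
  }
  for (auto& w : workers) w.join();
  return m.size();
}
```

For example, ShardedMap<4> has 16 shards and ShardedMap<10> has 1024, so with eight inserting threads the chance that two of them hit the same shard at once drops sharply.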

Another possibility. Where are all those pairs you want to insert coming from? If you can iterate very quickly over them, you can actually insert in a parallel_flat_hash_map without locking at all.

Can Su · Answer 4 · Fri Dec 29 2023 16:14:05 GMT+0800 (China Standard Time)

> Another possibility. Where are all those pairs you want to insert coming from? If you can iterate very quickly over them, you can actually insert in a parallel_flat_hash_map without locking at all.

Hmm... I have an array storing m offsets into a file. What I want to do is iterate over the array, read some bytes starting at each offset into memory, then store the pair <offset, index> in the map. So I don't think the iteration can be done quickly. But how can iterating quickly make the insertion lock-free anyway?

Gregory Popovitch · Answer 5 · Fri Dec 29 2023 22:01:52 GMT+0800 (China Standard Time)

Actually that would work well. A parallel_flat_hash_map internally holds an array of submaps (with N=4 you have 16 submaps).
What you would do is start 16 threads and have each thread populate its own submap (thread #0 populates submap 0, and so on).
So each thread would:

- iterate over all the offsets.
- for each offset, check which submap it would go to (using the submap function).
- if that is not the thread's own submap, do nothing.
- if it is, read some bytes starting at the offset into memory, then store the pair <offset, index> in the map.

That's it. No locking is necessary, because each submap is only ever modified by a single thread.
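The steps above can be sketched as follows. This is a self-contained illustration under stated assumptions, not the library's API: a std::vector of std::unordered_map plays the role of the submap array, and submap_index, build_without_locks, and lookup are hypothetical helpers. The file read is left as a comment. The hash is computed once per offset and used to pick the owning submap.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for the submap array: 16 independent maps
// (matching N = 4) with no mutexes at all.
constexpr std::size_t kSubmaps = 16;
using Submaps = std::vector<std::unordered_map<long, std::size_t>>;

// Hypothetical helper: which submap a key with this hash belongs to.
inline std::size_t submap_index(std::size_t hashval) {
  return hashval & (kSubmaps - 1);
}

// Builds offset -> index with one thread per submap and no locking.
Submaps build_without_locks(const std::vector<long>& offsets) {
  Submaps submaps(kSubmaps);
  std::vector<std::thread> workers;
  for (std::size_t owner = 0; owner < kSubmaps; ++owner) {
    workers.emplace_back([&submaps, &offsets, owner] {
      // Every thread scans all offsets but only acts on the ones it owns.
      for (std::size_t i = 0; i < offsets.size(); ++i) {
        const std::size_t hashval = std::hash<long>{}(offsets[i]);
        if (submap_index(hashval) != owner) continue;  // another thread's key
        // (Reading bytes from the file at offsets[i] would happen here.)
        submaps[owner].emplace(offsets[i], i);
      }
    });
  }
  for (auto& w : workers) w.join();
  return submaps;
}

// Lookup, to be used after all builder threads have joined.
inline bool lookup(const Submaps& submaps, long offset,
                   std::size_t& index_out) {
  const auto& m = submaps[submap_index(std::hash<long>{}(offset))];
  auto it = m.find(offset);
  if (it == m.end()) return false;
  index_out = it->second;
  return true;
}
```

The trade-off is that every thread hashes every offset, so the scheme pays off when the per-key work that is skipped (here, the file read and the insertion) costs much more than computing a hash.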

This is what the bench does here. I really should write a better example.

Can Su · Answer 6 · Sat Dec 30 2023 11:12:40 GMT+0800 (China Standard Time)

That's cool! By the way, when using a lock-free parallel hash map, is there a difference among insertion methods, like operator[]/insert/emplace?

Gregory Popovitch · Answer 7 · Sat Dec 30 2023 12:48:52 GMT+0800 (China Standard Time)

If you use the method I indicated above, use emplace_with_hash and pass it the hashval you already computed when choosing the submap, so that it is not recomputed. Otherwise it doesn't make much difference when you insert a pair of integers.

Can Su · Answer 8 · Sat Dec 30 2023 14:50:51 GMT+0800 (China Standard Time)

Good to know! And thanks again for this awesome work and all your help!