greg7mdp / parallel-hashmap

A family of header-only, very fast and memory-friendly hashmap and btree containers.

Home Page: https://greg7mdp.github.io/parallel-hashmap/

Simple Parallel Example

atom-moyer opened this issue

Hi Greg,

I am trying to do a simple parallel example. For some reason I am only getting 100% CPU usage; could you help explain what is going wrong? I think the answer would make for an informative solution/piece of documentation for other people.

I am certain that my pragma directives are not being ignored by the compiler. I am not super familiar with OpenMP, so maybe I am doing something wrong there. I have tried setting the number of threads both in the pragma and on the command line.

Let me know if you need any more information. I have already spent a lot of time trying to understand all of the possible solutions, but it seems like it is annoyingly hard to parallelize a for loop like this. I don't really want to set up all of the threads myself and require them to be configured manually. OpenMP seems like it would be great because you can configure it on the command line.

#include <mutex>
#include <stdexcept>

#include <pybind11/pybind11.h>
#include <pybind11/numpy.h>
#include <parallel_hashmap/phmap.h>

namespace py = pybind11;

// Numpy-array-keyed dictionary wrapper (binding via PYBIND11_MODULE omitted).
template <typename Key, typename Value>
struct Dict {
    Dict () {}
    Dict ( Value default_value ) : default_value ( default_value ) {}

    // Bulk insert: dict[key_array] = value_array
    void __setitem__ ( py::array_t<Key> & key_array, py::array_t<Value> & value_array ) {
        auto * key_array_ptr = (Key *) key_array.request().ptr;
        auto * value_array_ptr = (Value *) value_array.request().ptr;

        if ( key_array.size() != value_array.size() )
            throw std::runtime_error("The size of the key and value arrays must match.");

        // Each insert_or_assign locks the target submap (std::mutex below),
        // so inserting from multiple OpenMP threads should be safe.
        #pragma omp parallel for
        for ( py::ssize_t idx = 0; idx < key_array.size(); idx++ ) {
            dict.insert_or_assign( key_array_ptr[idx], value_array_ptr[idx] );
        }
    }

    // Bulk lookup: returns dict[key_array], with default_value for missing keys.
    py::array_t<Value> __getitem__ ( py::array_t<Key> & key_array ) {
        auto * key_array_ptr = (Key *) key_array.request().ptr;

        auto result_array = py::array_t<Value> ( key_array.request().shape );
        auto * result_array_ptr = (Value *) result_array.request().ptr;

        #pragma omp parallel for
        for ( py::ssize_t idx = 0; idx < key_array.size(); idx++ ) {
            auto search = dict.find( key_array_ptr[idx] );

            if ( search != dict.end() ) {
                result_array_ptr[idx] = search->second;
            } else {
                result_array_ptr[idx] = default_value;
            }
        }

        return result_array;
    }

    Value default_value{};
    phmap::parallel_flat_hash_map<
        Key,
        Value,
        phmap::priv::hash_default_hash<Key>,
        phmap::priv::hash_default_eq<Key>,
        phmap::priv::Allocator<phmap::priv::Pair<const Key, Value>>,
        4,          // 2^4 = 16 submaps
        std::mutex  // internal locking, needed for concurrent writers
    > dict;
};

Hmm, getting 100% CPU usage means you are using all cores to the max. This seems like it is working fine to me. What am I missing?

Hmm. I would have expected it to be at 600-800% in top. Also, if I compile without the pragma, I still get the same runtime and 100% CPU usage.

But from inspection, you think that this should work?

I'm not that knowledgeable about OpenMP either, but the code looks fine to me. Did you compile and link with -fopenmp?

Also maybe you need curly braces after the #pragma:

#pragma omp parallel for
{
        for ( size_t idx = 0; idx < key_array.size(); idx++ ) {
           ...
        }
}

OK just checked and you don't need braces.
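
For reference, the combined #pragma omp parallel for construct has to be followed immediately by the loop statement, so braces around it are not valid; the braces only appear in the split form. A small illustration (n is just a stand-in for the loop bound):

// Combined construct: the pragma applies directly to the for loop that follows it.
#pragma omp parallel for
for ( size_t idx = 0; idx < n; idx++ ) {
    // loop body
}

// Equivalent split form: here a braced structured block wraps the worksharing loop,
// which is probably where the braces idea comes from.
#pragma omp parallel
{
    #pragma omp for
    for ( size_t idx = 0; idx < n; idx++ ) {
        // loop body
    }
}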

I am nearly certain that I am compiling and linking with -fopenmp properly. I get errors if I write invalid #pragma omp directives, which would simply be ignored if OpenMP were not enabled.

I have some time today to work on OMP/std::thread-based solutions. I suppose it would be helpful if you could write an example using std::thread to split a for loop like this. I think it could go right into the example dir.

Or maybe, it would be ideal if the parallel maps/sets had an optimized parallel function for assigning or finding vectors of Key/Value.

Sure, I'll write an example, but not right now :-)
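
In the meantime, here is a minimal sketch of what such a std::thread split could look like. It is not the eventual example in the repo; parallel_insert and the chunking scheme are purely illustrative, and it assumes the same std::mutex-protected parallel map as in the Dict above.

#include <algorithm>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

#include <parallel_hashmap/phmap.h>

// Split [0, n) into one chunk per thread; each thread inserts its chunk into the
// parallel map, relying on the map's internal std::mutex locking for safety.
// parallel_insert is an illustrative helper name, not part of phmap.
template <typename Map, typename Key, typename Value>
void parallel_insert(Map& map, const Key* keys, const Value* values, std::size_t n)
{
    std::size_t num_threads = std::thread::hardware_concurrency();
    if (num_threads == 0) num_threads = 4;
    std::size_t chunk = (n + num_threads - 1) / num_threads;

    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < num_threads; ++t) {
        std::size_t begin = t * chunk;
        std::size_t end   = std::min(n, begin + chunk);
        if (begin >= end) break;
        workers.emplace_back([&, begin, end] {
            for (std::size_t i = begin; i < end; ++i)
                map.insert_or_assign(keys[i], values[i]);
        });
    }
    for (auto& w : workers) w.join();
}

int main()
{
    phmap::parallel_flat_hash_map<
        int, int,
        phmap::priv::hash_default_hash<int>,
        phmap::priv::hash_default_eq<int>,
        phmap::priv::Allocator<phmap::priv::Pair<const int, int>>,
        4, std::mutex> dict;

    std::vector<int> keys(1000000), values(1000000);
    for (int i = 0; i < (int)keys.size(); ++i) { keys[i] = i; values[i] = 2 * i; }

    parallel_insert(dict, keys.data(), values.data(), keys.size());
    return dict.size() == keys.size() ? 0 : 1;
}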

@atom-moyer could it be the OMP_NUM_THREADS env var setting? Could you check by adding #include <omp.h> and printing the return value of omp_get_max_threads() right before the two pragma sections?
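
For instance, a tiny standalone check, built with the same -fopenmp flags, might look like this; if it prints 1, the loops will effectively run serially:

#include <cstdio>
#include <omp.h>

int main()
{
    // Reports how many threads OpenMP will use by default; it honors OMP_NUM_THREADS.
    std::printf("omp_get_max_threads() = %d\n", omp_get_max_threads());
    return 0;
}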

@greg7mdp is the read operation here thread-safe?

            if ( search != dict.end() ) {
                result_array_ptr[idx] = search->second;
            } else {
                result_array_ptr[idx] = default_value;
            }

Looks like writes by insert_or_assign() are thread-safe, but the iterator returned could be invalidated.

@jrcavani You are correct, the iterator returned by find() is not safe to use in a multithreaded context if the phmap is modified. Just looking at the code above, dict is not modified, and because Python has a global lock, maybe this code is fine.

However, if there is a chance that the map could be changed in another thread, the safe way to do this would be to use an extended phmap API (which calls a lambda within the phmap internal lock), like:

for ( size_t idx = 0; idx < key_array.size(); idx++ ) {
    if (!dict.if_contains(key_array_ptr[idx],
                          [&](const typename decltype(dict)::value_type& v) { result_array_ptr[idx] = v.second; }))
        result_array_ptr[idx] = default_value;
}

Thanks! Maybe that's why @atom-moyer is seeing only 100% CPU usage - GIL. I'm not sure how Python maintains it once two Python threads decide to do concurrent reads and writes.

@jrcavani I don't think so.

@jrcavani @greg7mdp I think my issue was the macOS dev environment (OpenMP support seems to be in a state of shambles with the clang toolchain on macOS).

I now tested on a Linux system, and I am getting ~350-400% CPU usage during insertion. I can confirm that multiple threads are being used now, and they respect the OMP_NUM_THREADS env variable. With 4 threads I get about 100M insertions in 20 seconds, and with 1 thread about 100M insertions in 35 seconds. It is around 2.5 times slower than the benchmark that Greg originally posted, but that is still absurdly good for numpy.

This is with the mutex lock (which I think I need if the table is writable, correct?). Also, it seems that the OMP solution does not preserve the iteration order of the for loop except when you have 1 thread, which I guess makes sense.

I think we can conclude that this is an easy way to parallelize the phmap, but it could use some optimization if it is the official example.
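
Regarding the mutex question above: the internal per-submap lock is indeed what makes concurrent writers safe, and the difference is just the last template parameter. A small sketch, reusing the declaration from the original post with int keys/values for brevity:

#include <mutex>
#include <parallel_hashmap/phmap.h>

// With std::mutex each submap is locked internally, so concurrent writers are safe
// (this is what the Dict example above uses).
using LockedDict = phmap::parallel_flat_hash_map<
    int, int,
    phmap::priv::hash_default_hash<int>,
    phmap::priv::hash_default_eq<int>,
    phmap::priv::Allocator<phmap::priv::Pair<const int, int>>,
    4, std::mutex>;

// With the default phmap::NullMutex there is no internal locking: fine for
// single-threaded use or purely concurrent reads, but not for concurrent writes.
using UnlockedDict = phmap::parallel_flat_hash_map<
    int, int,
    phmap::priv::hash_default_hash<int>,
    phmap::priv::hash_default_eq<int>,
    phmap::priv::Allocator<phmap::priv::Pair<const int, int>>,
    4, phmap::NullMutex>;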

I added a small example. I should add another one showing the other extended APIs.
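
By way of illustration only (not the example that was actually added), the write side could be expressed with try_emplace_l, which, like if_contains, runs the lambda under the submap's internal lock; the exact signature should be double-checked against the current README.

// Sketch: insert_or_assign semantics via the extended API, inside __setitem__.
// Assumes try_emplace_l(key, lambda_if_present, args_if_absent...) as in the README.
#pragma omp parallel for
for ( py::ssize_t idx = 0; idx < key_array.size(); idx++ ) {
    dict.try_emplace_l(
        key_array_ptr[idx],
        [&](typename decltype(dict)::value_type& v) { v.second = value_array_ptr[idx]; }, // key present: overwrite
        value_array_ptr[idx]);                                                            // key absent: insert
}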