greg7mdp / parallel-hashmap

A family of header-only, very fast and memory-friendly hashmap and btree containers.

Home Page:https://greg7mdp.github.io/parallel-hashmap/

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

nvc++: integer conversion resulted in a change of sign

marioroy opened this issue · comments

I tried nvc++ (NVIDIA HPC SDK) and thought to pass this along. I ran git pull and have the latest from master.

$ nvc++ -o llil4hmap -std=c++20 -fopenmp -Wall -O3 llil4hmap.cc -I./parallel-hashmap
"./parallel-hashmap/parallel_hashmap/phmap.h", line 434: warning: integer conversion resulted in a change of sign [integer_sign_change]
          auto special = _mm_set1_epi8(static_cast<uint8_t>(kSentinel));
                                       ^
Remark: individual warnings can be suppressed with "--diag_suppress <warning-name>"

"./parallel-hashmap/parallel_hashmap/phmap.h", line 442: warning: integer conversion resulted in a change of sign [integer_sign_change]
          auto special = _mm_set1_epi8(static_cast<uint8_t>(kSentinel));
                                       ^

The llil4hmap.cc demonstration is found in my gist repo.

https://gist.github.com/marioroy/3924c48e140f8330f25f67cd98a815ef

I reached out to NVIDIA. The parallel-hashmap usage in llil4hmap is not a typical pattern, but something I tried out of curiosity at the time. I replied to try llil4map.cc instead for investigating why nvc++ is noticeably slower compared to clang++.

https://forums.developer.nvidia.com/t/nvfortran-23-9-problem/270618/13?u=marioeroy

Thank you @marioroy , I'm very busy right now, but I'll look into this when I can I promise.

Have met the same warnings with nvcc. Seems simply including the header will trigger the warning:

#include <phmap.h>

int main() {}

Compiled with nvcc -Iinclude/parallel_hashmap test.cu -o test. Warnings:

include/parallel_hashmap/phmap.h(434): warning #68-D: integer conversion resulted in a change of sign

include/parallel_hashmap/phmap.h(442): warning #68-D: integer conversion resulted in a change of sign

include/parallel_hashmap/phmap.h(434): warning #68-D: integer conversion resulted in a change of sign

include/parallel_hashmap/phmap.h(442): warning #68-D: integer conversion resulted in a change of sign

I tried the suggestion, emitted by nvc++. Individual warnings can be suppressed with --diag_suppress <warning-name>.

nvc++ --diag_suppress=integer_sign_change ...

I also tried nvc++ to build a Python extension, Accelerating Python on GPUs with nvc++ and Cython.

nvc++ --diag_suppress=declared_but_not_referenced,identifier_not_keyword,set_but_not_used ...

Warnings are fixed with latest version.

Hi, @greg7mdp

The NVIDIA nvc++ compiler is interesting. I fixed the three map demonstrations llil4map.cc, llil4hmap.cc, and llil4emh.cc. No more slowness using nvc++.

  1. Changed std::mutex to included spinlock_mutex. Noticeably better performance. However, this resolves nvc++ taking greater than 40 seconds to compile llil4hmap.cc and llil4emh.cc.
  2. Fixed out_properties. This ran poorly using nvc++, taking > 4 seconds. The program built with clang++ outputs in 0.7 seconds. It turns out that allocating a basic string inside a parallel loop causes nvc++ OpenMP to perform worst than non-parallel.
#ifdef MAX_STR_LEN_L
      // std::basic_string<char> s { it->first.data(), MAX_STR_LEN_L };
      // str.append(s.c_str());
         str.append(it->first.data());
#else
         str.append(it->first.data(), it->first.size());
#endif

Results:

llil4map consumes the least memory. The llil4hmap demonstration is similar to llil4map, but computes the hash_value one time per key. This was done out of curiosity. The llil4emh demonstration uses emhash7::HashMap for comparison.

$ NUM_THREADS=60 ./llil4map in/big* in/big* in/big* | cksum
llil4map (fixed string length=12) start
use OpenMP
use boost sort        g++            clang++        nvc++
get properties         7.864 secs     7.866 secs     7.973 secs
map to vector          1.140 secs     0.881 secs     0.876 secs
vector stable sort     1.681 secs     1.098 secs     1.105 secs
write stdout           0.624 secs     0.595 secs     0.590 secs
total time            11.311 secs    10.441 secs    10.546 secs
    count lines     970195200
    count unique    200483043
2057246516 1811140689

$ NUM_THREADS=60 ./llil4hmap in/big* in/big* in/big* | cksum
llil4hmap (fixed string length=12) start
use OpenMP
use boost sort        g++            clang++        nvc++
get properties         7.245 secs     7.197 secs     7.617 secs
map to vector          1.179 secs     0.841 secs     0.809 secs
vector stable sort     1.682 secs     1.103 secs     1.113 secs
write stdout           0.647 secs     0.573 secs     0.571 secs
total time            10.754 secs     9.715 secs    10.112 secs
    count lines     970195200
    count unique    200483043
2057246516 1811140689

$ NUM_THREADS=60 ./llil4emh in/big* in/big* in/big* | cksum
llil4emh (fixed string length=12) start
use OpenMP
use boost sort        g++            clang++        nvc++
get properties         6.150 secs     6.250 secs     6.392 secs
map to vector          1.016 secs     0.877 secs     0.832 secs
vector stable sort     1.681 secs     1.127 secs     1.094 secs
write stdout           0.608 secs     0.561 secs     0.595 secs
total time             9.457 secs     8.817 secs     8.914 secs
    count lines     970195200
    count unique    200483043
2057246516 1811140689

Thank you for the tip to reclaim memory.

MyMap().swap(map); // swap map with an empty temporary, which is immediately destroyed

Thanks, interesting data, emhash seems to be pretty good. Personally, I'm not really focusing on benchmarks, but more on having the hash map perform well in most cases.

Previously, the program ran poorly using nvc++, particularly "write stdout". The spinlock_mutex sped up "get properties".

$ ./llil4hmap /data1/input/big* | cksum
llil4hmap (fixed string length=12) start
use OpenMP
use boost sort        Before          After
get properties         3.804 secs      2.797 secs
hmap to vector         0.697 secs      0.792 secs
vector stable sort     1.104 secs      1.148 secs
write stdout           5.071 secs      0.564 secs
total time            10.678 secs      5.302 secs
    count lines     323398400
    count unique    200483043
701308064 1804347429

That's great, thanks for doing that. Maybe I should add the spinlock_mutex to phmap so people can easily try it out?