Trace lines are not sorted for the develop branch using prof2
mxz297 opened this issue · comments
@jmellorcrummey @blue42u Trace lines generated by hpctoolkit's develop branch are not sorted by ranks. And one rank appears to show up on multiple nodes. This issue does not show up for the master branch. So this is not a viewer problem.
Below is a screenshot from the database generated from develop. The light brown trace lines represent helper threads that are always idle. There should be one such thread per rank, so these trace lines should be evenly spaced:
Below is a screenshot from the database generated from master. Traces look reasonable:
I looked into this a bit. Perlmutter has a known issue (3rd-from-last in https://docs.nersc.gov/current/) that "nodes on Perlmutter currently do not get a constant `hostid` (IP address) response." So not only will co-located ranks potentially not get the same hostid, but multiple calls to `gethostid` may return different values. Which breaks our code that assumes the hostid is constant...
...Except it doesn't! Most of the code only calls `gethostid` via `OSUtil_hostid`, which contains an internal process-wide (and yes, thread-unsafe) cache for the `gethostid` return value. The only cases that don't are the two added for Prof2:
hpctoolkit/src/tool/hpcrun/gpu/gpu-trace.c
Line 280 in dfb2b75
hpctoolkit/src/tool/hpcrun/thread_data.c
Line 537 in dfb2b75
So, 2-line patch. Plus ~5 lines to fix the thread-unsafety of the cache in `OSUtil_hostid` if that's a concern.
While looking into that, I also discovered that the large negative `NODE` values come from these two lines as well. `gethostid` returns a `long` but treats it as a 32-bit integer; on 64-bit machines `long` is 64-bit, so the result is a sign-extended value. When this then gets cast to a `uint64_t` for the `id_tuple_push_back` argument it retains the sign extension, creating a very large value (and Java doesn't have unsigned integers, so it comes out as negative in the Viewer). `OSUtil_hostid` handles this oddity internally (by casting to `uint32_t` and back), so two birds with one 2-line patch.
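The sign-extension effect described above can be reproduced in isolation. The following is a minimal sketch, not HPCToolkit code: `widen_naive` and `widen_masked` are hypothetical names chosen for illustration, and the input is any `long` that holds a negative 32-bit value, as a `gethostid` result may on a 64-bit machine.

```c
#include <stdint.h>

/* Hypothetical helper: widen a long straight to uint64_t.
 * A negative input keeps its sign extension, so the upper
 * 32 bits of the result are all ones. */
uint64_t widen_naive(long raw) {
    return (uint64_t)raw;
}

/* Hypothetical helper: truncate to 32 bits first, then widen.
 * The upper 32 bits of the result are always zero. */
uint64_t widen_masked(long raw) {
    return (uint64_t)(uint32_t)raw;
}
```

For example, `widen_naive((long)INT32_MIN)` yields `0xFFFFFFFF80000000`, which a reader with only signed 64-bit integers would display as a negative number, whereas `widen_masked((long)INT32_MIN)` yields `0x80000000`.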
@jmellorcrummey If there are no additional concerns I'll just push a 2-line commit to `develop` to get rid of this issue.
@blue42u I presume what you mean by the two-line patch is replacing the `gethostid` calls above with `OSUtil_hostid`. That seems fine. Fixing `OSUtil_hostid` to address the following two issues would also be useful:
- host ids are treated as unsigned, so no sign extension occurs
- the cache is thread-safe
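One way to meet both points is sketched below. This is an assumption-laden illustration, not the actual `OSUtil_hostid` implementation: `cached_hostid`, `hostid_cache`, and `HOSTID_UNSET` are hypothetical names, and the approach (a C11 atomic with first-writer-wins compare-exchange) is just one option. It guarantees that every thread in the process sees the same value even if `gethostid` itself returns different results across calls.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 600   /* request the gethostid() declaration */
#endif
#include <stdatomic.h>
#include <stdint.h>
#include <unistd.h>

/* Sentinel outside the 32-bit range, so it can never equal a real host id. */
#define HOSTID_UNSET UINT64_MAX

static _Atomic uint64_t hostid_cache = HOSTID_UNSET;

/* Hypothetical helper: return a process-wide, unsigned 32-bit host id. */
uint32_t cached_hostid(void) {
    uint64_t seen = atomic_load(&hostid_cache);
    if (seen == HOSTID_UNSET) {
        /* Truncate to 32 bits to discard any sign extension. */
        uint64_t fresh = (uint32_t)gethostid();
        if (atomic_compare_exchange_strong(&hostid_cache, &seen, fresh)) {
            seen = fresh;   /* this thread published the value */
        }
        /* Otherwise another thread won the race, and `seen` now holds
         * the value it stored; use that so all callers agree. */
    }
    return (uint32_t)seen;
}
```

Several threads may call `gethostid` concurrently on first use, but only the first stored result is ever returned, so repeated calls to `cached_hostid()` always agree within a process.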