HPCToolkit / hpctoolkit

HPCToolkit performance tools: measurement and analysis components

Trace lines are not sorted for the develop branch using prof2

mxz297 opened this issue · comments

@jmellorcrummey @blue42u Trace lines generated by hpctoolkit's develop branch are not sorted by rank, and a single rank appears to show up on multiple nodes. This issue does not show up on the master branch, so it is not a viewer problem.

Below is a screenshot from the database generated from develop. The light brown trace lines represent some helper threads that are always idle. There should be one such thread per rank, so these trace lines should be evenly spaced:

image

Below is a screenshot from the database generated from master. Traces look reasonable:

image

I looked into this a bit. Perlmutter has a known issue (3rd-from-last in https://docs.nersc.gov/current/) that "nodes on Perlmutter currently do not get a constant hostid (IP address) response." So not only will co-located ranks potentially not get the same hostid, but multiple calls to gethostid may return different values. Which breaks our code that assumes the hostid is constant...

...Except it doesn't! Most of the code only calls gethostid via OSUtil_hostid, which contains an internal process-wide (and yes, thread-unsafe) cache for the gethostid return value. The only cases that don't are the two added for Prof2:

So, 2-line patch. Plus ~5 lines to fix the thread-unsafety of the cache in OSUtil_hostid if that's a concern.

While looking into that, I also discovered that the large negative NODE values come from these two lines as well. gethostid returns a long, but the host id itself is only a 32-bit integer; on 64-bit machines long is 64 bits, so the returned value is sign-extended. When it is then cast to a uint64_t for the id_tuple_push_back argument, it retains the sign extension, creating a very large value (and Java doesn't have unsigned integers, so it comes out as negative in the Viewer). OSUtil_hostid handles this oddity internally (by casting to uint32_t and back), so two birds with one 2-line patch.
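To make the sign-extension point concrete, here is a minimal sketch in C. The helper names `widen_direct` and `widen_masked` are hypothetical and are not taken from the HPCToolkit source; they only contrast casting a long straight to uint64_t with truncating to uint32_t first, which is the behavior attributed to OSUtil_hostid above.

```c
#include <stdint.h>

/* Hypothetical helper: widen a host id the way the two Prof2 call sites
 * are described as doing. A negative long keeps its sign extension when
 * converted to uint64_t, so the upper 32 bits become all ones. */
static uint64_t widen_direct(long hostid) {
    return (uint64_t)hostid;
}

/* Hypothetical helper: truncate to 32 bits before widening, so only the
 * low 32 bits of the host id survive and the upper bits are zero. */
static uint64_t widen_masked(long hostid) {
    return (uint64_t)(uint32_t)hostid;
}
```

For a host id whose low 32 bits are 0x8A1B2C3D (top bit set, so the long is negative), `widen_direct` gives 0xFFFFFFFF8A1B2C3D, which a signed 64-bit reader such as Java shows as a negative number, while `widen_masked` gives 0x8A1B2C3D.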

@jmellorcrummey If there are no additional concerns I'll just push a 2-line commit to develop to get rid of this issue.

@blue42u I presume what you mean by the two-line patch is replacing the gethostid calls above with OSUtil_hostid. That seems fine. Fixing OSUtil_hostid to address the following two issues would also be useful:

  • we treat host ids as unsigned so we don't do sign extension
  • the process is thread safe
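One possible shape for a cache that covers both points is sketched below. This is not the actual OSUtil_hostid implementation: the name `cached_hostid` and the choice of `pthread_once` are assumptions made for illustration. The idea is that gethostid is called exactly once per process, so every later caller sees the same value even if the system would return something different on a second call, and the value is stored as uint32_t so no sign extension can occur when it is widened.

```c
#define _DEFAULT_SOURCE  /* ask glibc to declare gethostid() */
#include <pthread.h>
#include <stdint.h>
#include <unistd.h>

static pthread_once_t hostid_once = PTHREAD_ONCE_INIT;
static uint32_t hostid_value;

/* Runs at most once per process, even with concurrent callers. */
static void hostid_init(void) {
    /* Keep only the low 32 bits; this drops any sign extension
     * carried by the long that gethostid() returns. */
    hostid_value = (uint32_t)gethostid();
}

/* Hypothetical stand-in for a thread-safe, unsigned host id getter. */
uint32_t cached_hostid(void) {
    pthread_once(&hostid_once, hostid_init);
    return hostid_value;
}
```

Because the result is already unsigned, a caller can pass `(uint64_t)cached_hostid()` to something like id_tuple_push_back without producing the large values described earlier.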