Detecting and linking against tcmalloc if present
ipelupessy opened this issue
This issue serves as a reminder for a possible addition to the build system: (optionally) detecting and linking against the tcmalloc library (Google's thread-aware allocator). As such, it should probably wait for #950.
The motivation is improved performance in some parts of the xtp code (GW), as can be seen in the attached timings.
alloc.pdf
(This plot shows timings of an xtp_tools -e dftgwbse task run for the benzene molecule as a function of the number of threads. Colors indicate the different tasks in the run (DFT, GW, or BSE) and line styles indicate the allocator: the default (ptmalloc), tcmalloc, or jemalloc, though tcmalloc and jemalloc are mostly very similar. As can be seen, the GW task in particular is sensitive to the malloc used.)
As an alternative, the code involved could be examined in more detail and the offending allocations perhaps refactored out.
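For what it's worth, allocator comparisons like the one above can also be run without any build-system changes, by preloading the allocator at run time. A minimal sketch; the helper name, the TCMALLOC_LIB variable, and the default library path are all assumptions (adjust to wherever your distribution installs gperftools), not part of the VOTCA tooling:

```shell
# Hypothetical helper: run a command under tcmalloc via LD_PRELOAD, without
# relinking. TCMALLOC_LIB and its default path are assumptions; adjust them
# to your system.
run_with_tcmalloc() {
    lib="${TCMALLOC_LIB:-/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4}"
    if [ -f "$lib" ]; then
        # Interpose malloc/free for the child process only.
        LD_PRELOAD="$lib" "$@"
    else
        echo "tcmalloc not found at $lib; using the default allocator" >&2
        "$@"
    fi
}
```

Usage would then be, for example: run_with_tcmalloc xtp_tools -e dftgwbse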
Out of curiosity, which options did you use to run GW?
I used the same setup, I think, as in your benchmarks:
-c gwbse.gw.mode=G0W0 dftpackage.auxbasisset=aux-def2-tzvp gwbse.ranges=full
Okay, very interesting. I did not know GW had that many allocations, especially with the ppm.
Just changing gw.mode to evGW slows it down tremendously, but there is a similar speedup with tcmalloc, e.g. 1250 vs. 1900 s (for a 2-core test).
But there are a lot of other options (some of which I am testing); any particular suggestions for things to try?
evGW is "just" repeating the G0W0 steps several times, so I guess the scaling is the same-ish, just overall N times slower.
As for other things to try: have you tried exact instead of ppm? (Before getting crazy with cda...)
By the way, what system size are you using?
I am testing with benzene here (and also doing tests with naphthalene).
(Indeed, I figured that evGW involves a lot of repeats of the same calls.)
exact shows the same pattern (550 s vs. 400 s with tcmalloc).
cda is a lot slower again (5500 s). I think I also had to change to gwbse.gw.qp_solver=cda for this (otherwise it seemed to freeze at some point, though maybe I was just too impatient). In this case there is no difference with tcmalloc.
I also tried gwbse.gw.qp_solver=fixedpoint (the other option besides cda and grid). With the default ppm and G0W0 this is faster, and it shows some speedup with tcmalloc (9.2 vs. 7.3 s).
I am not sure, but qp_solver=cda should not be there. @JensWehner?
From what you see there, the allocation business seems to be related to solving the QP equations with the grid solver. The fixed-point solver is expected to require far fewer evaluations of sigma (but it can find "wrong" solutions).
And as you saw yourself, CDA is a disaster. Grid is practically impossible to run. Maybe you can try with a very small molecule, like CO.
qp_solver=cda should not be valid; I am surprised it worked. Yes, CDA is not very performant, but that was expected.
Yeah, I think the option is fixed now.
@ipelupessy the cmake refactor is done; find_package(Threads REQUIRED) only shows up once in the main CMakeLists.txt file, so adding tcmalloc support should be easier.
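For the record, a minimal sketch of what the optional detection could look like. The option name, the cache variable, and the xtp_tools target are assumptions for illustration, not existing VOTCA CMake code:

```cmake
# Hypothetical: opt-in tcmalloc support. All names here are illustrative.
option(USE_TCMALLOC "Link against gperftools tcmalloc" OFF)
if(USE_TCMALLOC)
  # gperftools ships both a full and a minimal variant of the library.
  find_library(TCMALLOC_LIBRARY NAMES tcmalloc_minimal tcmalloc)
  if(NOT TCMALLOC_LIBRARY)
    message(FATAL_ERROR "USE_TCMALLOC was requested but tcmalloc was not found")
  endif()
  # Linking the allocator into the final executable is sufficient; no headers
  # are needed, since malloc/free are simply interposed at link time.
  target_link_libraries(xtp_tools PRIVATE ${TCMALLOC_LIBRARY})
endif()
```

A FindTcmalloc.cmake module would be the more polished route, but the find_library call above captures the essential step.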