votca / votca

The source of the votca-csg and xtp packages


Detecting and linking against tcmalloc if present

ipelupessy opened this issue · comments

This issue serves as a reminder of a possible addition to the build system: optionally detecting and
linking against the tcmalloc library (Google's thread-caching allocator). As such it should probably wait for #950.
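A minimal sketch of what such optional detection could look like in CMake (the option name `ENABLE_TCMALLOC` and the target name `votca_xtp` are assumptions, not the actual names in the VOTCA build system):

```cmake
# Hypothetical sketch: optionally detect and link tcmalloc.
option(ENABLE_TCMALLOC "Link against tcmalloc if found" OFF)

if(ENABLE_TCMALLOC)
  # tcmalloc usually ships as libtcmalloc or libtcmalloc_minimal
  find_library(TCMALLOC_LIBRARY NAMES tcmalloc tcmalloc_minimal)
  if(TCMALLOC_LIBRARY)
    message(STATUS "Found tcmalloc: ${TCMALLOC_LIBRARY}")
    # Target name below is a placeholder for the actual library target.
    target_link_libraries(votca_xtp PRIVATE ${TCMALLOC_LIBRARY})
  else()
    message(FATAL_ERROR "ENABLE_TCMALLOC was set but tcmalloc was not found")
  endif()
endif()
```

Keeping the option OFF by default would leave the current behavior unchanged for existing users.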

The reason for this is the improved performance for some parts of the xtp code (GW).
This can be seen in the attached timings.
alloc.pdf
(This plot shows timings of an xtp_tools -e dftgwbse run for the benzene molecule as a function of the number of threads. Different
colors indicate different tasks in the run (DFT, GW, or BSE) and line styles indicate different allocators: default (ptmalloc), tcmalloc, or jemalloc, though tcmalloc and jemalloc are mostly very similar. As can be seen, the GW task in particular is sensitive to the
malloc used.)
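As a side note, allocator comparisons like the one in the plot can also be made without any build-system changes by preloading the allocator at run time; the library paths below are assumptions and vary by distribution:

```shell
# Run with the default allocator (glibc's ptmalloc)
xtp_tools -e dftgwbse ...

# Same run with tcmalloc preloaded (path is distribution-dependent)
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4 xtp_tools -e dftgwbse ...

# Or with jemalloc
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 xtp_tools -e dftgwbse ...
```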

As an alternative, the code involved could be examined in more detail and the offending allocations perhaps refactored out.

commented

Out of curiosity, which options did you use to run GW?

I used the same setup I think as in your benchmarks:
-c gwbse.gw.mode=G0W0 dftpackage.auxbasisset=aux-def2-tzvp gwbse.ranges=full

commented

okay very interesting.

commented

I did not know GW had that many allocations, especially with the PPM.

Just changing gw.mode to evGW slows it down tremendously, but there is a similar speedup with tcmalloc, e.g. 1250 vs 1900 sec (for the 2-core test).

but there are a lot of other options (some of which I am testing) - any particular suggestions for things to try?

evGW "just" repeats the G0W0 steps several times, so I guess the scaling is the same-ish, just overall N times slower.

> but there are a lot of other options (some of which I am testing) - any particular suggestions for things to try?

Have you tried exact instead of ppm? (Before getting crazy with cda...)

commented

btw, which system size are you using?

I am testing with benzene here (also doing tests with naphthalene).
(Indeed, I figured that evGW involves a lot of repeats of the same calls.)

exact shows the same pattern (550 s vs 400 s with tcmalloc);

cda is a lot slower again (5500 sec). I think I also had to change to gwbse.gw.qp_solver=cda for this (otherwise it seemed to freeze at some point, but maybe I was too impatient). There is no difference with tcmalloc in this case.

I also tried gwbse.gw.qp_solver=fixedpoint (the other option besides cda and grid). With the default ppm and G0W0 this is faster, and it shows some speedup with tcmalloc (9.2 vs 7.3 sec).

I am not sure, but qp_solver=cda should not be there. @JensWehner ?

From what you see there, the alloc business seems to be related to solving the QP equations on a grid. The fixed-point solver is expected to require far fewer evaluations of sigma (but it can find "wrong" solutions).

And as you saw yourself, CDA is a disaster. Grid is practically impossible to run. Maybe you can try with a very small molecule, like CO.

commented

qp_solver=cda should not be valid; I am surprised it worked. Yes, CDA is not very performant, but that was expected.

yea, I think the option is fixed now

@ipelupessy the cmake refactor is done, find_package(Threads REQUIRED) only shows up once in the main CMakeLists.txt file, so adding tcmalloc support should be easier.