Detecting and linking against tcmalloc if present
ipelupessy opened this issue
This issue serves as a reminder for a possible addition to the build system: (optionally) detecting and linking against the tcmalloc library (Google's thread-aware allocator). As such, it should probably wait for #950.
The motivation is improved performance in some parts of the xtp code (GW), as can be seen in the attached timings.
alloc.pdf
(This plot shows timings of an xtp_tools -e dftgwbse task run for the benzene molecule as a function of the number of threads. Colors indicate the different tasks in the run (DFT, GW, or BSE) and line styles indicate the allocator: the default (ptmalloc), tcmalloc, or jemalloc, though tcmalloc and jemalloc are mostly very similar. As can be seen, the GW task in particular is sensitive to the malloc used.)
As an alternative, the code involved could be examined in more detail and the offending allocations perhaps refactored out.
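For what it's worth, allocator comparisons like the one above can also be run without any build-system changes, by preloading the allocator at run time. A minimal sketch; the helper name, the TCMALLOC_LIB variable, and the default library path are all assumptions (adjust to wherever your distribution installs gperftools), not part of the VOTCA tooling:

```shell
# Hypothetical helper: run a command under tcmalloc via LD_PRELOAD, without
# relinking. TCMALLOC_LIB and its default path are assumptions; adjust them
# to your system.
run_with_tcmalloc() {
    lib="${TCMALLOC_LIB:-/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4}"
    if [ -f "$lib" ]; then
        # Interpose malloc/free for the child process only.
        LD_PRELOAD="$lib" "$@"
    else
        echo "tcmalloc not found at $lib; using the default allocator" >&2
        "$@"
    fi
}
```

Usage would then be, for example: run_with_tcmalloc xtp_tools -e dftgwbse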
Out of curiosity, which options did you use to run GW?
I used the same setup, I think, as in your benchmarks:
-c gwbse.gw.mode=G0W0 dftpackage.auxbasisset=aux-def2-tzvp gwbse.ranges=full
Okay, very interesting. I did not know GW had that many allocations, especially with the ppm.
Just changing gw.mode to evGW slows it down tremendously, but there is a similar speedup with tcmalloc, e.g. 1250 vs. 1900 s (for a 2-core test).
But there are a lot of other options (some of which I am testing); any particular suggestions for things to try?
evGW is "just" repeating the G0W0 steps several times, so I guess the scaling is the same-ish, just overall N times slower.
As for other things to try: have you tried exact instead of ppm? (Before getting crazy with cda...)
By the way, what system size are you using?
I am testing with benzene here (and also doing tests with naphthalene).
(Indeed, I figured that evGW involves a lot of repeats of the same calls.)
exact shows the same pattern (550 s vs. 400 s with tcmalloc).
cda is a lot slower again (5500 s). I think I also had to change to gwbse.gw.qp_solver=cda for this (otherwise it seemed to freeze at some point, though maybe I was just too impatient). In this case there is no difference with tcmalloc.
I also tried gwbse.gw.qp_solver=fixedpoint (the other option besides cda and grid). With the default ppm and G0W0 this is faster, and it shows some speedup with tcmalloc (9.2 vs. 7.3 s).
I am not sure, but qp_solver=cda should not be there. @JensWehner?
From what you see there, the allocation business seems to be related to solving the QP equations with the grid solver. The fixed-point solver is expected to require far fewer evaluations of sigma (but it can find "wrong" solutions).
And as you saw yourself, CDA is a disaster. Grid is practically impossible to run. Maybe you can try with a very small molecule, like CO.
qp_solver=cda should not be valid; I am surprised it worked. Yes, CDA is not very performant, but that was expected.
Yeah, I think the option is fixed now.
@ipelupessy the cmake refactor is done; find_package(Threads REQUIRED) only shows up once in the main CMakeLists.txt file, so adding tcmalloc support should be easier.
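For the record, a minimal sketch of what the optional detection could look like. The option name, the cache variable, and the xtp_tools target are assumptions for illustration, not existing VOTCA CMake code:

```cmake
# Hypothetical: opt-in tcmalloc support. All names here are illustrative.
option(USE_TCMALLOC "Link against gperftools tcmalloc" OFF)
if(USE_TCMALLOC)
  # gperftools ships both a full and a minimal variant of the library.
  find_library(TCMALLOC_LIBRARY NAMES tcmalloc_minimal tcmalloc)
  if(NOT TCMALLOC_LIBRARY)
    message(FATAL_ERROR "USE_TCMALLOC was requested but tcmalloc was not found")
  endif()
  # Linking the allocator into the final executable is sufficient; no headers
  # are needed, since malloc/free are simply interposed at link time.
  target_link_libraries(xtp_tools PRIVATE ${TCMALLOC_LIBRARY})
endif()
```

A FindTcmalloc.cmake module would be the more polished route, but the find_library call above captures the essential step.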