ovis-hpc / ovis

OVIS/LDMS High Performance Computing monitoring, analysis, and visualization project.

Home Page: https://github.com/ovis-hpc/ovis-wiki/wiki

transient sets aggregated with push=onchange not leaving L1 after deletion on L0

baallan opened this issue

The pid sampler (linux-proc-sampler) creates sets for the pids it is notified of and clears those sets when the monitored pid goes away. It appears that roughly a third of these sets (388 of 1200) never get cleaned up on the L1 aggregator, even after 1300 seconds.
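
For context, the L0 side looks roughly like the sketch below; the attribute names and values (producer, instance_prefix, stream) are my shorthand for a typical linux_proc_sampler setup, not the exact site config, so check them against the plugin man page.

# L0 (compute node) ldmsd config sketch -- illustrative only, not the exact site config
load name=linux_proc_sampler
config name=linux_proc_sampler producer=${HOSTNAME} instance_prefix=${HOSTNAME}/pid stream=slurm
start name=linux_proc_sampler interval=1000000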

I will run more diagnostics to see whether this looks like #771 at the L1.

The symptom is repeatable, but somewhat arbitrary:

  • Run the pid sampler on 20 nodes.
  • Launch 36 ranks per node of mpiGraph using mpiexec or srun.
  • The job completes; all tracked pids are gone.
  • For 2 arbitrary nodes (which nodes varies from run to run), the 36 local pid-related sets are deleted by the sampler but hang in the deleting state according to set_stats.
  • Also for those same nodes, the messages announcing that the pids are gone are never delivered to the L1, even though all such messages are published at L0 before the corresponding set_deletes are performed. This is just plain weird, and in fact all subsequent message deliveries (e.g. new pid appearances) from the same 2 nodes also fail to reach the L1.

A look with gdb shows that all threads are in epoll_wait at what looks like a 'normal' place in their respective task loops.
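
For the record, the stacks were captured along these lines (a sketch; the pidof lookup is just illustrative):

# Attach to the stuck ldmsd and dump every thread's stack
gdb -batch -ex 'thread apply all bt' -p $(pidof ldmsd)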

It appears this issue has been resolved by #1153.

Actually, it's still found in further testing...
@nichamon @tom95858 what's the correct new incantation to enable debugging messages for just the set management?

@nichamon
My launch of mpiGraph is:

#! /bin/bash
#SBATCH --time=60:00
#SBATCH -N 20
srun --mpi=pmi2 -N 20 -n $((36*20)) mpiGraph 16384 100 50

where each of the 20 nodes has dual 18-core processors, hence 36 * 20 tasks.
The daemon setup is:

  • A spank plugin delivers notifications of new pids to the ldmsd on each compute node.
  • The compute-node ldmsds are all aggregated by a single ldmsd on an admin node (a config sketch follows this list).
  • The compute-node ldmsds create and delete sets as pids come and go, driven by slurm (linux_proc_sampler).
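
A rough sketch of the L1 (admin node) aggregation config implied by the issue title; the host, port, transport, and intervals are placeholders rather than the actual site values:

# L1 (admin node) ldmsd config sketch -- placeholders only
# one prdcr_add per compute node; only one is shown here
prdcr_add name=c1x8 host=c1x8 port=411 xprt=sock type=active interval=1000000
prdcr_start_regex regex=c.*
updtr_add name=all interval=1000000 offset=100000 push=onchange
updtr_prdcr_add name=all regex=c.*
updtr_start name=all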

For some subset of the nodes, something happens such that deleting_count is permanently elevated:

c1x8: Name                 Count
c1x8: -------------------- ----------------
c1x8: active_count                       15
c1x8: deleting_count                    109
c1x8: mem_total_kb                    16384
c1x8: mem_free_kb                     15452
c1x8: mem_used_kb                       932
c1x8: set_load                            0
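
For anyone reproducing this, those counters come from the set_stats command; a minimal sketch of the query, with placeholder host, port, and transport, and auth options omitted:

# Query set accounting on one compute-node ldmsd (host/port/xprt are placeholders)
echo set_stats | ldmsd_controller --host c1x8 --port 411 --xprt sock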

@baallan Thanks for the info!

I assume that no L2 is aggregating from the L1.

Correct

@tom95858

Also for those same nodes, the messages that the pids are gone do not get delivered to the L1, even though all such messages are published at L0 before the corresponding set_deletes are performed. This is just plain weird and in fact all subsequent message deliveries (e.g. new pid appearance) on the same 2 nodes also do not reach L1.

This observation made me think that the root cause may not be in the delete-and-push path, but rather in the transport path. Did you and Ben reproduce this today? What do you think?

@nichamon with the new logging stuff, what's the syntax to turn on logging for just the transport (and maybe the set management) code but no sampler plugin logging?
Or do we still need a recompile with extra flags for transport?

With the top of OVIS-4, everything is still the same regarding changing log levels. I'm incrementally creating patches of the refactored code.

When all the refactored code is in the tree, you will do 'loglevel regex=xprt.* level=DEBUG' to turn on the DEBUG messages of all transport layers, i.e., ldmsd, ldms, and Zap.
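
So, as a sketch (connection arguments are placeholders; the first line is my assumption of the existing loglevel form, and the regex form only works once the refactor lands):

# Today: one daemon-wide log level (assumed current syntax)
echo 'loglevel level=DEBUG' | ldmsd_controller --host c1x8 --port 411 --xprt sock

# After the refactor: DEBUG scoped to the transport layers only
echo 'loglevel regex=xprt.* level=DEBUG' | ldmsd_controller --host c1x8 --port 411 --xprt sock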

@baallan I'm making a patch and will create a pull request within 1-2 hours from now. I'll tag you when I have it for you to test.

@nichamon @tom95858 This problem also reproduces with QLogic InfiniPath_QLE7340 adapters on toss3 (the hardware on our tlcc2 systems), so the sets stuck in the deleting state are not specific to Omni-Path RDMA. The dev system here with that hardware is called btaco. The version running is near the top of tree (sum 7dfaa).

@baallan Thanks for the info. I'll try to send you a patch to get more diagnostic messages, or we can set up a session to work on it together, hopefully next week.

@nichamon if you have a branch with more debugging stuff to try (possibly including additional -D flags at compile time), let me know.