we have a slow leak

Question

jaredmales opened this issue 8 months ago · comments

Now that we have a fairly stable system and have our software framework up and running for weeks to months at a time, it's clear we have a slow memory leak somewhere.

There is no reason at all that koolanceCtrl should be holding 5 GB of RAM.

For reference

[jrmales@exao3 config]$ ps -o etime= -p 1341779 
20-06:27:05

The biggest suspect is of course the INDI subsystem. The other subsystem that is constantly running is the logger/telemetry system.

Jared R. Males · Answer 1 · Wed Dec 13 2023 03:08:55 GMT+0800 (China Standard Time)

With valgrind I found one problem in IndiConnection, caused by my fix to the output code.

472 bytes in 1 blocks are still reachable in loss record 6 of 12
==160426==    at 0x4848899: malloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==160426==    by 0x4F2CF62: fdopen@@GLIBC_2.2.5 (iofdopen.c:122)
==160426==    by 0x1637E8: pcf::IndiConnection::setOutputFd(int const&) (IndiConnection.cpp:419)

See d054988

Joseph D. Long · Answer 2 · Wed Dec 13 2023 03:12:03 GMT+0800 (China Standard Time)

Huh, I'm confused how that leaks. I guess glibc allocates for the fdopen and needs an fclose to deallocate? Sneaky

Joseph D. Long · Answer 3 · Wed Dec 13 2023 03:14:43 GMT+0800 (China Standard Time)

Oh it's allocating the FILE struct I got it

Jared R. Males · Answer 4 · Wed Dec 13 2023 03:16:29 GMT+0800 (China Standard Time)

yeah something in FILE is getting a malloc, and if you do fdopen again without fclose, that hangs around.

To be clear, I don't think this is causing the GB/week.
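The fdopen pattern described above can be sketched as follows. This is not the change made in d054988; it is a minimal illustration, assuming the stream is held in a single member and replaced each time a new descriptor is set. The class name OutputStream and the member m_fstream are hypothetical.

```cpp
#include <stdio.h>
#include <unistd.h>

// Hypothetical stand-in for the stream-handling part of a connection class.
class OutputStream
{
public:
    OutputStream() = default;
    OutputStream(const OutputStream &) = delete;
    OutputStream &operator=(const OutputStream &) = delete;

    ~OutputStream()
    {
        closeStream();
    }

    // Wrap fd in a FILE stream. Returns false if fdopen fails.
    // On success the stream owns fd: fclose will also close it.
    bool setOutputFd(int fd)
    {
        closeStream(); // release the previous FILE before allocating a new one

        m_fstream = fdopen(fd, "w");
        return m_fstream != nullptr;
    }

    FILE *stream() const
    {
        return m_fstream;
    }

private:
    void closeStream()
    {
        if(m_fstream != nullptr)
        {
            fclose(m_fstream); // flushes, frees the FILE, closes its descriptor
            m_fstream = nullptr;
        }
    }

    FILE *m_fstream{nullptr};
};
```

Because fclose also closes the underlying descriptor, a caller that wants to keep its own descriptor open would pass dup(fd) rather than fd itself.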

Jared R. Males · Answer 5 · Wed Dec 13 2023 04:51:26 GMT+0800 (China Standard Time)

Further valgrind testing with magaoxMaths isn't showing anything interesting. There is some stuff happening with pthreads that leaves still-reachable blocks at exit, but no actual leaks.

the maths demos are not telemeters but they do log.

The main offenders are all tty users. That needs to be checked!

Jared R. Males · Answer 6 · Wed Dec 13 2023 12:39:09 GMT+0800 (China Standard Time)

Found it! Or at least some of it. Testing on koolanceCtrl, a user of tty::usb -> udev, we get

==3640043== LEAK SUMMARY:
==3640043==    definitely lost: 2,560 bytes in 20 blocks
==3640043==    indirectly lost: 102,384 bytes in 1,912 blocks
==3640043==      possibly lost: 0 bytes in 0 blocks
==3640043==    still reachable: 8 bytes in 1 blocks
==3640043==         suppressed: 0 bytes in 0 blocks

Jared R. Males · Answer 7 · Wed Dec 13 2023 15:01:22 GMT+0800 (China Standard Time)

udev issues fixed by 49dcd5d
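The thread does not say what 49dcd5d changed. As a general illustration only, leaks from reference-counted C handles (libudev's udev_unref, udev_enumerate_unref and udev_device_unref all follow this shape) can be avoided by tying the unref call to scope exit. The sketch below uses a fake handle type so that it is self-contained; FakeDev, fake_dev_new, fake_dev_unref and liveHandles are hypothetical and stand in for the real library.

```cpp
#include <memory>

// --- Hypothetical stand-in for a C library with unref-style cleanup ---
struct FakeDev
{
    int id;
};

static int liveHandles = 0; // counts handles not yet released

FakeDev *fake_dev_new()
{
    ++liveHandles;
    return new FakeDev{liveHandles};
}

// Same shape as libudev's unref functions: takes the handle, returns a pointer.
FakeDev *fake_dev_unref(FakeDev *dev)
{
    --liveHandles;
    delete dev;
    return nullptr;
}

// --- Generic owner: calls Unref when the handle goes out of scope ---
template <typename T, T *(*Unref)(T *)>
struct UnrefDeleter
{
    void operator()(T *p) const
    {
        if(p != nullptr)
        {
            Unref(p);
        }
    }
};

template <typename T, T *(*Unref)(T *)>
using Handle = std::unique_ptr<T, UnrefDeleter<T, Unref>>;

// With libudev the same alias would be used as, for example:
//   Handle<udev, udev_unref> ctx(udev_new());
```

With this wrapper an early return or an exception between acquiring and releasing a handle no longer skips the unref call, which is a common way for "definitely lost" blocks to appear.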

valgrind branch merged with dev, now being tested on ICC

Jared R. Males · Answer 8 · Mon Dec 18 2023 12:44:19 GMT+0800 (China Standard Time)

After 5 days it looks like the problem is fixed.