Parsl / parsl

Parsl - a Python parallel scripting library

Home Page: http://parsl-project.org

Potential memory leaks when using ThreadPoolExecutor

d33bs opened this issue · comments

Describe the bug

Thank you for the excellent work on this project! When running Parsl's ThreadPoolExecutor over looped app iterations, I've noticed that uncollectable objects often remain in the garbage. This appears to cause a memory leak, consuming more and more memory on the system where the memory would otherwise be deallocated. Many of these objects are relatively small, with the exception of python_app dictionaries which appear to hold the parameters passed to the app when it is invoked and resolved through .result(). In scenarios where the values passed to these parameters are large and/or complex (such as nested dictionary or list structures), the memory leakage can be especially significant.

I've approached testing for this using gc.set_debug(gc.DEBUG_LEAK) in order to assess what the Python garbage collector is unable to collect. When performing these tests, I've found that pure Python and HighThroughputExecutor processing of simple python_app contents show no uncollected garbage after completing. ThreadPoolExecutor, on the other hand, does show uncollected garbage objects after completion.
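For reference, a stripped-down version of that test looks roughly like the following (a simplified sketch rather than the exact gist code; the multiply app and the loop bounds are illustrative):

import gc

import parsl
from parsl import python_app
from parsl.config import Config
from parsl.executors import ThreadPoolExecutor

# flag and retain anything the collector finds, so it can be inspected afterwards
gc.set_debug(gc.DEBUG_LEAK)

parsl.load(Config(executors=[ThreadPoolExecutor()]))

@python_app
def multiply(a, b):  # stand-in for the gist's workload
    return a * b

# looped app iterations, resolving each result before starting the next
for i in range(10):
    multiply(i, i).result()

parsl.dfk().cleanup()  # shut down the loaded DataFlowKernel

gc.collect()
print(len(gc.garbage))  # non-zero with ThreadPoolExecutor, zero with HighThroughputExecutor in my tests

Swapping ThreadPoolExecutor for HighThroughputExecutor in the Config is the comparison described above.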

Some of this behavior can be observed in charts from a project where we're implementing Parsl (see charts comparing looped iterations using ThreadPoolExecutor and HighThroughputExecutor). The reproduction of this issue is better isolated below.

Please don't hesitate to let me know if I'm approaching memory analysis incorrectly or in a way that is unsuitable to how Parsl operates.

To Reproduce

See the following gist for more details on reproducing the items mentioned above.
https://gist.github.com/d33bs/ff7daa1ea99237b0fb02b7bd95f95997

Expected behavior

My hope is that we can find a way to match ThreadPoolExecutor's behavior with that of HighThroughputExecutor, or to outline / document workarounds (such as manual garbage collection after Parsl app runs are completed) for scenarios where this matters.

Environment

  • OS: MacOS
  • Python version: Python 3.10
  • Parsl version: 2023.9.11

Distributed Environment

  • Where are you running the Parsl script from ? Laptop/Workstation
  • Where do you need the workers to run ? Same as Parsl script

ok, I can recreate that in my dev environment.

it looks like there are additional references "somewhere" that aren't being seen by the garbage collector, but are included in the object reference count, which is why those objects are reported as garbage-but-not-collectable, I think.

I'm unclear where those would be coming from.

I'll poke around some more.

I dug into this some more.

The objects which this code returns in gc.garbage will be collected by a regular garbage collection when that happens: in your script, they're kept around because of gc.DEBUG_LEAK - or more specifically, the implied gc.DEBUG_SAVEALL which keeps garbage around instead of freeing it.

As an example, remove the initial set_debug and do this after calling the python app:


print(multiply(5, 9).result())  # run the app and wait for its result

gc.collect()                    # first collection: frees the accumulated cyclic garbage
gc.set_debug(gc.DEBUG_LEAK)     # only now start flagging/saving garbage

collected = gc.collect()        # second collection: should find nothing left

You should see the first collection garbage-collect all the objects, and the second collection no longer sees any garbage (at least that's what it looks like for me).

So, for whatever reason, the Python garbage collector happens not to run automatically here - running it manually (without set_debug(DEBUG_SAVEALL)) should cause that garbage to go away, and maybe you can test that out in your larger original application code.

If that turns out to be what's going on in your original application, then I think it's not really the job of Parsl to be trying to override when Python's garbage collection happens: as a user you get to fiddle with that as you want.

Related to this, we saw similar garbage collection behaviour inside htex workers, and there is an article about that here:
https://parsl-project.org/2023/08/31/Debugging-LightningBug.html

Thanks @benclifford for the response and the link to the article here! I found the content in the article thought provoking.

Generally we've found that HighThroughputExecutor works beautifully, and that holds in the memory leak experiments shared within this issue (as well as in applied scenarios more broadly). Do you know if that team was able to find which specific objects were leaking (whether they came from Parsl itself, somewhere else in the code, or perhaps the hardware/platform architecture and memory handling)?

On the chance that Python garbage collection debugging is acting up, I went ahead and created two more experiments in the earlier gist I shared that turn off gc.set_debug(gc.DEBUG_LEAK) entirely and go solely by the return value from gc.collect() to indicate unreachable objects. Here again I found that ThreadPoolExecutor shows a non-zero count of unreachable objects whereas HighThroughputExecutor shows zero. I updated the readme with some notes and added the related files to the same gist.

As an aside for clarification, and depending on how you feel about the above findings, do you think this is a bug with Python garbage collection operations or documentation more generally?

I think this isn't a "leak" or a "bug" - it's the garbage collector not running as often as you hope it would. Some run of the garbage collector will collect (and report, if you turn on the right options) all the task-relevant structures. Add a hook into gc.callbacks https://docs.python.org/3/library/gc.html#gc.callbacks and you'll be able to see all the runs of the gc, both those launched by Python and manually invoked ones. What I think will happen (but I haven't tried it) is that with htex, you'll see the garbage collector running much more, collecting objects before you reach your manual garbage collection call. With your thread-local test code, I don't see the garbage collector run by Python at all, probably because that code path is not very allocation-intensive.
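For example, a minimal hook might look like this (a sketch using the standard gc.callbacks API, not code taken from the gist):

import gc

def report_gc_run(phase, info):
    # invoked at the start and stop of every collection,
    # whether triggered automatically by Python or via gc.collect()
    if phase == "stop":
        print(f"gc generation {info['generation']}: "
              f"collected={info['collected']}, uncollectable={info['uncollectable']}")

gc.callbacks.append(report_gc_run)

With that in place, automatic collector runs become visible alongside any manual gc.collect() calls.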

In the case of the blog post, I think we concluded that the memory use was due to some very large application-specific data structures - a mis-assumption the author started out with was that memory allocated in a task would be released at least by the time the task ends, but that is not at all how Python garbage collection operates.

Thank you @benclifford ! I added new Parsl memory experiments to the gist in order to explore gc.callbacks as you suggested.

The results followed what you mentioned about the HTEX executor running garbage collection more frequently than the TPE. In both cases, when using gc.callbacks, a large non-zero value was returned from gc.collect(). In contrast, the information reported from the callback indicated 0 uncollectable objects in both cases. I also added a double check for this by printing the len of gc.garbage, which also shows 0 for both cases.

I don't know for sure, but this makes me feel that enabling gc.callbacks functionality somehow changes the operation of garbage collection (perhaps in a similar way to gc.set_debug(gc.DEBUG_LEAK)), or that using function references through the callback somehow incurs different rates of garbage collection.

The number coming back from gc.collect is not the number of uncollectable objects:

The number of unreachable objects found is returned.

It's the number of collected objects, approximately. (Unless you have DEBUG_LEAK set)

If you call gc.collect twice in a row with no other activity, the second call should return 0, for example.
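A quick standalone illustration of that (not Parsl-specific):

import gc

first = gc.collect()   # number of unreachable objects found (and, normally, collected) in this run
second = gc.collect()  # with no new allocations in between, this should be 0
print(first, second)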

I don't think anything in what you've shown is uncollectable objects - just objects collected later than you are expecting.