Memory consumption
robertdj opened this issue
Hi!
I am running into memory issues when using Ploomber on large(-ish) datasets. Kinda hard to reproduce, so I'll start with a description. Let me know if more is needed.
When running a Ploomber pipeline, the memory consumption of the Python process keeps increasing throughout -- as if it does not release memory after finishing a step.
Sometimes the Python process is killed with an "Out of Memory" exception during a step.
If I then re-run the pipeline (such that it starts with the step where it was just killed) it works w/o errors.
Is this a behavior that others recognize? Can I do anything on my end to circumvent this?
Thanks!
What kind of tasks are you using (functions, notebooks)? And are you changing any settings in the executor?
I'm only using functions. I use these settings:
```yaml
executor:
  dotted_path: ploomber.executors.Serial
  build_in_subprocess: false
```
Quite a basic pipeline: load data, wrangle it, fit a model & save it, make predictions with the model. There is also an env.yaml file with configurations.
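For context, a rough sketch of what such a function-based pipeline.yaml might look like (task names and product paths here are illustrative, not from the actual project):

```yaml
# hypothetical layout of a function-based Ploomber pipeline.yaml
tasks:
  - source: tasks.load_data
    product: output/raw.parquet
  - source: tasks.wrangle
    product: output/clean.parquet
  - source: tasks.fit_model
    product: output/model.pickle
  - source: tasks.predict
    product: output/predictions.parquet
```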
This might be a Python problem. Python does not offer much flexibility for managing memory manually. One quick thing you can do is to set `build_in_subprocess` to `True`. This way Ploomber will shut down the process running the function and release all the memory.
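Applied to the executor section shared above, that would be:

```yaml
executor:
  dotted_path: ploomber.executors.Serial
  build_in_subprocess: true
```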
If you keep it as `False`, then you might need to release memory manually. For example, here are some tips on releasing memory used by pandas: https://stackoverflow.com/questions/39100971/how-do-i-release-memory-used-by-a-pandas-dataframe
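A minimal sketch of what that manual cleanup could look like inside one of the task functions (the upstream key, column name, and file formats are made up):

```python
import gc

import pandas as pd


def wrangle(product, upstream):
    """Hypothetical Ploomber function task that drops its DataFrame when done."""
    df = pd.read_parquet(str(upstream["load_data"]))

    # placeholder transformation
    result = df.groupby("some_column").sum()
    result.to_parquet(str(product))

    # Drop references and ask the garbage collector to reclaim the memory,
    # along the lines of the Stack Overflow answer linked above.
    del df, result
    gc.collect()
```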
Let me try that option -- thanks!
I'm exclusively using Polars for ETL (only switching to Pandas when modelling packages require it). I don't know if that makes a difference.
Cool, in that case you might want to look at the Polars documentation; maybe there is something there about memory management.
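Not a Ploomber feature, but for reference, a rough sketch of how Polars' lazy/streaming API can keep peak memory down (file and column names are made up, and whether `collect(streaming=True)` is available depends on the Polars version):

```python
import polars as pl

# Lazy scanning avoids loading the full file into memory up front;
# the streaming engine processes the query in batches where possible.
lazy_frame = pl.scan_parquet("output/raw.parquet")

result = (
    lazy_frame
    .filter(pl.col("value") > 0)
    .group_by("category")
    .agg(pl.col("value").sum())
    .collect(streaming=True)  # may be spelled differently in newer Polars releases
)
```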