ploomber / ploomber

The fastest ⚡️ way to build data pipelines. Develop iteratively, deploy anywhere. ☁️

Home Page: https://docs.ploomber.io

Memory consumption

robertdj opened this issue · comments

Hi!

I am running into memory issues when using Ploomber on large(-ish) datasets. Kinda hard to reproduce, so I'll start with a description. Let me know if more is needed.

When running a Ploomber pipeline, the memory consumption of the Python process keeps increasing throughout -- as if it does not release memory after finishing a step.
Sometimes the Python process is killed with an "Out of Memory" exception during a step.
If I then re-run the pipeline (such that it starts with the step where it was just killed) it works w/o errors.

Is this a behavior that others recognize? Can I do anything on my end to circumvent this?

Thanks!

What kind of tasks are you using (functions, notebooks)? And are you changing any settings in the executor?

I'm only using functions. I use these settings:

executor:
  dotted_path: ploomber.executors.Serial
  build_in_subprocess: false 

Quite basic pipeline: Load data, wrangle it, fit a model & save it, make predictions with model.

There is also an env.yaml file with configurations.

This might be a Python problem.

Python does not offer much flexibility for managing memory manually. One quick thing you can do is set build_in_subprocess to true. This way, Ploomber runs the function in a subprocess and shuts that process down afterwards, which releases all the memory it used.
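To illustrate why a separate process helps, here is a generic sketch using only the standard library. It is not how Ploomber implements this; the step source and the helper name `run_step_in_child` are made up for the example. The large list lives only in the child interpreter, so its memory goes back to the operating system when the child exits, and the parent only ever holds the small result.

```python
import subprocess
import sys

# Hypothetical step: allocates a large list but prints only a small result.
STEP_SOURCE = "data = list(range(1_000_000)); print(sum(data))"


def run_step_in_child(source: str) -> str:
    """Run `source` in a fresh Python interpreter and return its stdout."""
    completed = subprocess.run(
        [sys.executable, "-c", source],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the child fails
    )
    return completed.stdout.strip()


# The parent process never holds the large list, only the printed result.
result = run_step_in_child(STEP_SOURCE)
```

The trade-off is that inputs and outputs have to cross the process boundary (here as text, in a real pipeline usually as files on disk), which is a natural fit when each step already saves its product.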

If you keep it as False, then you might need to release memory manually. For example, here are some tips on releasing memory used by pandas: https://stackoverflow.com/questions/39100971/how-do-i-release-memory-used-by-a-pandas-dataframe
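A minimal sketch of releasing memory by hand inside a function task: drop the last reference to the large object, then ask the garbage collector to run. The function name `wrangle` and the data are made up for illustration. Whether the process actually hands the memory back to the operating system depends on the allocator, so treat this as something to try rather than a guarantee.

```python
import gc


def wrangle(rows):
    """Hypothetical pipeline step: builds a large intermediate, returns a small result."""
    intermediate = [r * 2 for r in rows]  # stands in for a large temporary object
    total = sum(intermediate)

    # Drop the only reference so the object becomes collectable,
    # then force a collection pass (this also cleans up reference cycles).
    del intermediate
    gc.collect()

    return total
```

The same pattern applies to a DataFrame: keep only what the next step needs, `del` the rest, and call `gc.collect()` before returning.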

Let me try that option -- thanks!

I'm exclusively using Polars for ETL (only switching to Pandas when modelling packages require it). I don't know if that makes a difference.

Cool. In that case, you might want to look at the Polars documentation; maybe there is something about memory management.