Memory consumption

Question

robertdj opened this issue a year ago · comments

Robert Dahl Jacobsen commented a year ago

Hi!

I am running into memory issues when using Ploomber on large(-ish) datasets. Kinda hard to reproduce, so I'll start with a description. Let me know if more is needed.

When running a Ploomber pipeline, the memory consumption of the Python process increases throughout, as if it does not release memory after finishing a step.
Sometimes the Python process is killed with an "Out of Memory" exception during a step.
If I then re-run the pipeline (such that it starts with the step where it was just killed) it works w/o errors.

Is this a behavior that others recognize? Can I do anything on my end to circumvent this?

Thanks!

Eduardo Blancas · Answer 1 · Thu May 11 2023 22:15:26 GMT+0800 (China Standard Time)

What kind of tasks are you using (functions, notebooks)? And are you changing any settings in the executor?

Robert Dahl Jacobsen · Answer 2 · Sat May 13 2023 15:30:52 GMT+0800 (China Standard Time)

I'm only using functions. I use these settings:

executor:
  dotted_path: ploomber.executors.Serial
  build_in_subprocess: false

Quite basic pipeline: Load data, wrangle it, fit a model & save it, make predictions with model.

There is also an env.yaml file with configurations.

Eduardo Blancas · Answer 3 · Thu May 18 2023 00:03:04 GMT+0800 (China Standard Time)

this might be a Python problem.

Python does not offer much flexibility for managing memory manually. One quick thing you can do is to set build_in_subprocess to True in the executor settings. That way Ploomber runs the function in a separate process and shuts that process down when the function finishes, which releases all the memory it used.
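
To illustrate why running a step in its own process helps, here is a minimal sketch using only the standard library. It is not Ploomber's implementation; `STEP_SOURCE` and `run_step_in_child` are hypothetical names made up for this example. The idea is simply that the operating system reclaims everything a child process allocated once it exits, so nothing accumulates in the parent.

```python
import subprocess
import sys

# Hypothetical stand-in for a memory-hungry pipeline step (not Ploomber code).
STEP_SOURCE = """
data = [float(i) for i in range(1_000_000)]  # large temporary allocation
print(sum(data))
"""


def run_step_in_child(source: str) -> str:
    """Run `source` in a fresh Python process and return its stdout."""
    completed = subprocess.run(
        [sys.executable, "-c", source],
        capture_output=True,
        text=True,
        check=True,  # raise if the child fails, e.g. if it gets killed
    )
    # Once the child has exited, the OS has reclaimed all of its memory,
    # so the step's allocations never live in the parent process.
    return completed.stdout.strip()


result = run_step_in_child(STEP_SOURCE)
```

The trade-off is that results have to cross the process boundary (here via stdout, in a real pipeline usually via files on disk), and starting a process per step adds some overhead.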

If you keep it as False, then you might need to release memory manually. For example, here are some tips on releasing memory used by pandas: https://stackoverflow.com/questions/39100971/how-do-i-release-memory-used-by-a-pandas-dataframe
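
As a rough sketch of what releasing memory manually can look like when everything runs in one process: drop the last reference to the large object and ask the garbage collector to run. `Table` is a hypothetical stand-in for a large DataFrame, and the weak reference is only there to observe that the object is gone.

```python
import gc
import weakref


class Table:
    """Hypothetical stand-in for a large DataFrame."""

    def __init__(self, n_rows: int) -> None:
        self.rows = list(range(n_rows))


raw = Table(100_000)
alive = weakref.ref(raw)  # lets us check later whether the object still exists

summary = sum(raw.rows)  # keep only the small result we actually need

del raw       # drop the last strong reference to the large object
gc.collect()  # additionally reclaims objects caught in reference cycles

freed = alive() is None  # True once the Table has been deallocated
```

Note that freeing Python objects does not guarantee that the process hands the memory back to the operating system right away, so the memory usage reported for the process can stay high even after this.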

Robert Dahl Jacobsen · Answer 4 · Fri May 19 2023 18:39:02 GMT+0800 (China Standard Time)

Let me try that option -- thanks!

I'm exclusively using Polars for ETL (only switching to Pandas when modelling packages require it). I don't know if that makes a difference.

Eduardo Blancas · Answer 5 · Sat May 20 2023 00:19:56 GMT+0800 (China Standard Time)

Cool. In that case, you might want to look at the Polars documentation; maybe there is something about memory management.