Netflix / metaflow

:rocket: Build and manage real-life ML, AI, and data science projects with ease!

Home Page: https://metaflow.org


Batch job in foreach loop: GPU memory issue

simonhs87 opened this issue · comments

commented

Hi!
I'm running an LLM with 7B parameters on AWS Batch using the @batch decorator.

When I run a prediction task on the LLM with a single branch of my foreach loop, it works. But if I spawn the full foreach loop, I run into a GPU OOM, which is most likely due to Batch hosting multiple containers on the same instance. Is there any way to control (via Metaflow) that it spawns a separate instance for each job in the loop?

Cheers,
simon
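
For reference, here is a minimal sketch of the kind of flow described above (the flow, step, and artifact names are hypothetical). Requesting `gpu=1` per task via the `@batch` decorator tells AWS Batch how much GPU each container needs; whether each task actually lands on its own instance depends on the instance types in the Batch compute environment, which Metaflow does not control directly.

```python
# Hypothetical sketch: one foreach branch per input batch, each branch
# requesting a full GPU from AWS Batch via @batch(gpu=1, ...).
from metaflow import FlowSpec, batch, step


class PredictFlow(FlowSpec):

    @step
    def start(self):
        # One foreach branch per input batch (placeholder data).
        self.prompts = ["batch-0", "batch-1", "batch-2"]
        self.next(self.predict, foreach="prompts")

    @batch(gpu=1, memory=60000)  # per-task resource request on AWS Batch
    @step
    def predict(self):
        # self.input is the current foreach item; the 7B model would run here.
        self.result = f"prediction for {self.input}"
        self.next(self.join)

    @step
    def join(self, inputs):
        self.results = [i.result for i in inputs]
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    PredictFlow()
```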

commented

Closing, since the GPU memory requested is provisioned per instance, so co-located containers are not sharing it. The error can instead be explained by a slight difference in input token size between branches.
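
If differing input token sizes are indeed the cause, capping the tokenized input length makes per-branch memory use predictable. A minimal sketch, assuming a Hugging Face tokenizer (the model id and prompt list are placeholders; the 7B model in the issue is unspecified):

```python
# Hypothetical sketch: truncate and pad inputs to a fixed length so
# activation memory is bounded identically across foreach branches.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder model id
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default

texts = ["example prompt one", "example prompt two"]  # placeholder prompts
inputs = tokenizer(
    texts,
    truncation=True,        # drop tokens beyond max_length
    max_length=1024,        # fixed cap -> bounded memory per branch
    padding="max_length",   # every branch sees the same input shape
    return_tensors="pt",
)
```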