Batch job in foreach loop: GPU memory issue
simonhs87 opened this issue · comments
SHS commented
Hi!
I'm running an LLM with 7B parameters on AWS Batch using the @batch decorator.
When I run a prediction task on the LLM with a single branch of my foreach loop, it works.
But if I spawn the full foreach loop I run into a GPU OOM, which is most likely due to Batch hosting multiple containers on the same instance. Is there any way to control (via Metaflow) that it spawns a separate instance for each job in the loop?
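One common way to keep Batch from packing several foreach tasks onto one GPU instance is to request a whole GPU (and enough memory) per task in the step decorator, so the scheduler must place each task on its own GPU. A minimal sketch, assuming Metaflow's `@batch(gpu=..., memory=..., cpu=...)` decorator; the flow name, shard list, and step bodies are illustrative placeholders, and actually running it requires a configured AWS Batch compute environment:

```python
from metaflow import FlowSpec, batch, step


class PredictFlow(FlowSpec):
    @step
    def start(self):
        # Illustrative shards; in practice, each entry would be a chunk of inputs.
        self.shards = [0, 1, 2, 3]
        self.next(self.predict, foreach="shards")

    # Requesting a full GPU and generous memory per task means AWS Batch
    # cannot co-locate two of these tasks on a single-GPU instance.
    @batch(gpu=1, memory=32000, cpu=8)
    @step
    def predict(self):
        self.shard = self.input
        # Placeholder for loading the 7B model and running inference on this shard.
        self.next(self.join)

    @step
    def join(self, inputs):
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    PredictFlow()
```

If the compute environment only offers single-GPU instance types, `gpu=1` alone effectively guarantees one task per instance; `@resources(gpu=1)` can be used instead if the same flow should also run locally or on Kubernetes.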
Cheers,
simon
SHS commented
Closing, since GPU memory is provisioned per instance as requested, so co-located containers don't share it. The error can instead be explained by slight differences in input token size between branches.