Batch job in foreach loop: GPU memory issue
simonhs87 opened this issue · comments
SHS commented
Hi!
I'm running an LLM with 7B parameters on AWS Batch using the @batch decorator.
When I run a prediction task on the LLM with a single branch of my foreach loop, it works.
But if I spawn the full foreach loop I run into a GPU OOM, which is most likely due to Batch hosting multiple containers on the same instance. Is there any way to control (via Metaflow) that it spawns a separate instance for each job in the loop?
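One common way to keep Batch from packing several foreach tasks onto one GPU instance is to request a whole GPU (and enough memory) per task in the step decorator, so the scheduler must place each task on its own GPU. A minimal sketch, assuming Metaflow's `@batch(gpu=..., memory=..., cpu=...)` decorator; the flow name, shard list, and step bodies are illustrative placeholders, and actually running it requires a configured AWS Batch compute environment:

```python
from metaflow import FlowSpec, batch, step


class PredictFlow(FlowSpec):
    @step
    def start(self):
        # Illustrative shards; in practice, each entry would be a chunk of inputs.
        self.shards = [0, 1, 2, 3]
        self.next(self.predict, foreach="shards")

    # Requesting a full GPU and generous memory per task means AWS Batch
    # cannot co-locate two of these tasks on a single-GPU instance.
    @batch(gpu=1, memory=32000, cpu=8)
    @step
    def predict(self):
        self.shard = self.input
        # Placeholder for loading the 7B model and running inference on this shard.
        self.next(self.join)

    @step
    def join(self, inputs):
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    PredictFlow()
```

If the compute environment only offers single-GPU instance types, `gpu=1` alone effectively guarantees one task per instance; `@resources(gpu=1)` can be used instead if the same flow should also run locally or on Kubernetes.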
Cheers,
simon
SHS commented
Closing, since GPU memory is provisioned per instance as requested, so co-located containers don't share it. The error can instead be explained by slight differences in input token size between branches.