lyogavin / Anima

33B Chinese LLM, DPO QLORA, 100K context, AirLLM 70B inference with single 4GB GPU

Optimize for consumer GPUs, e.g. 11GB or 16GB

profintegra opened this issue · comments

I'm not sure it makes sense to load more than one layer from a performance standpoint, but using only 1.6GB out of the 11GB/16GB on a typical consumer GPU is not optimal (and super slow).

I've read on Hugging Face that it doesn't make sense to load more layers because only one layer is evaluated at a time. But maybe we could split the model into bigger chunks (several layers each), so that a whole chunk is loaded and evaluated at once? See the sketch below.
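To illustrate what I mean, here is a minimal sketch of chunked execution. It is not AirLLM's actual code: `load_layer` is a hypothetical helper standing in for however AirLLM materializes one layer's weights from disk.

```python
import torch

def run_layers_chunked(load_layer, hidden_states, num_layers, chunk_size=4):
    """Run a layer-by-layer model in chunks of `chunk_size` layers.

    `load_layer(i)` is a hypothetical helper that returns layer i as an
    nn.Module with weights on CPU (however AirLLM reads them from disk).
    Loading several layers per transfer should amortize the disk->GPU
    overhead and use more of the available VRAM.
    """
    for start in range(0, num_layers, chunk_size):
        end = min(start + chunk_size, num_layers)
        # Move a whole chunk of layers onto the GPU in one go.
        chunk = [load_layer(i).to("cuda") for i in range(start, end)]
        with torch.no_grad():
            for layer in chunk:
                hidden_states = layer(hidden_states)
        # Free the chunk's VRAM before loading the next chunk.
        del chunk
        torch.cuda.empty_cache()
    return hidden_states
```

Note the layers still run sequentially, so this doesn't reduce compute; the hope is that fewer, larger disk-to-GPU transfers beat many small ones.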

I could even try doing this myself; it would be nice to get a bit of guidance, though: is it even feasible, where in the code should the optimization happen, etc.