kjerk / instructblip-pipeline

A multimodal inference pipeline that integrates InstructBLIP with textgen-webui for Vicuna and related models.

Implementing outside of oobabooga

unoriginalscreenname opened this issue

I was super disappointed when I tried to run InstructBLIP out of the box and it just didn't work. I saw your comment on the LAVIS repo and really appreciated you trying to make this work!

I was wondering if this could be implemented or adapted outside of oobabooga? I wasn't able to get it running in there to test it out.
I was getting an error and, to be honest, didn't spend a lot of time digging into it.

I'm working on an image-captioning project for archival photos. I've implemented Salesforce/blip2-opt-2.7b locally and it's quite good! However, I would love to explore using LLMs, particularly the Wizard models, to add additional user-guided detail to these images. The code to get blip2-opt running is amazingly trivial. The oobabooga API, however, has given me some problems.
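For reference, the kind of short BLIP-2 captioning code being described looks roughly like this with the stock transformers API (the image path and generation settings are placeholders, not anything from this thread):

import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b",
    device_map={"": 0},
    torch_dtype=torch.float16,
)

image = Image.open("archival_photo.jpg")  # placeholder input file
inputs = processor(images=image, return_tensors="pt").to(model.device, torch.float16)
generated_ids = model.generate(**inputs, max_new_tokens=50)
caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
print(caption)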

Is it possible to implement InstructBLIP outside of oobabooga without a lot of hassle? Or are you piggybacking off of their ExLlama and other model-loading infrastructure?

So I'm piggybacking off of ooba's existing multimodal pipeline implementation and the GPTQ-for-LLaMa loader already running in it. That way you can write an adapter (this repo) that produces the embeddings, and the webui does the rest, handling the quantized model loading and inference.
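For anyone unfamiliar with that split, a toy sketch of the contract (all names and sizes made up, nothing copied from this repo): the adapter only has to produce image embeddings sized to the LLM's hidden state, and the host webui splices them into the prompt's token embeddings before running its quantized model.

import torch

hidden_size = 5120        # e.g. a 13B Vicuna-family model
num_image_embeds = 32     # InstructBLIP's Q-Former emits 32 query tokens per image

# What the adapter produces: image embeddings projected into the LLM's hidden-state space.
image_embeds = torch.randn(1, num_image_embeds, hidden_size)
# What the host already has: the prompt's token embeddings from the (quantized) LLM.
text_embeds = torch.randn(1, 20, hidden_size)

# The host replaces the image placeholder tokens with the image embeddings and
# feeds the combined sequence to the language model via inputs_embeds.
inputs_embeds = torch.cat([image_embeds, text_embeds], dim=1)
print(inputs_embeds.shape)  # torch.Size([1, 52, 5120])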

It might be possible to implement this with llama.cpp, but I haven't looked into it too deeply since I'm primarily an oobabooga user.

I think Hugging Face just added quantized inference to transformers, so maybe it will be coming natively to transformers (which is what Salesforce implemented BLIP against)?

Yeah, oobabooga does a lot of heavy lifting with all the quantized loading; it's such a mess. I'll look into whether I can use the multimodal pipeline via the API. I haven't dug into that at all.

They fixed this on the Hugging Face side now. I was able to run it with their built-in 4-bit or 8-bit loading.

import torch
from transformers import InstructBlipForConditionalGeneration

# Load the InstructBLIP checkpoint with bitsandbytes 4-bit quantization on GPU 0.
self.model = InstructBlipForConditionalGeneration.from_pretrained(
    self.source_model,
    device_map={"": 0},
    load_in_4bit=True,
    torch_dtype=torch.bfloat16,
)
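For completeness, a self-contained sketch of what the surrounding inference code can look like with the stock transformers API; the checkpoint name, prompt, and image path are examples rather than anything confirmed from the snippet above:

import torch
from PIL import Image
from transformers import InstructBlipProcessor, InstructBlipForConditionalGeneration

source_model = "Salesforce/instructblip-vicuna-7b"  # example checkpoint
processor = InstructBlipProcessor.from_pretrained(source_model)
model = InstructBlipForConditionalGeneration.from_pretrained(
    source_model,
    device_map={"": 0},
    load_in_4bit=True,
    torch_dtype=torch.bfloat16,
)

image = Image.open("archival_photo.jpg")  # placeholder input file
prompt = "Describe this photograph in detail."
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False, repetition_penalty=1.5)
print(processor.batch_decode(outputs, skip_special_tokens=True)[0].strip())

The repetition penalty value is taken from Hugging Face's own InstructBLIP documentation example; it is one common knob for the looping behaviour mentioned below, though not a cure-all.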

Were you able to get this working with other models? The Vicuna model is unstable, and I find it will sometimes get stuck in a loop generating nonsense.

Well, that's good news; I'm glad it's not stuck with AutoGPTQ only.
In the README I listed multiple other models that also work, under the Tested Working Models section.

Models in the familial line of Vicuna seem to work just fine: as long as the hidden state size matches, the embeddings will work (wizard-vicuna didn't work, for instance). nous-hermes-13b was pretty interesting.
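A quick way to sanity-check that hidden-size match before downloading a candidate model, assuming its config is published on the Hub (the model IDs below are only examples):

from transformers import AutoConfig

# InstructBLIP's projected image embeddings must match the target LLM's hidden size,
# or the spliced embeddings won't line up.
instructblip_cfg = AutoConfig.from_pretrained("Salesforce/instructblip-vicuna-13b")
candidate_cfg = AutoConfig.from_pretrained("NousResearch/Nous-Hermes-13b")  # example candidate

print(instructblip_cfg.text_config.hidden_size)  # 5120 for the 13B Vicuna base
print(candidate_cfg.hidden_size)                 # needs to match for the embeddings to work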