MDK8888 / GPTFast

Accelerate your Hugging Face Transformers 7.6-9x. Native to Hugging Face and PyTorch.

Help to understand

apirrone opened this issue · comments

Hi!

I don't quite understand how this project works. I guess my main question is: what is a draft model?

For example, I would like to speed up the inference of OwlViT (https://huggingface.co/google/owlvit-base-patch32), which I use through the transformers library. Can I do that with GPTFast?

Thanks!

Hey, apologies for the late response! Vision Transformers are not used for text generation, so they are not supported at this moment.

As for your question about speculative decoding, it essentially uses a smaller draft model to predict the outputs of the larger model. Hope that this is helpful!
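For example, here is a rough sketch of what that pairing looks like with the gpt_fast entry point (the call mirrors the snippet further down in this thread; the import path, argmax sampler, and device handling below are assumptions to double-check against the README). The key point is that draft_model_name should be a much smaller model that shares the big model's tokenizer, not the same checkpoint again:

import torch
from transformers import AutoTokenizer
from GPTFast.Core import gpt_fast  # import path assumed from the README; verify against your installed version

device = "cuda" if torch.cuda.is_available() else "cpu"

def argmax(probabilities):
    # greedy sampling: always take the highest-probability token
    # (check the README for the exact sample_function signature gpt_fast expects)
    return torch.argmax(probabilities, dim=-1).unsqueeze(0)

model_name = "gpt2-xl"     # the big model whose outputs you actually want
draft_model_name = "gpt2"  # a much smaller model with the same tokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name)
input_tokens = tokenizer.encode("Write me a short story.", return_tensors="pt").to(device)

gpt_fast_model = gpt_fast(model_name, draft_model_name=draft_model_name, sample_function=argmax)
gpt_fast_model.to(device)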

Hi, same question.
What exactly is the draft_model_name?
If I would like to speed up Llama 7B, and I have stored the checkpoints locally, say in a folder named 'llama-7b', then what should my model_name and draft_model_name be?

Thanks

Having the same issue. For example, this just loads the zephyr model twice, so I'm not sure I am using this properly:

...
model_name = "HuggingFaceH4/zephyr-7b-beta"
draft_model_name = "HuggingFaceH4/zephyr-7b-beta"

tokenizer = AutoTokenizer.from_pretrained(model_name)
initial_string = "Write me a short story."
input_tokens = tokenizer.encode(initial_string, return_tensors="pt").to(device)

N_ITERS = 10
MAX_TOKENS = 50

gpt_fast_model = gpt_fast(model_name, draft_model_name=draft_model_name, sample_function=argmax)
gpt_fast_model.to(device)

@MDK8888

Take a look at this video, where Horace He explains the concept nicely:
https://www.youtube.com/watch?v=18YupYsH5vY&t=1935s

In essence, you use a smaller model that is an order of magnitude faster, let it run for a number of steps, and then check its results in parallel with the big model (checking several tokens costs almost the same as generating one, because decoding is memory-bound on the weights).
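If you want to see that "costs almost the same as one token" point for yourself, here is a rough timing sketch with plain transformers (gpt2 is just a stand-in; the effect is much clearer for a 7B model on a GPU than for a tiny model on CPU):

import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
tokenizer = AutoTokenizer.from_pretrained("gpt2")
prompt_ids = tokenizer.encode("Explain what a rainbow is.", return_tensors="pt")

for k in (1, 5):
    with torch.no_grad():
        # prefill the KV cache with the prompt
        past = model(prompt_ids, use_cache=True).past_key_values
        # score k candidate tokens in a single forward pass
        candidates = torch.randint(0, model.config.vocab_size, (1, k))
        start = time.perf_counter()
        model(candidates, past_key_values=past, use_cache=True)
        print(f"checked {k} token(s) in {time.perf_counter() - start:.4f}s")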

@andreas-solti thanks for the response. I watched the video and understood that the draft model is basically a smaller model that should be relatively fast, so the actual model only has to verify and select the best tokens, as explained in the video.
The condition I could understand is that both models should use the same tokenizer. The smaller one could be a quantized version of the original model.
Also, as explained in the video, he mentioned Llama 7B with TinyLlama as the smaller model. When I tried that combination I ran into errors and couldn't get it running. I tried a couple of other model combinations as well; only gpt2-xl and gpt2 work fine, as mentioned in this repo.

So I really need to understand the criteria for selecting the smaller/draft model.
Thanks!

Yes, you need a draft model that is compatible with the tokens, but as Horace mentioned in his presentation, you are also free to use something else that is able to predict the next tokens fast. He said it is possible to use a trigram model, for example (in theory you can use any model, but you'd have to convert its output to token ids compatible with what your big model expects).
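A quick way to sanity-check a candidate pairing before wiring it into GPTFast is to compare the two tokenizers directly (the model names here are just an example pair):

from transformers import AutoTokenizer

main_tok = AutoTokenizer.from_pretrained("gpt2-xl")
draft_tok = AutoTokenizer.from_pretrained("gpt2")

# the draft's token ids must mean the same thing to the big model
print("identical vocab:", main_tok.get_vocab() == draft_tok.get_vocab())
print("same encoding of a sample string:",
      main_tok.encode("A rainbow is") == draft_tok.encode("A rainbow is"))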

Think of this example: you have the prompt-context "Explain what a rainbow is." and the model is now in the decoding phase. If your small model predicts "A rainbow is a beautiful phenomenon...", you can feed the corresponding tokens into your big model in parallel. Then you check whether the big model's outputs also correspond to the predicted next tokens. If they do not, you need to recompute based on the output the big model generated.
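Here is a bare-bones, greedy-only sketch of that accept/reject step written against plain transformers rather than this repo's internals (it skips the probabilistic acceptance rule and the KV caching of the full algorithm, and uses the gpt2-xl / gpt2 pair mentioned above):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

big = AutoModelForCausalLM.from_pretrained("gpt2-xl").eval()
draft = AutoModelForCausalLM.from_pretrained("gpt2").eval()  # same tokenizer as gpt2-xl
tok = AutoTokenizer.from_pretrained("gpt2-xl")

ids = tok.encode("Explain what a rainbow is.", return_tensors="pt")
K = 5  # tokens proposed by the draft per round

with torch.no_grad():
    for _ in range(4):  # a few speculative rounds
        # 1) the cheap draft model greedily proposes K tokens, one at a time
        proposal = ids
        for _ in range(K):
            next_id = draft(proposal).logits[:, -1:].argmax(-1)
            proposal = torch.cat([proposal, next_id], dim=-1)
        drafted = proposal[:, ids.shape[1]:]

        # 2) the big model scores all K drafted positions in ONE forward pass
        logits = big(proposal).logits
        big_choice = logits[:, ids.shape[1] - 1:-1].argmax(-1)

        # 3) accept drafted tokens up to the first disagreement, then fall back
        #    to the big model's own token at that position
        n_accept = int((big_choice == drafted)[0].long().cumprod(0).sum())
        ids = torch.cat([ids, drafted[:, :n_accept],
                         big_choice[:, n_accept:n_accept + 1]], dim=-1)

print(tok.decode(ids[0]))

GPTFast wires up roughly this accept/reject logic for you on top of its other optimizations; it is also why the draft model's token ids have to line up with the big model's vocabulary.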

Good luck. Also try asking ChatGPT about the error messages you get.