token usage
jiangts opened this issue · comments
Maybe this is a dumb question, but if you decompose the schema into K queries to extract values in parallel, don't you have to repeat the text input K times? Or do you have a way around this?
Great question!
You do have to repeat the text K times, but the key point is that all K queries are batched into as few forward passes as possible.
If you're paying per token (e.g. using OpenAI), you will pay more for using this technique; some people are happy to pay more for lower latencies. If you're running open-source models on your own GPUs, there is no additional cost: inference is likely memory-bandwidth bound, so you have unutilized FLOPs, and the extra queries ride along in the same forward passes. You get lower latencies for free.
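To make the trade-off concrete, here is a minimal sketch of the decomposition: one extraction query per schema field, all issued concurrently so the batching layer can fold them into shared forward passes. The `extract_field` stub below is a hypothetical stand-in for a real model call (any function name, fields, and example text here are illustrative, not from this project):

```python
import asyncio

async def extract_field(text: str, field: str) -> tuple[str, str]:
    # Hypothetical stand-in for a real model call. In practice this would
    # hit an inference server, which batches concurrent requests together;
    # the sleep simulates per-query model latency.
    await asyncio.sleep(0.01)
    return field, f"<value of {field!r} extracted from text>"

async def extract_schema(text: str, fields: list[str]) -> dict[str, str]:
    # One query per field: the text prompt is repeated K times, but all K
    # requests are in flight at once, so wall-clock latency is roughly that
    # of a single query rather than K sequential ones.
    results = await asyncio.gather(*(extract_field(text, f) for f in fields))
    return dict(results)

doc = "Jane Doe, born 1985, lives in Oslo."
schema = ["name", "birth_year", "city"]
print(asyncio.run(extract_schema(doc, schema)))
```

Token cost scales with K (the shared text is resent per query), while latency stays near that of one query — which is exactly the trade described above.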