token usage
jiangts opened this issue · comments
Maybe this is a dumb question, but if you decompose the schema into K queries to extract values in parallel, don't you have to repeat the text input K times? Or do you have a way around this?
Great question!
You do have to repeat the text K times, but the key point is that all K queries are batched into as few forward passes as possible.
If you're paying per token (e.g. using OpenAI), you will pay more for using this technique; some people are happy to pay more for lower latencies. If you're running open-source models on your own GPUs, there is no additional cost: inference is likely memory-bandwidth bound, so you have unutilized FLOPs, and the extra queries ride along in the same forward passes. You get lower latencies for free.
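To make the trade-off concrete, here is a minimal sketch of the decomposition: one extraction query per schema field, all issued concurrently so the batching layer can fold them into shared forward passes. The `extract_field` stub below is a hypothetical stand-in for a real model call (any function name, fields, and example text here are illustrative, not from this project):

```python
import asyncio

async def extract_field(text: str, field: str) -> tuple[str, str]:
    # Hypothetical stand-in for a real model call. In practice this would
    # hit an inference server, which batches concurrent requests together;
    # the sleep simulates per-query model latency.
    await asyncio.sleep(0.01)
    return field, f"<value of {field!r} extracted from text>"

async def extract_schema(text: str, fields: list[str]) -> dict[str, str]:
    # One query per field: the text prompt is repeated K times, but all K
    # requests are in flight at once, so wall-clock latency is roughly that
    # of a single query rather than K sequential ones.
    results = await asyncio.gather(*(extract_field(text, f) for f in fields))
    return dict(results)

doc = "Jane Doe, born 1985, lives in Oslo."
schema = ["name", "birth_year", "city"]
print(asyncio.run(extract_schema(doc, schema)))
```

Token cost scales with K (the shared text is resent per query), while latency stays near that of one query — which is exactly the trade described above.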