OpenNMT / CTranslate2

Fast inference engine for Transformer models

Home Page: https://opennmt.net/CTranslate2

target_prefix latency

SimonBenhamou opened this issue

Hello,

I noticed that when supplying a target_prefix to the translate_batch or generate_tokens method, the latency for generating the supplied tokens is the same as when they are not provided, while I would expect it to be negligible because those tokens should not require any generation steps. I would expect the first generation step to produce the token that follows the prefix.
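For reference, a minimal sketch of this kind of timing comparison (the model path and token sequences below are placeholders) simply wraps translate_batch with and without target_prefix:

```python
import time

import ctranslate2

# Placeholder model path and pre-tokenized inputs; adjust to your own setup.
translator = ctranslate2.Translator("ende_ctranslate2/", device="cpu")
source = [["▁Hello", "▁world", "!"]]
prefix = [["▁Hallo", "▁Welt"]]  # tokens that should not need any generation step

start = time.perf_counter()
translator.translate_batch(source)
print("without prefix:", time.perf_counter() - start)

start = time.perf_counter()
translator.translate_batch(source, target_prefix=prefix)
print("with prefix:   ", time.perf_counter() - start)
```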

Am I missing something, or is this due to an inefficiency in ctranslate2's generation logic?

Thanks,
Simon

If you specify target_prefix, the prefix should be decoded in a single step, and the remaining tokens are then generated one by one. Without target_prefix, every token is generated one by one. In theory, it should run faster with target_prefix. Could you test with a long prefix?
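One way to test this, with a placeholder model and a repeated placeholder prefix token, is to time translate_batch while growing the prefix:

```python
import time

import ctranslate2

translator = ctranslate2.Translator("ende_ctranslate2/", device="cpu")
source = [["▁Hello", "▁world", "!"]]

# Repeat a placeholder token to build increasingly long prefixes.
for n in (0, 8, 32, 128):
    prefix = [["▁Hallo"] * n] if n else None
    start = time.perf_counter()
    translator.translate_batch(source, target_prefix=prefix)
    print(f"prefix length {n:3d}: {time.perf_counter() - start:.3f} s")
```

If the prefix were decoded in a single forward pass, the total time should grow much more slowly with the prefix length than with the number of freely generated tokens.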

I did, and can confirm that:

  • no matter how long the prefix, the generation time is the same
  • when using the generate_tokens method and measuring per-token latency (as in the sketch below), the generation time is the same for the prefix tokens as for the subsequent tokens
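To see where the time goes, the per-step latency can be measured by timing each result yielded by generate_tokens (again with a placeholder model and tokens); if the prefix were consumed in a single pass, the prefix tokens should come back almost instantly compared to the generated ones:

```python
import time

import ctranslate2

translator = ctranslate2.Translator("ende_ctranslate2/", device="cpu")
source = ["▁Hello", "▁world", "!"]
prefix = ["▁Hallo", "▁Welt"]

last = time.perf_counter()
for result in translator.generate_tokens(source, target_prefix=prefix):
    now = time.perf_counter()
    print(f"{result.token!r}: {1000 * (now - last):.1f} ms")
    last = now
```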