build_token_enforcer_tokenizer_data takes too long on some tokenizer
gsolard opened this issue · comments
Gautier Solard commented
When using the function build_token_enforcer_tokenizer_data on the bloom tokenizer (https://huggingface.co/bigscience/bloom), it takes a really long time to finish.
The problem comes from this loop:
https://github.com/noamgat/lm-format-enforcer/blob/7cfb693495e9ba3305e230cc62e05b383b6c717a/lmformatenforcer/tokenizerprefixtree.py#L72
One decoded token in the bloom tokenizer has length 600, and several have lengths greater than 100 (compared to a maximum of 16 for llama, for example), so the double loop takes a very long time to complete.
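To illustrate the cost being described (a minimal sketch, not the library's actual code): inserting a decoded token into a character prefix tree costs one node visit per character, so a handful of 600-character tokens dominate the build time relative to tokens of length 16 or less.

```python
class TrieNode:
    """Minimal character-trie node (illustrative, not lm-format-enforcer's class)."""
    def __init__(self):
        self.children = {}

def add_token(root, token_str):
    """Insert one decoded token; returns the number of node visits (== len(token_str))."""
    node = root
    steps = 0
    for ch in token_str:
        node = node.children.setdefault(ch, TrieNode())
        steps += 1
    return steps

root = TrieNode()
short_cost = add_token(root, "x" * 16)   # llama-like token length
long_cost = add_token(root, "y" * 600)   # bloom outlier token
print(short_cost, long_cost)  # 16 600
```

With ~250k tokens in the bloom vocabulary, even a modest fraction of long tokens multiplies the total work substantially.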
Noam Gat commented
A PR that will soon be merged allows limiting the max string length in this section, which will improve performance.
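The mitigation mentioned above could look roughly like the following (a hedged sketch: the cap value and the filtering helper are assumptions for illustration, not the merged PR's actual API):

```python
# Hypothetical sketch: skip decoded token strings longer than a configurable
# cap when building the prefix tree, so rare very-long tokens (e.g. the
# 600-character bloom token) no longer dominate the build time.
MAX_TOKEN_STR_LENGTH = 100  # assumed cap, not the library's actual default

def filtered_tokens(decoded_tokens, max_len=MAX_TOKEN_STR_LENGTH):
    """Keep only token strings short enough to insert cheaply."""
    return [t for t in decoded_tokens if len(t) <= max_len]

tokens = ["short", "x" * 600, "medium" * 10]
kept = filtered_tokens(tokens)
print(len(kept))  # 2: the 600-character token is dropped
```

Dropping over-long tokens trades a small loss of coverage (those tokens can no longer be matched by the enforcer's prefix tree) for a large reduction in build time.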