build_token_enforcer_tokenizer_data takes too long on some tokenizer
gsolard opened this issue · comments
Gautier Solard commented
When using the function build_token_enforcer_tokenizer_data on the bloom tokenizer (https://huggingface.co/bigscience/bloom), it takes a really long time to finish.
The problem comes from this loop:
https://github.com/noamgat/lm-format-enforcer/blob/7cfb693495e9ba3305e230cc62e05b383b6c717a/lmformatenforcer/tokenizerprefixtree.py#L72
One decoded token in the bloom tokenizer has length 600, and several have lengths greater than 100 (compared to a maximum of 16 for llama, for example), so the double loop takes a very long time to complete.
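To illustrate the cost being described (a minimal sketch, not the library's actual code): inserting a decoded token into a character prefix tree costs one node visit per character, so a handful of 600-character tokens dominate the build time relative to tokens of length 16 or less.

```python
class TrieNode:
    """Minimal character-trie node (illustrative, not lm-format-enforcer's class)."""
    def __init__(self):
        self.children = {}

def add_token(root, token_str):
    """Insert one decoded token; returns the number of node visits (== len(token_str))."""
    node = root
    steps = 0
    for ch in token_str:
        node = node.children.setdefault(ch, TrieNode())
        steps += 1
    return steps

root = TrieNode()
short_cost = add_token(root, "x" * 16)   # llama-like token length
long_cost = add_token(root, "y" * 600)   # bloom outlier token
print(short_cost, long_cost)  # 16 600
```

With ~250k tokens in the bloom vocabulary, even a modest fraction of long tokens multiplies the total work substantially.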
Noam Gat commented
A PR that will soon be merged allows limiting the max string length in this section, which will improve performance.
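The mitigation mentioned above could look roughly like the following (a hedged sketch: the cap value and the filtering helper are assumptions for illustration, not the merged PR's actual API):

```python
# Hypothetical sketch: skip decoded token strings longer than a configurable
# cap when building the prefix tree, so rare very-long tokens (e.g. the
# 600-character bloom token) no longer dominate the build time.
MAX_TOKEN_STR_LENGTH = 100  # assumed cap, not the library's actual default

def filtered_tokens(decoded_tokens, max_len=MAX_TOKEN_STR_LENGTH):
    """Keep only token strings short enough to insert cheaply."""
    return [t for t in decoded_tokens if len(t) <= max_len]

tokens = ["short", "x" * 600, "medium" * 10]
kept = filtered_tokens(tokens)
print(len(kept))  # 2: the 600-character token is dropped
```

Dropping over-long tokens trades a small loss of coverage (those tokens can no longer be matched by the enforcer's prefix tree) for a large reduction in build time.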