noamgat / lm-format-enforcer

Enforce the output format (JSON Schema, Regex etc) of a language model


build_token_enforcer_tokenizer_data takes too long on some tokenizers

gsolard opened this issue · comments

When using the function build_token_enforcer_tokenizer_data with the bloom tokenizer (https://huggingface.co/bigscience/bloom), it takes a really long time to finish.

The problem comes from this loop:

for min_remaining in range(self.max_token_len + 1):

Indeed, one decoded token in the bloom tokenizer has length 600, and several have lengths greater than 100 (compared to a maximum of 16 for llama, for example), so the double loop takes very long to complete.
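To illustrate the scaling, here is a toy model of the iteration count (not the library's actual code), assuming the inner loop over min_remaining runs max_token_len + 1 times for each token in the vocabulary, and using approximate vocabulary sizes (llama ~32k tokens, bloom ~250k tokens) together with the decoded token lengths reported above:

```python
def count_inner_iterations(vocab_size: int, max_token_len: int) -> int:
    """Toy estimate of how many inner-loop steps the preprocessing does:
    one pass over min_remaining in range(max_token_len + 1) per token."""
    total = 0
    for _token in range(vocab_size):
        for _min_remaining in range(max_token_len + 1):
            total += 1
    return total

# llama-like: ~32k vocab, longest decoded token ~16 chars
llama_cost = count_inner_iterations(32_000, 16)
# bloom-like: ~250k vocab, longest decoded token ~600 chars
bloom_cost = count_inner_iterations(250_000, 600)

print(llama_cost)                # 544_000
print(bloom_cost)                # 150_250_000
print(bloom_cost / llama_cost)   # ~276x more inner-loop steps
```

Under these assumptions the bloom case does a few hundred times more work than the llama case, which matches the slowdown described in the report.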