microsoft / onnxruntime-genai

Generative AI extensions for onnxruntime

TokenizerStream returns incomplete strings

jeremyfowers opened this issue · comments

The documented behavior of the TokenizerStream.decode() method is: "If a displayable string has accumulated, this method returns it. If not, this method returns the empty string."

However, I am seeing it return a lot of incomplete strings. Here is an example output:

 is
 a
 line
 from
 a
 popular
 rom
antic
 song
 that

This is a strange output for two reasons:

  1. The word "romantic" is broken up into two strings, "rom" and "antic".
  2. There is a leading space in front of most strings.

Is this the expected behavior? While streaming text to my user, is my app supposed to parse this back into a stream of whole words?

From my past experience with Huggingface's TextIteratorStreamer, I would expect whole words with no leading spaces.

A few things are going on here, from what I can tell (guessing based on the pattern, without looking at your code):

  • By default, the Python print function adds a newline at the end of your message to stdout. This is why each token appears on its own line. Try using print(tokenizer_stream.decode(new_token), end='', flush=True) instead.
  • The leading space is a token itself. It is an output of the tokenizer_stream.decode(new_token) call.
  • The word "romantic" seems to be broken down into the two tokens rom and antic. This most likely comes from the vocab in use, which is also why you do not see the leading space on that line.

Using print(tokenizer_stream.decode(new_token), end='', flush=True) should help with all three points above.
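
For reference, the full loop I have in mind looks roughly like this (a minimal sketch based on the onnxruntime-genai Python examples; the model path is a placeholder and some method names may differ between releases):

    import onnxruntime_genai as og

    # Hypothetical local model folder; substitute your own.
    model = og.Model("./phi-3-mini-4k-instruct")
    tokenizer = og.Tokenizer(model)
    tokenizer_stream = tokenizer.create_stream()

    params = og.GeneratorParams(model)
    params.set_search_options(max_length=100)
    params.input_ids = tokenizer.encode("Tell me about a popular romantic song.")

    generator = og.Generator(model, params)
    while not generator.is_done():
        generator.compute_logits()
        generator.generate_next_token()
        new_token = generator.get_next_tokens()[0]
        # end='' suppresses print's default trailing newline;
        # flush=True pushes each decoded piece to stdout immediately.
        print(tokenizer_stream.decode(new_token), end='', flush=True)
    print()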

Hi @baijumeswani, thanks for the quick reply! I agree that your solution would work well if I were just trying to print streaming text to the screen.

However, my use case is to stream the words over a WebSocket (using FastAPI) to a client. The client code is expecting the text to be in the same format used by Huggingface's TextIteratorStreamer, since we developed against that standard. I am now trying to port our server to use ORT-GenAI instead of Huggingface Transformers.

The leading space is a token itself. It is an output of the tokenizer_stream.decode(new_token) call.

This is not what I am observing. In my example in the description, " is", " a", " line", etc. are all return values from tokenizer_stream.decode(new_token). The leading space is not a discrete return value.

TextIteratorStreamer does not have this problem; it returns just the words, and it is safe to join the words with spaces on the client.

The word, "romantic", seems to be broken down to the two tokens rom and antic. This most likely comes from the vocab in use. Which is why you do not see the leading space on that line.

How do I know if a word is incomplete or not before I send it over my websocket? Do you have any helper code that determines whether a tokenizer_stream.decode(new_token) is complete or not? TextIteratorStreamer does not have this problem - even if it produces incomplete strings internally, it only returns complete strings to the user.

Any help you can provide with either of these differences from TextIteratorStreamer would be very helpful!

The leading space is a token itself. It is an output of the tokenizer_stream.decode(new_token) call.

This is not what I am observing. In my example in the description, " is", " a", " line", etc. are all return values from tokenizer_stream.decode(new_token). The leading space is not a discrete return value.

I think I was mistaken. I am guessing the token contains the leading space in this case.

How do I know if a word is incomplete or not before I send it over my websocket? Do you have any helper code that determines whether a tokenizer_stream.decode(new_token) is complete or not? TextIteratorStreamer does not have this problem - even if it produces incomplete strings internally, it only returns complete strings to the user.

Could you point me to the relevant documentation for TextIteratorStreamer that suggests the output text can be joined with spaces? From the API doc, it seems you can iterate over it, but the example does not use any " " when combining the texts into generated_text.
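
For concreteness, the pattern that example shows is roughly the following (a sketch adapted from the transformers documentation; the model name is just an illustration):

    from threading import Thread
    from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    streamer = TextIteratorStreamer(tok)

    inputs = tok(["An increasing sequence: one,"], return_tensors="pt")
    # generate() runs on a background thread while this thread consumes the streamer.
    generation_kwargs = dict(**inputs, streamer=streamer, max_new_tokens=20)
    Thread(target=model.generate, kwargs=generation_kwargs).start()

    generated_text = ""
    for new_text in streamer:
        generated_text += new_text  # pieces are concatenated directly, no " " inserted

Note that each new_text chunk already carries its own spacing, which is why the example concatenates them without inserting separators.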

TextIteratorStreamer does not have this problem; it returns just the words, and it is safe to join the words with spaces on the client.

How do you deal with scenarios where a space does not make sense, for example between the last word of a sentence and the punctuation that follows it?

Here is some example output from TextIteratorStreamer:

nah, 
there 
is 
no 
legal 
obligation 
on 
a 
truck 
driver 
to 
drive 
off

It has two major properties different from tokenizer_stream:

  1. No leading or trailing spaces on any return value. It just returns the words.
  2. Long words like "obligation" are in a single return value, not broken up across two return values.

This is the behavior I am trying to achieve. I oversimplified a bit with "join with spaces", but you can see that the way the client app has to parse the result must be quite different between TextIteratorStreamer and tokenizer_stream, because they format their return values so differently.

I would be very surprised if this was not expected behavior. I have not seen an open source model yet that doesn't do it this way. This is why we consider the token/word ratio to be around 1.5 tokens per word.

I am curious what model you are using today that streams only whole words and not tokens.

And I'm also betting you are adding extra whitespace in your current implementation, and it's just invisible to you (because HTML collapses multiple spaces in a row and renders only a single space).

I've been using LLaMA-2-7b and Phi-3-Mini and I'm sure you're right that internally those models are streaming ~1.5 tokens/word.

However, my question is not about what the model is doing internally; rather, I am asking whether OG's streaming decode is expected to behave similarly to Huggingface's.

It seems as though Huggingface's is doing some internal buffering to make sure that it only returns complete words, while OG's is not expected to do the same kind of internal buffering.
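
For what it's worth, the kind of buffering I have in mind could be approximated on top of tokenizer_stream with something like this (a hypothetical helper, not part of either library; it relies on the heuristic that this vocab marks a word boundary with a leading space, so punctuation and newlines would need extra handling):

    def buffered_words(pieces):
        """Hypothetical sketch: re-buffer tokenizer_stream.decode() output so
        only whole words are yielded, approximating TextIteratorStreamer."""
        buffer = ""
        for piece in pieces:
            if piece.startswith(" ") and buffer:
                # A new leading space proves the buffered word is complete.
                yield buffer
                buffer = ""
            buffer += piece.lstrip(" ")
        if buffer:
            yield buffer

    # e.g. [" is", " a", " rom", "antic", " song"] -> "is", "a", "romantic", "song"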