Gemma tokenizer issue
rudro opened this issue
There seems to be Gemma support (https://github.com/ml-explore/mlx-swift-examples/blob/main/Libraries/LLM/Gemma.swift); however, the tokenizer library (when using llm-tool) throws an unsupportedTokenizer error, as it has not been updated to support Gemma yet. Is there a way to use the Gemma model with a different tokenizer?
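For reference, this is roughly how the failure surfaces when loading the tokenizer (the model id here is just an example):

```swift
import Tokenizers // swift-transformers

// The Gemma configs name "GemmaTokenizer" as the tokenizer_class, which
// swift-transformers doesn't know how to construct yet, so loading throws
// before any model code runs.
let tokenizer = try await AutoTokenizer.from(pretrained: "google/gemma-2b-it")
// throws: unsupportedTokenizer
```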
You could control which tokenizer gets instantiated -- the code is here: https://github.com/ml-explore/mlx-swift-examples/blob/main/Libraries/LLM/Util.swift#L16
But I don't know if you can mix and match tokenizers like that. Ultimately I think it needs to be supported here: https://github.com/huggingface/swift-transformers
I looked, and although the model is ported, all of the configurations on Hugging Face use GemmaTokenizer.
I will look and see if a normal tokenizer can be used -- in Keras it looks like just a standard BPE tokenizer (I think).
I did try simply adding it as a BPE tokenizer by extending the list here: https://github.com/huggingface/swift-transformers/blob/main/Sources/Tokenizers/Tokenizer.swift#L251, but that did not work -- there do seem to be some real differences (you get some array-out-of-bounds issues and such; I did not look deeply).
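For anyone following along, that change amounts to routing Gemma's tokenizer_class string to the existing BPE implementation -- roughly like this (illustrative names and shape only, not the library's exact code):

```swift
// Illustrative sketch of the tokenizer-class table in swift-transformers'
// Tokenizers/Tokenizer.swift; the real one maps the tokenizer_class string
// from tokenizer_config.json to a concrete implementation.
let tokenizerClasses: [String: Tokenizer.Type] = [
    "GPT2Tokenizer": BPETokenizer.self,
    // ... existing entries ...
    "GemmaTokenizer": BPETokenizer.self, // the addition that didn't work
]
```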
Yes, it looks like it fails handling this token: "▁ह ै"
The code does this (excerpted from the BPE tokenizer):

```swift
for (i, item) in merges.enumerated() {
    let tuple = item.split(separator: " ").map { String($0) }
    let bp = BytePair(tuple: tuple)
```

It expects to be able to split on a space, but there is no space in this token.
Ah, but Python thinks there is:

```python
>>> s = "▁ह ै"
>>> s
'▁ह ै'
>>> s.split(" ")
['▁ह', 'ै']
```
Swift thinks not:

```
(lldb) po item.map { $0 }
▿ 3 elements
  - 0 : "▁"
  - 1 : "ह"
  - 2 : " ै"
```
https://unicodedecode.com also thinks there is a space:
![image](https://private-user-images.githubusercontent.com/46639364/307919494-e128b6c3-ed9e-4577-81a5-160dd8544c39.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTkxNjQyMjksIm5iZiI6MTcxOTE2MzkyOSwicGF0aCI6Ii80NjYzOTM2NC8zMDc5MTk0OTQtZTEyOGI2YzMtZWQ5ZS00NTc3LTgxYTUtMTYwZGQ4NTQ0YzM5LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDA2MjMlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwNjIzVDE3MzIwOVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTY3ZGUwNGY4Y2E0OGViNjY0OWRkNTJjNjgxYjY0YWE5YjVhNThmN2ZlOGI3ODI5YWFkOTg4NDZhMmYyZmNiMWYmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.2kfNzx9On3YyDcB7pQJPFKKQAyqBXmAd45zZguuJ7EA)
OK, I added a commit where it will use a BPETokenizer and load the Gemma safetensors file. I don't think the tokenizer is working correctly, though.
As for the splitting of tokens, swift-transformers needs to use String.UnicodeScalarView.split(separator:) so the split happens on Unicode scalars (as in Python) rather than on grapheme clusters. For now I have a workaround that discards the merge tokens it would crash on, but ultimately it needs that fix.
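A self-contained illustration of the difference, using the merge entry from above:

```swift
let item = "▁ह ै" // scalars: U+2581, U+0939, U+0020, U+0948 (a combining vowel sign)

// Character-based split: the space and the combining mark U+0948 coalesce into
// the single grapheme cluster " ै", so Swift finds no standalone space and
// returns the string unsplit.
let byCharacter = item.split(separator: " ").map { String($0) }
// byCharacter == ["▁ह ै"]

// Scalar-based split matches Python's code-point semantics.
let byScalar = item.unicodeScalars
    .split(separator: " ")
    .map { String($0) }
// byScalar == ["▁ह", "ै"]
```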
OK, this works better now -- the RMSNorm wasn't the correct implementation.
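For anyone hitting the same thing: my understanding is that Gemma applies its learned RMSNorm scale as (1 + weight) rather than weight, which is exactly the kind of mismatch that yields plausible-but-wrong output. A plain-Swift sketch of the two variants under that assumption (not the MLX code from the commit):

```swift
import Foundation

func rmsNorm(_ x: [Float], weight: [Float], eps: Float = 1e-6,
             gemmaStyle: Bool) -> [Float] {
    // normalize by the root mean square of the input
    let meanSquare = x.reduce(Float(0)) { $0 + $1 * $1 } / Float(x.count)
    let scale = 1 / (meanSquare + eps).squareRoot()
    return zip(x, weight).map { value, w in
        // Gemma stores the learned scale as an offset from 1, so apply (1 + w);
        // a standard RMSNorm applies w directly.
        value * scale * (gemmaStyle ? 1 + w : w)
    }
}
```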
A few things to note: given the prompt "hello", the Python implementation generates a different token sequence:

```
[2, 106, 1645, 108, 17534, 107, 108, 106, 2516, 108]
```

which decodes to:

```
<bos><start_of_turn>user
hello<end_of_turn>
<start_of_turn>model
```
and produces:

```
Hello! 👋 It's wonderful to hear from you. How can I assist you today?
```
Swift, given the same "hello" prompt, produces these tokens:

```
[2, 17534]
```

which gives:

```
! 👋
I'm so happy to hear that you're interested in learning more about [topic]. I'm here to help you find the information you need and answer any questions you may have.
What would you like to learn more about today?<eos><eos>The context does not provide any information about what the topic is, so I cannot answer this question from the provided context.<eos><eos><eos>The context does not provide any information about what the topic is, so I cannot answer this
```
It should have stopped at <eos>, but as noted in #4 the tokenizer doesn't expose that (I'll see what I can do about that).
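Once the id is exposed, the fix is just a check in the decode loop -- a minimal sketch with illustrative names (not the llm-tool API):

```swift
// Hypothetical decode loop: `nextToken` samples the next token id given the
// context so far, and generation halts as soon as the EOS id appears instead
// of rambling past it.
func generate(prompt: [Int], eosTokenId: Int, maxTokens: Int = 256,
              nextToken: ([Int]) -> Int) -> [Int] {
    var tokens = prompt
    for _ in 0..<maxTokens {
        let next = nextToken(tokens)
        if next == eosTokenId { break } // stop at <eos>
        tokens.append(next)
    }
    return tokens
}
```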
If we use a prompt more like the Python side, <start_of_turn>user hello<end_of_turn><start_of_turn>model, we get these tokens:

```
[2, 235322, 2997, 235298, 559, 235298, 15508, 235313, 1645, 25612, 235322, 615, 235298, 559, 235298, 15508, 2577, 2997, 235298, 559, 235298, 15508, 235313, 2516]
```

and it generates:

```
Hello! How can I assist you today?<eos><eos>...
```
So you may have to adjust the prompt a bit to get what you want until we get a port of the Gemma tokenizer in swift-transformers.
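In the meantime, a hypothetical helper like this mirrors the template shown above (<bos>, id 2 in the token dumps, comes from the tokenizer itself):

```swift
// Wrap user text in Gemma's chat template so the encoded prompt matches what
// the Python side builds. Purely illustrative -- not part of llm-tool.
func gemmaPrompt(_ userText: String) -> String {
    "<start_of_turn>user\n\(userText)<end_of_turn>\n<start_of_turn>model\n"
}

// gemmaPrompt("hello")
// => "<start_of_turn>user\nhello<end_of_turn>\n<start_of_turn>model\n"
```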
And if I hard-code the prompt tokens to match Python (i.e., the numbers above), it produces:

```
Hello! 👋 It's nice to hear from you. What can I do for you today? 😊<eos>
```
Added the eosTokenId -- 3f02fcc
I think this is working about as well as it can without the specialized tokenizer.
OK to close?
Yes, thanks so much for the quick response.