ml-explore / mlx-swift-examples

Examples using MLX Swift


Gemma tokenizer issue

rudro opened this issue

There seems to be Gemma support (https://github.com/ml-explore/mlx-swift-examples/blob/main/Libraries/LLM/Gemma.swift). However, the tokenizer library (when using llm-tool) throws an unsupportedTokenizer error, as it has not been updated to support Gemma yet. Is there a way to use the Gemma model with a different tokenizer?

You could control which tokenizer is instantiated -- the code is here: https://github.com/ml-explore/mlx-swift-examples/blob/main/Libraries/LLM/Util.swift#L16

But I don't know if you can mix and match tokenizers like that. Ultimately I think it needs to be supported here: https://github.com/huggingface/swift-transformers

I looked, and although the model is ported, all of the configurations on Hugging Face use GemmaTokenizer.

I will look and see if a normal tokenizer can be used -- in Keras it looks like just a standard BPE tokenizer (I think).

I did try simply adding it as a BPE tokenizer by extending the list here: https://github.com/huggingface/swift-transformers/blob/main/Sources/Tokenizers/Tokenizer.swift#L251 but that did not work. There do seem to be some real differences (you get array out-of-bounds errors and such -- I did not look deeply).

Yes, it looks like it fails handling this token: "▁ह ै"

The code does this:

        for (i, item) in merges.enumerated() {
            let tuple = item.split(separator: " ").map { String($0) }
            let bp = BytePair(tuple: tuple)

It expects to be able to split on a space but there is no space in this token.

Ah, but Python thinks there is:

>>> s = "▁ह ै"
>>> s
'▁ह ै'
>>> s.split(" ")
['▁ह', 'ै']

Swift thinks not:

(lldb) po item.map { $0 }
▿ 3 elements
  - 0 : "▁"
  - 1 : "ह"
  - 2 : " ै"

https://unicodedecode.com also thinks there is a space.
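Both observations are consistent once you look at how Swift iterates a String. A minimal sketch, assuming the entry is made of the four scalars U+2581, U+0939, U+0020, U+0948 (inferred from the Python output above, not checked against the tokenizer file): Swift's String is a collection of Characters (extended grapheme clusters), and the space followed by the combining vowel sign is grouped into one Character, so no standalone space Character exists to split on.

```swift
// Assumption: the merge entry consists of these four Unicode scalars,
// with a real space (U+0020) followed by the combining vowel sign U+0948.
let merge = "\u{2581}\u{0939}\u{0020}\u{0948}"
let space: Unicode.Scalar = " "

// String iterates Characters (extended grapheme clusters).
let characterCount = merge.count
// The unicodeScalars view exposes every code point separately.
let scalarCount = merge.unicodeScalars.count

// The space is present as a scalar, but not as a Character of its own.
let hasSpaceScalar = merge.unicodeScalars.contains(space)
let hasSpaceCharacter = merge.contains(Character(" "))

print(characterCount, scalarCount, hasSpaceScalar, hasSpaceCharacter)
```

The three Characters correspond to the three elements in the lldb output above.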

OK, I added a commit where it will use a BPETokenizer and load the Gemma safetensors file. I don't think the tokenizer is working correctly.

As for the splitting of tokens, swift-transformers needs to use String.UnicodeScalarView.split(separator:). For now I have a workaround that discards the merge entries it would crash on, but ultimately it requires:

huggingface/swift-transformers#51
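To illustrate the idea, here is a rough sketch of splitting a merge entry at the Unicode scalar level and skipping entries that still do not yield two parts. `parseMerge` is a hypothetical helper written for this comment, not the code in swift-transformers or in the workaround commit, and the sample entries are made up apart from the Devanagari one.

```swift
/// Hypothetical helper (not part of swift-transformers): splits a merge entry
/// on U+0020 at the Unicode scalar level. Returns nil when the entry does not
/// yield exactly two parts, so callers can skip it instead of crashing.
func parseMerge(_ item: String) -> (String, String)? {
    let space: Unicode.Scalar = " "
    let parts = item.unicodeScalars
        .split(separator: space)
        .map { scalars in String(String.UnicodeScalarView(scalars)) }
    guard parts.count == 2 else { return nil }
    return (parts[0], parts[1])
}

// Illustrative entries: the problematic one from this thread, a plain ASCII
// pair, and a malformed entry with no separator.
let merges = ["\u{2581}\u{0939} \u{0948}", "h e", "broken"]

// compactMap drops the entries that cannot be parsed.
let pairs = merges.compactMap(parseMerge)
for (left, right) in pairs {
    print(left.unicodeScalars.count, right.unicodeScalars.count)
}
```

Splitting on scalars recovers both halves of the Devanagari entry, while the nil path mirrors the "discard what would crash" approach for anything that is still malformed.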

OK, this works better now -- the RMSNorm wasn't the correct implementation.

A few things to note: given the prompt "hello", the Python implementation builds a different prompt:

[2, 106, 1645, 108, 17534, 107, 108, 106, 2516, 108]
<bos><start_of_turn>user
hello<end_of_turn>
<start_of_turn>model

produces:

Hello! 👋 It's wonderful to hear from you. How can I assist you today?

Swift, given the same "hello" prompt produces these tokens:

[2, 17534]

which gives:

! 👋


I'm so happy to hear that you're interested in learning more about [topic]. I'm here to help you find the information you need and answer any questions you may have.


What would you like to learn more about today?<eos><eos>The context does not provide any information about what the topic is, so I cannot answer this question from the provided context.<eos><eos><eos>The context does not provide any information about what the topic is, so I cannot answer this

It should have stopped at <eos>, but as noted in #4 the tokenizer doesn't expose that (I'll see what I can do about that).
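For reference, the stopping rule itself is simple once the ID is known. A minimal sketch, assuming the end-of-sequence ID is available as a plain Int; `truncate(_:atEOS:)` is a hypothetical helper and the IDs below are placeholders, not values read from the Gemma configuration.

```swift
/// Hypothetical helper: keeps the tokens that come before the first
/// end-of-sequence ID and drops everything from that point on.
func truncate(_ tokens: [Int], atEOS eosTokenId: Int) -> [Int] {
    Array(tokens.prefix(while: { $0 != eosTokenId }))
}

// Placeholder IDs for illustration only; the real EOS ID comes from the
// model's configuration.
let eos = 1
let generated = [17534, 235341, eos, eos, 651]
print(truncate(generated, atEOS: eos))
```

In a streaming generation loop the same check would be a `break` on the first token equal to the EOS ID rather than a truncation after the fact.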

If we use a prompt more like the Python side: <start_of_turn>user hello<end_of_turn><start_of_turn>model we get these tokens:

[2, 235322, 2997, 235298, 559, 235298, 15508, 235313, 1645, 25612, 235322, 615, 235298, 559, 235298, 15508, 2577, 2997, 235298, 559, 235298, 15508, 235313, 2516]

and it generates:

Hello! How can I assist you today?<eos><eos>...

So you may have to adjust the prompt a bit to get what you want until we get a port of the Gemma tokenizer in swift-transformers.
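One way to do that adjustment is to wrap the user text in the turn markers from the Python output above before passing it to the tool. A minimal sketch: `gemmaChatPrompt` is a hypothetical helper, and it leaves out `<bos>` on the assumption that the tokenizer prepends it (the Swift token lists above start with 2). Judging by the token list above, the markers appear to be split into ordinary pieces rather than mapped to single special-token IDs, so this only approximates the Python prompt.

```swift
/// Hypothetical helper: wraps a user message in the turn markers shown in the
/// Python output. <bos> is omitted on the assumption that the tokenizer adds it.
func gemmaChatPrompt(_ userMessage: String) -> String {
    "<start_of_turn>user\n\(userMessage)<end_of_turn>\n<start_of_turn>model\n"
}

print(gemmaChatPrompt("hello"))
```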

And if I hard-code the prompt tokens to match Python (i.e. the numbers), it generates:

Hello! 👋 It's nice to hear from you. What can I do for you today? 😊<eos>
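Hard-coding the tokens amounts to something like the sketch below. The IDs are inferred by lining up the Python token list above with its decoded text, so treat them as assumptions rather than documented constants; `GemmaID` and `chatTokens(userTokens:)` are hypothetical names.

```swift
// IDs inferred by aligning the Python token list in this thread with its
// decoded text; assumptions, not documented constants.
enum GemmaID {
    static let bos = 2
    static let startOfTurn = 106
    static let endOfTurn = 107
    static let newline = 108
    static let user = 1645
    static let model = 2516
}

/// Hypothetical helper: surrounds already-tokenized user text with the
/// special-token IDs for a single user turn followed by the model turn header.
func chatTokens(userTokens: [Int]) -> [Int] {
    let header = [GemmaID.bos, GemmaID.startOfTurn, GemmaID.user, GemmaID.newline]
    let footer = [GemmaID.endOfTurn, GemmaID.newline,
                  GemmaID.startOfTurn, GemmaID.model, GemmaID.newline]
    return header + userTokens + footer
}

// 17534 is the ID the Swift tokenizer produced for "hello" earlier in the thread.
print(chatTokens(userTokens: [17534]))
```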

Added the eosTokenId -- 3f02fcc

I think this is working about as well as it can without the specialized tokenizer.

OK to close?

Yes, thanks so much for the quick response.