Gemma tokenizer issue
rudro opened this issue
There seems to be Gemma support (https://github.com/ml-explore/mlx-swift-examples/blob/main/Libraries/LLM/Gemma.swift); however, the tokenizer library (when using llm-tool) throws an unsupportedTokenizer error, as it has not been updated to support Gemma yet. Is there a way to use the Gemma model with a different tokenizer?
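For reference, this is roughly how the failure surfaces when loading the tokenizer (the model id here is just an example):

```swift
import Tokenizers // swift-transformers

// The Gemma configs name "GemmaTokenizer" as the tokenizer_class, which
// swift-transformers doesn't know how to construct yet, so loading throws
// before any model code runs.
let tokenizer = try await AutoTokenizer.from(pretrained: "google/gemma-2b-it")
// throws: unsupportedTokenizer
```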
You could control which tokenizer gets instantiated -- the code is here: https://github.com/ml-explore/mlx-swift-examples/blob/main/Libraries/LLM/Util.swift#L16
But I don't know if you can mix and match tokenizers like that. Ultimately I think it needs to be supported here: https://github.com/huggingface/swift-transformers
I looked, and although the model is ported, all of the configurations on Hugging Face use GemmaTokenizer.
I will look and see if a normal tokenizer can be used -- in Keras it looks like just a standard BPE tokenizer (I think).
I did try simply adding it as a BPE tokenizer by extending the list here: https://github.com/huggingface/swift-transformers/blob/main/Sources/Tokenizers/Tokenizer.swift#L251, but that did not work -- there do seem to be some real differences (you get some array-out-of-bounds issues and such; I did not look deeply).
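For anyone following along, that change amounts to routing Gemma's tokenizer_class string to the existing BPE implementation -- roughly like this (illustrative names and shape only, not the library's exact code):

```swift
// Illustrative sketch of the tokenizer-class table in swift-transformers'
// Tokenizers/Tokenizer.swift; the real one maps the tokenizer_class string
// from tokenizer_config.json to a concrete implementation.
let tokenizerClasses: [String: Tokenizer.Type] = [
    "GPT2Tokenizer": BPETokenizer.self,
    // ... existing entries ...
    "GemmaTokenizer": BPETokenizer.self, // the addition that didn't work
]
```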
Yes, it looks like it fails handling this token: "▁ह ै"
The code does this (excerpted from the BPE tokenizer):

```swift
for (i, item) in merges.enumerated() {
    let tuple = item.split(separator: " ").map { String($0) }
    let bp = BytePair(tuple: tuple)
```

It expects to be able to split on a space, but there is no space in this token.
Ah, but Python thinks there is:

```python
>>> s = "▁ह ै"
>>> s
'▁ह ै'
>>> s.split(" ")
['▁ह', 'ै']
```
Swift thinks not:

```
(lldb) po item.map { $0 }
▿ 3 elements
  - 0 : "▁"
  - 1 : "ह"
  - 2 : " ै"
```
https://unicodedecode.com also thinks there is a space:
![image](https://private-user-images.githubusercontent.com/46639364/307919494-e128b6c3-ed9e-4577-81a5-160dd8544c39.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTkxNjQyMjksIm5iZiI6MTcxOTE2MzkyOSwicGF0aCI6Ii80NjYzOTM2NC8zMDc5MTk0OTQtZTEyOGI2YzMtZWQ5ZS00NTc3LTgxYTUtMTYwZGQ4NTQ0YzM5LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDA2MjMlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwNjIzVDE3MzIwOVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTY3ZGUwNGY4Y2E0OGViNjY0OWRkNTJjNjgxYjY0YWE5YjVhNThmN2ZlOGI3ODI5YWFkOTg4NDZhMmYyZmNiMWYmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.2kfNzx9On3YyDcB7pQJPFKKQAyqBXmAd45zZguuJ7EA)
OK, I added a commit where it will use a BPETokenizer and load the Gemma safetensors file. I don't think the tokenizer is working correctly, though.
As for the splitting of tokens, swift-transformers needs to use String.UnicodeScalarView.split(separator:) so the split happens on Unicode scalars (as in Python) rather than on grapheme clusters. For now I have a workaround that discards the merge tokens it would crash on, but ultimately it needs that fix.
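A self-contained illustration of the difference, using the merge entry from above:

```swift
let item = "▁ह ै" // scalars: U+2581, U+0939, U+0020, U+0948 (a combining vowel sign)

// Character-based split: the space and the combining mark U+0948 coalesce into
// the single grapheme cluster " ै", so Swift finds no standalone space and
// returns the string unsplit.
let byCharacter = item.split(separator: " ").map { String($0) }
// byCharacter == ["▁ह ै"]

// Scalar-based split matches Python's code-point semantics.
let byScalar = item.unicodeScalars
    .split(separator: " ")
    .map { String($0) }
// byScalar == ["▁ह", "ै"]
```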
OK, this works better now -- the RMSNorm wasn't the correct implementation.
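For anyone hitting the same thing: my understanding is that Gemma applies its learned RMSNorm scale as (1 + weight) rather than weight, which is exactly the kind of mismatch that yields plausible-but-wrong output. A plain-Swift sketch of the two variants under that assumption (not the MLX code from the commit):

```swift
import Foundation

func rmsNorm(_ x: [Float], weight: [Float], eps: Float = 1e-6,
             gemmaStyle: Bool) -> [Float] {
    // normalize by the root mean square of the input
    let meanSquare = x.reduce(Float(0)) { $0 + $1 * $1 } / Float(x.count)
    let scale = 1 / (meanSquare + eps).squareRoot()
    return zip(x, weight).map { value, w in
        // Gemma stores the learned scale as an offset from 1, so apply (1 + w);
        // a standard RMSNorm applies w directly.
        value * scale * (gemmaStyle ? 1 + w : w)
    }
}
```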
A few things to note: given the prompt "hello", the Python implementation generates a different token sequence:

```
[2, 106, 1645, 108, 17534, 107, 108, 106, 2516, 108]
```

which decodes to:

```
<bos><start_of_turn>user
hello<end_of_turn>
<start_of_turn>model
```
and produces:

```
Hello! 👋 It's wonderful to hear from you. How can I assist you today?
```
Swift, given the same "hello" prompt, produces these tokens:

```
[2, 17534]
```

which gives:

```
! 👋
I'm so happy to hear that you're interested in learning more about [topic]. I'm here to help you find the information you need and answer any questions you may have.
What would you like to learn more about today?<eos><eos>The context does not provide any information about what the topic is, so I cannot answer this question from the provided context.<eos><eos><eos>The context does not provide any information about what the topic is, so I cannot answer this
```
It should have stopped at <eos>, but as noted in #4 the tokenizer doesn't expose that (I'll see what I can do about that).
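Once the id is exposed, the fix is just a check in the decode loop -- a minimal sketch with illustrative names (not the llm-tool API):

```swift
// Hypothetical decode loop: `nextToken` samples the next token id given the
// context so far, and generation halts as soon as the EOS id appears instead
// of rambling past it.
func generate(prompt: [Int], eosTokenId: Int, maxTokens: Int = 256,
              nextToken: ([Int]) -> Int) -> [Int] {
    var tokens = prompt
    for _ in 0..<maxTokens {
        let next = nextToken(tokens)
        if next == eosTokenId { break } // stop at <eos>
        tokens.append(next)
    }
    return tokens
}
```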
If we use a prompt more like the Python side, <start_of_turn>user hello<end_of_turn><start_of_turn>model, we get these tokens:

```
[2, 235322, 2997, 235298, 559, 235298, 15508, 235313, 1645, 25612, 235322, 615, 235298, 559, 235298, 15508, 2577, 2997, 235298, 559, 235298, 15508, 235313, 2516]
```

and it generates:

```
Hello! How can I assist you today?<eos><eos>...
```
So you may have to adjust the prompt a bit to get what you want until we get a port of the Gemma tokenizer in swift-transformers.
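In the meantime, a hypothetical helper like this mirrors the template shown above (<bos>, id 2 in the token dumps, comes from the tokenizer itself):

```swift
// Wrap user text in Gemma's chat template so the encoded prompt matches what
// the Python side builds. Purely illustrative -- not part of llm-tool.
func gemmaPrompt(_ userText: String) -> String {
    "<start_of_turn>user\n\(userText)<end_of_turn>\n<start_of_turn>model\n"
}

// gemmaPrompt("hello")
// => "<start_of_turn>user\nhello<end_of_turn>\n<start_of_turn>model\n"
```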
And if I hard-code the prompt tokens to match Python (i.e., the numbers above), it produces:

```
Hello! 👋 It's nice to hear from you. What can I do for you today? 😊<eos>
```
Added the eosTokenId -- 3f02fcc
I think this is working about as well as it can without the specialized tokenizer.
OK to close?
Yes, thanks so much for the quick response.