belladoreai / llama-tokenizer-js

JS tokenizer for LLaMA 1 and 2

Home Page: https://belladoreai.github.io/llama-tokenizer-js/example-demo/build/


CJK language

linonetwo opened this issue · comments

It seems CJK characters are not parsed properly.

Hey, thanks for reporting the issue! Did you find the problem with the library itself, or just with the demo site?

I am very unfamiliar with these languages, but I did try to reproduce the issue.

First I took a random sentence in Chinese: 每个人都有他的作战策略

Next I used the library to encode the sentence:

llamaTokenizer.encode("每个人都有他的作战策略")
> [1, 29871, 31951, 30502, 30313, 30769, 30417, 31221, 30210, 30732, 233, 139, 155, 234, 176, 153, 234, 152, 168]

Then I used the library to decode the tokens to see if I get the same sentence back:

llamaTokenizer.decode([1, 29871, 31951, 30502, 30313, 30769, 30417, 31221, 30210, 30732, 233, 139, 155, 234, 176, 153, 234, 152, 168])
> '每个人都有他的作战策略'

That seemed to work fine.

Then I compared the tokens from llama-tokenizer-js to the corresponding tokenization from the oobabooga-web-ui API (running Manticore). The token IDs were identical, so that seemed to work fine too.

Based on this quick test, I think the library itself is OK and you are reporting an issue concerning the demo site? If I input 略 into the text field on the demo site, I see an output of 5 tokens. The first two tokens are special tokens related to the beginning of input. The last 3 tokens are the result of tokenizing the character. The tokens are correctly represented as 3 distinct colors, because they are 3 tokens, not 1. There is no good ASCII representation for these tokens, though, which means they are rendered as question marks (or possibly squares, depending on your platform). I'm guessing this is what you are reporting?

Sorry, so this might only happen on the demo site. It's like the "?"-mark Pokémon.

As long as this doesn't affect the production usage, I will close this issue.

Thanks for your explanation.

I improved the demo site now. If you input 略 you will see <0xE7><0x95><0xA5>, which is a nicer way of rendering those tokens than the previous ???.

Thanks, it works very well now.