belladoreai / llama-tokenizer-js

JS tokenizer for LLaMA 1 and 2

Home Page: https://belladoreai.github.io/llama-tokenizer-js/example-demo/build/


CJK language

linonetwo opened this issue · comments

It seems CJK characters are not parsed properly.

Hey, thanks for reporting the issue! Did you find the problem with the library itself, or just with the demo site?

I am very unfamiliar with these languages, but I did try to reproduce the issue.

First I took a random sentence in Chinese: 每个人都有他的作战策略

Next I used the library to encode the sentence:

llamaTokenizer.encode("每个人都有他的作战策略")
> [1, 29871, 31951, 30502, 30313, 30769, 30417, 31221, 30210, 30732, 233, 139, 155, 234, 176, 153, 234, 152, 168]

Then I used the library to decode the tokens to see if I get the same sentence back:

llamaTokenizer.decode([1, 29871, 31951, 30502, 30313, 30769, 30417, 31221, 30210, 30732, 233, 139, 155, 234, 176, 153, 234, 152, 168])
> '每个人都有他的作战策略'

That seemed to work fine.

Then I compared the tokens from llama-tokenizer-js to the corresponding tokenization from the oobabooga-web-ui API (running Manticore). The token IDs were identical, so that seemed to work fine too.

Based on this quick test, I think the library itself is OK and you are reporting an issue concerning the demo site? If I input 略 into the text field on the demo site, I see an output of 5 tokens. The first two tokens are special tokens related to the beginning of input. The last 3 tokens are the result of tokenizing the character. The tokens are correctly represented as 3 distinct colors, because they are 3 tokens, not 1. There is no good ASCII representation for these tokens, though, which means they are rendered as question marks (or possibly squares, depending on your platform). I'm guessing this is what you are reporting?

Sorry, so this might only happen on the demo site. It's like the "?"-mark Pokémon.

As long as this doesn't affect the production usage, I will close this issue.

Thanks for your explanation.

I improved the demo site now. If you input 略 you will see <0xE7><0x95><0xA5>, which is a nicer way of rendering those tokens than the previous ???.

Thanks, it works very well now.