InternLM / InternLM

Official release of InternLM2 7B and 20B base and chat models. 200K context support

Home Page: https://internlm.intern-ai.org.cn/


[Bug] internlm2 emits content with [UNUSED_TOKEN_145] at times

gaord opened this issue

Describe the bug

I am running a quantized internlm2-chat-20b with llama.cpp, using the prompt template as described here. Chatting works very well, but at times the model emits [UNUSED_TOKEN_145].
When I change the stop word from <|im_end|> to "[UNUSED_TOKEN_145]", every AI message gets an extra ending string appended (the <eoh> visible in the screenshot below).
[screenshot: chat responses ending with <eoh>]

BTW, the quantization was done with the workaround of disabling rope scaling.

It looks like there may be a bug in the model's configuration that causes this behavior.
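
For reference, here is a minimal sketch of how I pass both ending strings as stop words when serving the GGUF through llama-cpp-python; this is illustrative only, and the model filename, context size, and prompt are placeholders, not my exact setup:

```python
# Sketch: run the quantized model and stop on either ending form.
# Model path and settings below are placeholders.
from llama_cpp import Llama

llm = Llama(model_path="internlm2-chat-20b-q4_k_m.gguf", n_ctx=4096)

prompt = (
    "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\nHello!<|im_end|>\n"
    "<|im_start|>assistant\n"
)

# Pass both ending strings as stop words so generation halts whether the model
# emits the new <|im_end|> marker or the legacy [UNUSED_TOKEN_145] surface form.
out = llm(prompt, max_tokens=256, stop=["<|im_end|>", "[UNUSED_TOKEN_145]"])
print(out["choices"][0]["text"])
```

Even with both stop words set, the extra ending string still shows up in the output at times.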

Environment

Mac m2 ultra
pytorch-lightning 2.1.0
torch 2.1.2
torchaudio 2.1.0
torchmetrics 1.2.0
torchvision 0.16.0

Other information

No response

This may be caused by the tokenizer config not being the latest version. Make sure the added_tokens_decoder in your tokenizer_config.json is the same as https://huggingface.co/internlm/internlm2-chat-20b/blob/main/tokenizer_config.json#L15
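
One way to check this locally is a quick script like the following (a sketch; the path is a placeholder for your local model directory):

```python
# Sketch: print the special-token entries from a local tokenizer_config.json
# so they can be compared against the Hugging Face file linked above.
import json

with open("path/to/internlm2-chat-20b/tokenizer_config.json") as f:
    cfg = json.load(f)

for token_id, entry in sorted(cfg.get("added_tokens_decoder", {}).items(),
                              key=lambda kv: int(kv[0])):
    print(token_id, entry["content"])
```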

Please also ensure that the special token id mapping of the tokenizer converted with llama.cpp is consistent with the following:

```json
{
  "<|plugin|>": 92538,
  "<|interpreter|>": 92539,
  "<|action_end|>": 92540,
  "<|action_start|>": 92541,
  "<|im_end|>": 92542,
  "<|im_start|>": 92543
}
```
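
A quick way to double-check the Hugging Face side of this mapping (a sketch, assuming a recent transformers with trust_remote_code enabled for InternLM2):

```python
# Sketch: verify that the Hugging Face tokenizer assigns the ids listed above.
from transformers import AutoTokenizer

expected = {
    "<|plugin|>": 92538,
    "<|interpreter|>": 92539,
    "<|action_end|>": 92540,
    "<|action_start|>": 92541,
    "<|im_end|>": 92542,
    "<|im_start|>": 92543,
}

tok = AutoTokenizer.from_pretrained("internlm/internlm2-chat-20b", trust_remote_code=True)
for token, token_id in expected.items():
    assert tok.convert_tokens_to_ids(token) == token_id, f"{token} id mismatch"
print("special token ids match")
```

If this passes but the GGUF still misbehaves, the mismatch was most likely introduced during the llama.cpp conversion step.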

In addition, the <eoh> ending word in the picture is strange: in both versions of the chat template, before and after the update, <eoh> has never been used. It should be either [UNUSED_TOKEN_145] or <|im_end|>. Is it possible that this <eoh> is set somewhere else, causing the model to predict it through few-shot learning? This is just my guess.
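
If you are assembling the prompt by hand, one option is to let the tokenizer's own chat template build it, so a stray <eoh> cannot come from the prompt itself. A sketch, assuming your tokenizer_config.json ships a chat_template and your transformers version supports apply_chat_template:

```python
# Sketch: build the InternLM2 chat prompt from the tokenizer's chat template
# instead of a hand-written template.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("internlm/internlm2-chat-20b", trust_remote_code=True)
messages = [{"role": "user", "content": "Hello!"}]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)  # should contain only <|im_start|>/<|im_end|> markers, never <eoh>
```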

What is your transformers version?
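
You can print it with, for example:

```python
# Report the library versions in the environment being used.
import torch
import transformers

print(transformers.__version__, torch.__version__)
```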

This issue is marked as stale because it has been marked as invalid or awaiting response for 7 days without any further response. It will be closed in 7 days if the stale label is not removed or if there is no further response.

This issue is closed because it has been stale for 7 days. Please open a new issue if you have similar issues or you have any new updates now.