ztxz16 / fastllm

A pure C++ cross-platform LLM acceleration library with Python bindings; ChatGLM-6B-class models reach 10000+ tokens/s on a single GPU; supports GLM, Llama, and MOSS base models; runs smoothly on mobile.


flm tokenizer produces different tokenization results from the original tokenizer

yiguanxian opened this issue · comments

Both chatglm2 and baichuan2 have this problem.

1. Model conversion method

    from fastllm_pytools import llm
    from transformers import AutoTokenizer, AutoModel

    hf_model = "/workspace/chatglm2-6B"

    flm_dtype = "int8"
    model_name = hf_model.split("/")[-1]
    flm_model = f"/workspace/models/{model_name}-fastllm-{flm_dtype}.flm"

    # Load the HF model, convert it to fastllm format, then save the .flm file
    tokenizer = AutoTokenizer.from_pretrained(hf_model, trust_remote_code=True)
    model = AutoModel.from_pretrained(hf_model, trust_remote_code=True).half().cuda()
    model = llm.from_hf(model, tokenizer, dtype=flm_dtype)
    model.save(flm_model)
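
For reference, the saved .flm can also be loaded back through fastllm_pytools; a minimal smoke-test sketch, assuming the llm.model() loader shown in the fastllm README:

    from fastllm_pytools import llm

    # Reload the exported file and run a quick sanity check
    model = llm.model("/workspace/models/chatglm2-6B-fastllm-int8.flm")
    print(model.response("你好"))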

2. Test code

    prompt_input = "[Round 1]"

    from transformers import AutoTokenizer
    model_path = "/workspace/chatglm2-6B"
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
    print(f"src prompt: {prompt_input}, token id: {tokenizer.encode(prompt_input)}")  # splits "Round" as "R" + "ound"

    import fastllm
    model_path = "/workspace/models/chatglm2-6B-fastllm-int8.flm"
    model = fastllm.create_llm(model_path)
    input_ids = model.weight.tokenizer.encode(prompt_input)
    input_ids = input_ids.to_list()
    input_ids = [int(v) for v in input_ids]
    print(f"fastllm prompt: {prompt_input}, token id: {input_ids}")  # splits "Round" as "Ro" + "und"
3. Test results

The original tokenizer splits the word "Round" into "R" and "ound", while flm splits it into "Ro" and "und". Likewise, with Baichuan2, for the input "你是可爱" the original tokenizer produces "你是" and "可爱", while the converted baichuan2 flm model produces "你", "是可", and "爱".
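A quick way to inspect the split on the HF side is to map the ids back to their surface tokens; a minimal sketch using the standard transformers convert_ids_to_tokens API (the exact token strings printed will depend on the model's vocabulary):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("/workspace/chatglm2-6B", trust_remote_code=True)
    ids = tokenizer.encode("[Round 1]")
    # Prints one surface form per id, making the "R" + "ound" split visible
    print(tokenizer.convert_ids_to_tokens(ids))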

The chatglm3 problem was caused by model.save() not saving the SentencePiece token weights; there is no such problem when going through torch2flm.toFile(). A fix has been made.
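
Until the fix reaches your version, the torch2flm export path mentioned above can be used directly; a sketch, noting that the function is spelled tofile in some fastllm versions and its signature may differ, so check fastllm_pytools/torch2flm.py in your checkout:

    from fastllm_pytools import torch2flm
    from transformers import AutoTokenizer, AutoModel

    hf_model = "/workspace/chatglm2-6B"
    tokenizer = AutoTokenizer.from_pretrained(hf_model, trust_remote_code=True)
    model = AutoModel.from_pretrained(hf_model, trust_remote_code=True).half()

    # Export through torch2flm, which serializes the SentencePiece vocabulary
    # together with its token weights, avoiding the model.save() issue above
    torch2flm.tofile("/workspace/models/chatglm2-6B-fastllm-int8.flm", model, tokenizer, dtype="int8")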